
Q: How to cluster within a cluster

Stack Overflow user

Asked 2018-04-11 20:35:45

2 answers, 594 views, 0 followers, 4 votes

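I have a set of points on a map, each with a given parameter value. I would like to:

1. Cluster them spatially, ignoring clusters with fewer than 10 points. My df should have a column (Clust) giving the cluster each point belongs to. [Done]
2. Sub-cluster the parameter values within each cluster, and add a column (subClust) to my df that classifies each point by sub-cluster.

I don't know how to do the second part, other than with a loop.

The image shows the spatially distributed set of points (top left), colour-coded by cluster, and the same points sorted by parameter value in the top-right plot. The bottom row shows the clusters with more than 10 points (left) and one facet per cluster sorted by parameter value (right). It is these facets that I would like to colour-code by sub-cluster, based on a minimum cluster separation distance (d=1).

Any pointers/help appreciated. My reproducible code is below.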

# TESTING
library(tidyverse)
library(gridExtra)

# Create a random (X, Y, Value) dataset
set.seed(36)
x_ex <- round(rnorm(200,50,20))
y_ex <- round(runif(200,0,85))
values <- rexp(200, 0.2)
df_ex <- data.frame(ID=1:length(y_ex),x=x_ex,y=y_ex,Test_Param=values)

# Cluster data by (X,Y) location
d = 4
chc <- hclust(dist(df_ex[,2:3]), method="single")

# Distance with a d threshold - used d=40 at one time but that changes...
chc.d40 <- cutree(chc, h=d) 
# max(chc.d40)

# Join results 
xy_df <- data.frame(df_ex, Clust=chc.d40)

# Plot results
breaks = max(chc.d40)
xy_df_filt <- xy_df %>% dplyr::group_by(Clust) %>% dplyr::mutate(n=n()) %>% dplyr::filter(n>10)# %>% nrow

p1 <- ggplot() +
  geom_point(data=xy_df, aes(x=x, y=y, colour = Clust)) +
  scale_color_gradientn(colours = rainbow(breaks)) +
  xlim(0,100) + ylim(0,100) 

p2 <- xy_df %>% dplyr::arrange(Test_Param) %>%
ggplot() +
  geom_point(aes(x=1:length(Test_Param),y=Test_Param, colour = Test_Param)) +
  scale_colour_gradient(low="red", high="green")

p3 <- ggplot() +
  geom_point(data=xy_df_filt, aes(x=x, y=y, colour = Clust)) +
  scale_color_gradientn(colours = rainbow(breaks)) +
  xlim(0,100) + ylim(0,100) 

p4 <- xy_df_filt %>% dplyr::arrange(Test_Param) %>%
ggplot() +
  geom_point(aes(x=1:length(Test_Param),y=Test_Param, colour = Test_Param)) +
  scale_colour_gradient(low="red", high="green") +
  facet_wrap(~Clust, scales="free")

grid.arrange(p1, p2, p3, p4, ncol=2, nrow=2)

This snippet does not work: I can't pipe inside dplyr mutate().

# Second Hierarchical Clustering: Try to sub-cluster by Test_Param within the individual clusters I've already defined above
xy_df_filt %>% # This part does not work
  dplyr::group_by(Clust) %>% 
  dplyr::mutate(subClust = hclust(dist(.$Test_Param), method="single") %>% 
                  cutree(, h=1))

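Below is a way to do it with a loop, but I would much rather learn how to do this with dplyr or some other non-looping method. An updated image showing the sub-clustered facets follows.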

sub_df <- data.frame()
for (i in unique(xy_df_filt$Clust)) {
  temp_df <- xy_df_filt %>% dplyr::filter(Clust == i)
  # Cluster this spatial cluster's points by Test_Param value
  a_d = 1
  a_chc <- hclust(dist(temp_df$Test_Param), method="single")

  # Cut the tree at height a_d to get the sub-cluster labels
  a_chc.d40 <- cutree(a_chc, h=a_d) 
  # max(chc.d40)

  # Join results to main df
  sub_df <- bind_rows(sub_df, data.frame(temp_df, subClust=a_chc.d40)) %>% dplyr::select(ID, subClust)
}
xy_df_filt_2 <- left_join(xy_df_filt,sub_df, by=c("ID"="ID"))

p4 <- xy_df_filt_2 %>% dplyr::arrange(Test_Param) %>%
ggplot() +
  geom_point(aes(x=1:length(Test_Param),y=Test_Param, colour = subClust)) +
  scale_colour_gradient(low="red", high="green") +
  facet_wrap(~Clust, scales="free")

grid.arrange(p1, p2, p3, p4, ncol=2, nrow=2)
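
As a side note on what the h=1 cut means for a single numeric column: with single linkage, two values end up in the same sub-cluster whenever they are connected by a chain of neighbours whose gaps are at most h. The sketch below checks this on made-up data, so `v`, `hc_labels`, and `gap_labels` are illustrative names only and not part of the code above. It compares the `hclust`/`cutree` labels against labels built directly from the gaps between sorted values.

```r
set.seed(1)
v <- rexp(30, 0.2)   # made-up parameter values, for illustration only
h <- 1

# Labels from single-linkage hierarchical clustering cut at height h
hc_labels <- cutree(hclust(dist(v), method = "single"), h = h)

# Labels built by hand: start a new group wherever the gap between
# consecutive sorted values is larger than h
ord <- order(v)
gap_id <- cumsum(c(TRUE, diff(v[ord]) > h))
gap_labels <- integer(length(v))
gap_labels[ord] <- gap_id

# The label numbers may differ, so compare the partitions instead:
# every (hc, gap) label pair must map one-to-one
pairs <- unique(data.frame(hc = hc_labels, gap = gap_labels))
same_partition <- nrow(pairs) == length(unique(hc_labels)) &&
  nrow(pairs) == length(unique(gap_labels))
same_partition
```

This only holds for one-dimensional data with `method = "single"`; other linkage methods or several columns need the full `hclust` call.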

dplyr

cluster-analysis

apply

hierarchical-clustering

2 Answers

Stack Overflow user

Accepted answer

Posted 2018-04-14 23:47:18

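There should be a way to do this with a combination of do and tidy, but I always have a hard time getting do to work the way I want. Instead, what I usually do is combine split from base R with map_dfr from purrr. split breaks the data up by Clust and gives you a list of data frames, which you can then map over. map_dfr maps over each data frame and returns a single data frame.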
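
To make the split and map_dfr pattern concrete, here is a minimal sketch on a tiny made-up data frame. The names `toy` and `toy_sub` and the values in them are assumptions for illustration; only the column names Clust, Test_Param, and subClust come from the question.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: two spatial clusters with three points each
toy <- tibble(
  Clust      = c(1, 1, 1, 2, 2, 2),
  Test_Param = c(0.1, 0.5, 3.0, 10, 12, 12.4)
)

toy_sub <- toy %>%
  split(.$Clust) %>%            # list with one data frame per Clust value
  map_dfr(function(df) {
    # Sub-cluster this piece on its own, then hand the piece back;
    # map_dfr row-binds the pieces into one data frame
    df$subClust <- cutree(hclust(dist(df$Test_Param), method = "single"), h = 1)
    df
  })

toy_sub
```

Within Clust 1, the values 0.1 and 0.5 are less than 1 apart and share a sub-cluster, while 3.0 sits on its own. Sub-cluster numbering restarts inside every Clust, so a sub-cluster is identified by the pair (Clust, subClust) rather than by subClust alone.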

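I started from your xy_df_filt and generated what I believe should be the same as the xy_df_filt_2 you got from your for loop. I made two plots, although with the two sets of clusters it is a little hard to see.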

xy_df_filt_2 <- xy_df_filt %>%
    split(.$Clust) %>%
    map_dfr(function(df) {
        subClust <- hclust(dist(df$Test_Param), method = "single") %>% cutree(., h = 1)

        bind_cols(df, subClust = subClust)
    })

ggplot(xy_df_filt_2, aes(x = x, y = y, color = as.factor(subClust), shape = as.factor(Clust))) +
    geom_point() +
    scale_color_brewer(palette = "Set2")

More clearly:

ggplot(xy_df_filt_2, aes(x = x, y = y, color = as.factor(subClust), shape = as.factor(Clust))) +
    geom_point() +
    scale_color_brewer(palette = "Set2") +
    facet_wrap(~ Clust)

Created on 2018-04-14 by the reprex package (v0.2.0).

Votes: 1

Stack Overflow user

Posted 2018-04-14 17:36:23

You can do this for your sub-clusters:

xy_df_filt_2 <- xy_df_filt %>% 
                group_by(Clust) %>% 
                mutate(subClust = tibble(Test_Param) %>% 
                                  dist() %>% 
                                  hclust(method="single") %>% 
                                  cutree(h=1))

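Nested pipes are fine. I think the problem with your version is that you were not passing the right object to dist. The tibble term is not needed if you pass only one column to dist, but I kept it in case you want to use several columns, as in the main clustering.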
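
A small sketch of why the object matters, on made-up data (`toy`, `res`, `n_dot`, and `n_col` are illustrative names, not from the question). Inside a piped, grouped mutate, `.` still refers to the whole data frame that was piped in, whereas a bare column name refers only to the rows of the current group.

```r
library(dplyr)

# Hypothetical toy data: group "a" has 2 rows, group "b" has 4 rows
toy <- tibble(
  g = c("a", "a", "b", "b", "b", "b"),
  v = c(1, 2, 10, 11, 20, 21)
)

res <- toy %>%
  group_by(g) %>%
  mutate(
    n_dot = length(.$v),  # `.` is the full piped-in data frame: all 6 rows
    n_col = length(v)     # bare `v` is just the current group's slice
  ) %>%
  ungroup()

res
```

So `dist(.$Test_Param)` in a grouped mutate clusters every row at once and returns a vector whose length does not match the group size, while `dist(Test_Param)` or `tibble(Test_Param) %>% dist()` works group by group.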

You could use the same formula (but without group_by) to compute xy_df from df_ex.
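
A sketch of what that could look like, assuming the df_ex construction and the d = 4 threshold from the question; this version is an illustration and was not posted in the original answer.

```r
library(dplyr)

# Rebuild the question's example data
set.seed(36)
x_ex <- round(rnorm(200, 50, 20))
y_ex <- round(runif(200, 0, 85))
values <- rexp(200, 0.2)
df_ex <- data.frame(ID = 1:length(y_ex), x = x_ex, y = y_ex, Test_Param = values)

# Same pipe as for the sub-clusters, but ungrouped and on both coordinates
xy_df <- df_ex %>%
  mutate(Clust = tibble(x, y) %>%
                 dist() %>%
                 hclust(method = "single") %>%
                 cutree(h = 4))

head(xy_df)
```

Here tibble(x, y) is what lets dist see both coordinate columns at once, which is the multi-column case the tibble term was kept for.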

Votes: 1

Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/49784082
