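I have a dataset in which different rows have different combinations of elements, and I want to extract the groups of rows that share the same combination. For this example dataset: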
id <- c("A", "B", "C", "D")
X1 <- c(NA,NA,NA,"X1")
X2 <- c(NA,NA,"X2","X2")
X3 <- c("X3","X3","X3","X3")
X4 <- c("X4", "X4", "X4", "X4")
df <- data.frame(id,X1,X2,X3,X4)
> df
id X1 X2 X3 X4
1 A <NA> <NA> X3 X4
2 B <NA> <NA> X3 X4
3 C <NA> X2 X3 X4
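4 D X1 X2 X3 X4
I would like to be able to pull out the groups of rows that share the same combination of elements.
I tried splitting the data frame into a list and dropping the empty cells, so that each id gets its own data.frame in the list: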
df.list <- split(df, seq(nrow(df)))
dfComplete.list <- lapply(df.list, function(remNA) remNA[,colSums(is.na(remNA)) < nrow(remNA)])
This gives me:
> dfComplete.list
$`1`
id X3 X4
1 1 X3 X4
$`2`
id X3 X4
2 2 X3 X4
$`3`
id X2 X3 X4
3 3 X2 X3 X4
$`4`
id X1 X2 X3 X4
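4 4 X1 X2 X3 X4
I'm stuck on where to go from here. Is there a way to group the data frames in the list based on the elements/columns they have in common?
The real dataset I'm working with actually has elements/columns X7 through X17, and each id has somewhere between 1 and 4 elements, so the ideal solution would identify every combination of elements present in the data.
Finally, before I reshaped the data into the format above, it originally came in the long form below, in case there is an easier way to get to a solution from the original format: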
id <- c("A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D")
elements <- c("X3", "X4", "X3", "X4", "X2", "X3", "X4", "X1", "X2", "X3", "X4")
dataLong <- data.frame(id, elements)
> dataLong
id elements
1 A X3
2 A X4
3 B X3
4 B X4
5 C X2
6 C X3
7 C X4
8 D X1
9 D X2
10 D X3
11 D X4
Thanks in advance for your help!
Answered 2018-05-02 21:08:07
The reshape2::dcast function can help convert the data from long format into the format the OP expects.
#Data
id <- c("A", "A", "B", "B", "C", "C", "C", "D", "D", "D", "D")
elements <- c("X3", "X4", "X3", "X4", "X2", "X3", "X4", "X1", "X2", "X3", "X4")
dataLong <- data.frame(id, elements, stringsAsFactors = FALSE)
library(reshape2)
library(magrittr)  # provides the %>% pipe used below
#Use dcast to get the result
dataLong %>% dcast(id ~ elements, value.var = "elements")
# id X1 X2 X3 X4
# 1 A <NA> <NA> X3 X4
# 2 B <NA> <NA> X3 X4
# 3 C <NA> X2 X3 X4
# 4 D X1 X2 X3 X4
Answered 2018-05-02 21:10:06
I understand that you want to count the unique combinations. Here is how I would do it:
library(dplyr)
library(tidyr)
dataLong %>% mutate(value=1) %>%
spread(elements, value) %>%
select(-id) %>%
group_by_all() %>%
summarise(count=n()) %>% ungroup()
#> # A tibble: 3 x 5
#> X1 X2 X3 X4 count
#> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 1 1 1 1 1
#> 2 NA 1 1 1 1
#> 3 NA NA 1 1 2
Answered 2018-05-02 21:32:05
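You can do this with the tidyverse! The use of arrange() is a bit redundant, but I wanted to show you this option because it orders your data to reflect the groupings you're interested in (you can think of it as a nested sort). That may be all you need.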
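To see what the nested sort looks like on its own, here is a minimal sketch that uses only `arrange()` from dplyr on the example data (the name `sorted` is just for illustration). Rows with the same combination end up next to each other, with `NA` values sorted last:

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  id = c("A", "B", "C", "D"),
  X1 = c(NA, NA, NA, "X1"),
  X2 = c(NA, NA, "X2", "X2"),
  X3 = c("X3", "X3", "X3", "X3"),
  X4 = c("X4", "X4", "X4", "X4"),
  stringsAsFactors = FALSE
)

# Sort by X1, then by X2 within X1, and so on; arrange() places NA last
sorted <- arrange(df, X1, X2, X3, X4)
sorted
```

This only reorders the rows. It does not label or count the groups.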
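If you want the actual counts, plus a column telling you which ids correspond to which combination, just run the full code below. Note that you will have to add all of your variables (X7:X17) to the full code. You will also want to use stringsAsFactors = FALSE when creating the data frame, which is generally good practice.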
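Typing out X7 through X17 by hand is error-prone. As a possible alternative, assuming dplyr 1.0.0 or later, `across(-id)` inside `group_by()` groups by every column except `id`, so no column names need to be listed. This is a sketch on the small example data, not the real dataset, and the name `combos` is just for illustration:

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  id = c("A", "B", "C", "D"),
  X1 = c(NA, NA, NA, "X1"),
  X2 = c(NA, NA, "X2", "X2"),
  X3 = c("X3", "X3", "X3", "X3"),
  X4 = c("X4", "X4", "X4", "X4"),
  stringsAsFactors = FALSE
)

# Group by every column except id, then count the rows and
# collect the ids that belong to each combination
combos <- df %>%
  group_by(across(-id)) %>%
  summarise(num = n(), id = paste(id, collapse = ","), .groups = "drop")
combos
```

The same call should work unchanged when the data frame has more element columns, as long as `id` is the only column that is not part of the combination.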
# Your example dataframe. Make sure to set stringsAsFactors = FALSE
id <- c("A", "B", "C", "D")
X1 <- c(NA,NA,NA,"X1")
X2 <- c(NA,NA,"X2","X2")
X3 <- c("X3","X3","X3","X3")
X4 <- c("X4", "X4", "X4", "X4")
df <- data.frame(id,X1,X2,X3,X4, stringsAsFactors = FALSE)
# We group rows by all unique combinations and then collapse those rows,
# while recording which ids belong to which grouping, and how many there are
# in each.
library(tidyverse)
ndf <- arrange(df, X1,X2,X3,X4) %>%
group_by(X1,X2,X3,X4) %>%
summarise(num = n(), id = paste(id, collapse=","))
# Output:
# A tibble: 3 x 6
# Groups: X1, X2, X3 [?]
X1 X2 X3 X4 num id
<chr> <chr> <chr> <chr> <int> <chr>
1 X1 X2 X3 X4 1 D
2 <NA> X2 X3 X4 1 C
3 <NA> <NA> X3 X4 2 A,B
https://stackoverflow.com/questions/50143471