
R: Efficiently find rows with nearly identical data and paste the differences into one cell

Stack Overflow user
Asked 2015-08-11 00:44:01
2 answers · 266 views · 0 followers · 0 votes

Suppose I have a data frame:

 # Build the rows as a character matrix, then convert to a data frame
 Data <- as.data.frame(rbind(
   c("Paul", 26, 150, "Helgason U", "Intro to Smooth Manifolds", "John Lee"),
   c("Paul", 26, 150, "Helgason U", "A Tale of Two Cities", "Charles Dickens"),
   c("Paul", 26, 150, "Helgason U", "Fear and Loathing in Las Vegas", "Hunter Thompson"),
   c("Paul", 26, 150, "Helgason U", "Gravity's Rainbow", "Thomas Pynchon"),
   c("David", 35, 165, "Turing College", "Brave New World", "Aldous Huxley"),
   c("David", 35, 165, "Turing College", "Vashista's Yoga", "Vashista"),
   c("David", 35, 165, "Turing College", "C++ For Dummies", "Anonymous")
 ), stringsAsFactors = FALSE)
 names(Data) <- c("Name", "Age", "Weight", "School", "Book", "Author")
 Data[c("Age", "Weight")] <- lapply(Data[c("Age", "Weight")], as.numeric)

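I want to condense the data so that all rows belonging to the same person fit into a single row, with the various books and authors joined together. In other words, I would like my output to be: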

    Name     Age     Weight     School     Books                          Authors
    Paul     26       150     Helgason U   Intro to Smooth Manifolds      John Lee
                                           A Tale of Two Cities           Charles Dickens
                                           Fear and Loathing in Las Vegas Hunter Thompson
                                           Gravity's Rainbow              Thomas Pynchon
    David    35       165   Turing College Brave New World                Aldous Huxley
                                           Vashista's Yoga                Vashista
                                           C++ For Dummies                Anonymous

Ideally, I would like the books to be concatenated as "Intro to Smooth Manifolds\nA Tale of Two Cities\nFear and Loathing in Las Vegas\nGravity's Rainbow"
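As a small illustration of that target format (a sketch only; the variable names `books` and `joined` are made up for this example), `paste()` with `collapse = "\n"` turns a vector of titles into a single newline-separated string:

```r
# Hypothetical vector holding one person's book titles
books <- c("Intro to Smooth Manifolds", "A Tale of Two Cities",
           "Fear and Loathing in Las Vegas", "Gravity's Rainbow")

# collapse = "\n" joins all elements into ONE string, separated by newlines
joined <- paste(books, collapse = "\n")

print(joined)  # print() shows the newline as an escaped \n
cat(joined)    # cat() renders each title on its own line
```

The string can be split back into the individual titles with `strsplit(joined, "\n", fixed = TRUE)[[1]]`.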

Initially I used a for loop, but it is far too slow because my actual data is much larger than this. To give an idea of how I was looping:

  for (i in 1:L){
    # All rows for the i-th unique name
    Names = subset(Data, Data$Name == unique(Data$Name)[i])
    rows = nrow(Names)

    # Rows whose identifying columns (Cols) are repeated within this person
    Name_Matches = which(duplicated(Names[,Cols]) | duplicated(Names[nrow(Names):1, Cols])[nrow(Names):1])
    Name_UnMtchs = setdiff(1:nrow(Names), Name_Matches)

    # Join the matching books and authors into newline-separated strings
    Books        = Names$Book[Name_Matches]
    New_Books    = paste(as.character(Books), collapse = "\n")
    Authors     = Names$Author[Name_Matches]
    New_Authors = paste(as.character(Authors), collapse = "\n")

    # Write one condensed row for this person
    New_Data[count_New, Cols]  = Names[Name_Matches[1], Cols]
    New_Data$Book[count_New]   = New_Books
    New_Data$Author[count_New] = New_Authors
    count_New                  = count_New + 1
    }

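Here Cols holds the column indices of the entries I know to be identical for one person (Age, Weight, School, Name), L is the number of unique names in the data frame, count_New is a counter initialized at 1, and New_Data is an empty data frame with the same columns as Data. What function can I use to merge the data without this kind of for loop?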


2 Answers

Stack Overflow user

Accepted answer

Posted 2015-08-11 01:08:54

This kind of thing can be done in base R, but it is easier with a package designed specifically for data wrangling.

In dplyr:

library(dplyr)

# Collapse each person's books and authors into one newline-separated string
Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books = paste(Book, collapse = "\n"),
            Authors = paste(Author, collapse = "\n"))
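Since the answer notes that base R can do this too, here is a rough base R sketch of the same grouping using `aggregate()`. It is not part of the original answer; the sample data is rebuilt here so the snippet runs on its own, and the name `collapsed` is chosen only for illustration.

```r
# Sample data rebuilt from the question so the snippet is self-contained
Data <- data.frame(
  Name   = c(rep("Paul", 4), rep("David", 3)),
  Age    = c(rep(26, 4), rep(35, 3)),
  Weight = c(rep(150, 4), rep(165, 3)),
  School = c(rep("Helgason U", 4), rep("Turing College", 3)),
  Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities",
             "Fear and Loathing in Las Vegas", "Gravity's Rainbow",
             "Brave New World", "Vashista's Yoga", "C++ For Dummies"),
  Author = c("John Lee", "Charles Dickens", "Hunter Thompson",
             "Thomas Pynchon", "Aldous Huxley", "Vashista", "Anonymous"),
  stringsAsFactors = FALSE
)

# Group by the identifying columns and paste the remaining columns together;
# collapse = "\n" is forwarded to paste() through aggregate()'s ... argument
collapsed <- aggregate(
  Data[c("Book", "Author")],
  by  = Data[c("Name", "Age", "Weight", "School")],
  FUN = paste, collapse = "\n"
)

print(collapsed)
```

The result has one row per person, with `Book` and `Author` each holding a single newline-separated string.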

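However, I suspect that this is what you actually want. Rather than pasting the book titles (and authors) into a single string per name, it stores them as a vector of titles in each cell, which you can then use for further processing.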

Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books=list(Book), Authors=list(Author))
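To show what "further processing" on such list columns might look like, here is a minimal sketch. It assumes dplyr 1.0.0 or later is installed (for the `.groups` argument); the sample data is rebuilt so the snippet runs on its own, and the names `res` and `pauls_books` are illustrative only.

```r
library(dplyr)

# Sample data rebuilt from the question so the snippet is self-contained
Data <- data.frame(
  Name   = c(rep("Paul", 4), rep("David", 3)),
  Age    = c(rep(26, 4), rep(35, 3)),
  Weight = c(rep(150, 4), rep(165, 3)),
  School = c(rep("Helgason U", 4), rep("Turing College", 3)),
  Book   = c("Intro to Smooth Manifolds", "A Tale of Two Cities",
             "Fear and Loathing in Las Vegas", "Gravity's Rainbow",
             "Brave New World", "Vashista's Yoga", "C++ For Dummies"),
  Author = c("John Lee", "Charles Dickens", "Hunter Thompson",
             "Thomas Pynchon", "Aldous Huxley", "Vashista", "Anonymous"),
  stringsAsFactors = FALSE
)

# One row per person; Books and Authors are list columns of character vectors
res <- Data %>%
  group_by(Name, Age, Weight, School) %>%
  summarise(Books = list(Book), Authors = list(Author), .groups = "drop")

# Number of books per person, taken directly from the list column
res$n_books <- lengths(res$Books)

# Pull out one person's titles as an ordinary character vector
pauls_books <- res$Books[[which(res$Name == "Paul")]]

# The newline-separated string can still be produced on demand
cat(paste(pauls_books, collapse = "\n"))
```

Keeping the titles as vectors avoids having to split the strings again later; the pasted form can be generated only at the point where it is displayed or exported.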
Votes: 3

Stack Overflow user

Posted 2015-08-11 04:43:13

Consider this base R solution (though it is neither efficient nor elegant):

# OBTAIN UNIQUE PERSONS DATAFRAME
Data2 <- unique(Data[1:4])
rownames(Data2) <- NULL

# PREPARE EMPTY RESULT COLUMNS
Data2$Book <- NA_character_
Data2$Author <- NA_character_

# GET VECTOR OF DISTINCT PERSONS
persons <- Data2$Name

# LOOP THROUGH DISTINCT PERSONS
for (p in persons) {
  # BOOK COLUMN (PULL INTO VECTOR, THEN CONCATENATE)
  books <- Data$Book[Data$Name == p]
  Data2$Book[Data2$Name == p] <- paste(books, collapse = "\n")

  # AUTHOR COLUMN (PULL INTO VECTOR, THEN CONCATENATE)
  authors <- Data$Author[Data$Name == p]
  Data2$Author[Data2$Name == p] <- paste(authors, collapse = "\n")
}
Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/31931422
