
R function to merge rows by id and create separate columns

Stack Overflow user
Asked 2021-01-12 15:58:54
1 answer, 81 views, 0 followers, score 1

I have a list of articles obtained from an API, and my data looks like this:

PMID        Year     Title                  Journal         Author 
33326729    2020     Avelumab Maintenance   PLoS biology    T., Powles
33326729    2020     Avelumab Maintenance   PLoS biology    B., Huang
33326729    2020     Avelumab Maintenance   PLoS biology    A., Di Pietro

I want to merge it into this:

PMID        Year     Title                  Journal         Author-1         Author-2     Author-3
33326729    2020     Avelumab Maintenance   PLoS biology    T., Powles       B., Huang    A., Di Pietro

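So, basically, I need each article to have its authors merged into a single row. I figured the way to collapse the rows by id is as follows: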

test <- setDT(PubMed_df)[, lapply(.SD, function(x) toString(na.omit(x))), by = "pmid"]

Outputs:
33326729    2020,2020,2020     Avelumab Maintenance,Avelumab Maintenance,Avelumab Maintenance   PLoS biology,PLoS biology,PLoS biology    T., Powles,B., Huang,A., Di Pietro

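However, this produces comma-separated values within each column rather than separate columns. Does anyone know a different function, or how to adjust the setDT call, to get the result I want? Thanks in advance.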

Edit: as requested, the output of dput(head(PubMed_df)):

structure(list(pmid = c("33326729", "33326729", "33326729", "33320856", 
"33320856", "33320856"), year = c("2020", "2020", "2020", "2021", 
"2021", "2021"), month = c("12", "12", "12", "01", "01", "01"
), day = c("21", "21", "21", "07", "07", "07"), lastname = c("Powles", 
"Huang", "di Pietro", "Reijns", "Thompson", "Acosta"), firstname = c("Thomas", 
"Bo", "Alessandra", "Martin A M", "Louise", "Juan Carlos"), address = c("St. Bartholomew's Hospital, London, United Kingdom thomas.powles1@nhs", 
"Pfizer, Groton, CT", "Pfizer, Milan, Italy", "MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom", 
"The South East of Scotland Clinical Genetic Service, Western General Hospital, NHS Lothian, Edinburgh, United Kingdom", 
"Cancer Research UK Edinburgh Centre, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom"
), journal = c("The New England journal of medicine", "The New England journal of medicine", 
"The New England journal of medicine", "PLoS biology", "PLoS biology", 
"PLoS biology"), title = c("Avelumab Maintenance for Urothelial Carcinoma. Reply.", 
"Avelumab Maintenance for Urothelial Carcinoma. Reply.", "Avelumab Maintenance for Urothelial Carcinoma. Reply.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection."
), abstract = c(NA, NA, NA, "", "", ""), doi = c("10.1056/NEJMc2032018", 
"10.1056/NEJMc2032018", "10.1056/NEJMc2032018", "10.1371/journal.pbio.3001030", 
"10.1371/journal.pbio.3001030", "10.1371/journal.pbio.3001030"
), keywords = c("Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2"
)), row.names = c(NA, 6L
), class = c("data.table", "data.frame"))

Edit 2: a very detailed and specific request:

I need to transform the data shown above so that each row has: PMID / publication date / author / author's address / city and state (if in the US).

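I will have to split up the address, but that is something to deal with later. For now, my goal is just to get all the information for each author attached to the correct article, without needing 3 rows for the same article.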

Edit 2 - regarding the answer from @r2evans: in my case, the provided answer works if you call dcast as data.table::dcast!


1 Answer

Stack Overflow user

Accepted answer

Posted 2021-01-12 16:54:27

This is mostly from Rui's comment, but it helps to add a helper column to get there (I'll use row here). Since you started with data.table, I'll stick with it.

Edited to handle the updated data. (I'm assuming pmid uniquely defines the groups.)

library(data.table)
setDT(PubMed_df)
PubMed_df[, row := seq_len(.N), by = .(pmid)]
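As an aside, the helper column is just a per-article counter (1, 2, 3, ... within each pmid). The sketch below uses a small made-up table (toy and row_alt are illustrative names, not part of the original data) to show that data.table's rowid() should give the same counter without a by= clause:

```r
library(data.table)

# Toy data modelled on the question; only two columns kept for brevity.
toy <- data.table(
  pmid     = c("33326729", "33326729", "33326729", "33320856", "33320856"),
  lastname = c("Powles", "Huang", "di Pietro", "Reijns", "Thompson")
)

# Grouped counter, as in the answer above.
toy[, row := seq_len(.N), by = .(pmid)]

# The same counter computed with rowid(), no by= needed.
toy[, row_alt := rowid(pmid)]

print(toy)
```

Either form works for the reshape that follows; which one reads better is a matter of taste.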

And to reshape it into wide format:

dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
       pmid   year  month    day                             journal                                   title abstract                          doi                                keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3                               address_1                               address_2                               address_3
     <char> <char> <char> <char>                              <char>                                  <char>   <char>                       <char>                                  <char>     <char>     <char>     <char>      <char>      <char>      <char>                                  <char>                                  <char>                                  <char>
1: 33320856   2021     01     07                        PLoS biology A sensitive and affordable multiplex...          10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ...     Reijns   Thompson     Acosta  Martin A M      Louise Juan Carlos MRC Human Genetics Unit, MRC Institu... The South East of Scotland Clinical ... Cancer Research UK Edinburgh Centre,...
2: 33326729   2020     12     21 The New England journal of medicine Avelumab Maintenance for Urothelial ...     <NA>         10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ...     Powles      Huang  di Pietro      Thomas          Bo  Alessandra St. Bartholomew's Hospital, London, ...                      Pfizer, Groton, CT                    Pfizer, Milan, Italy

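Note that papers with fewer authors than the maximum number of authors in the dataset will have empty/NA columns. For example, if I remove rows 5-6 and do the same thing,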

PubMed_df <- PubMed_df[1:4,]
dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
#        pmid   year  month    day                             journal                                   title abstract                          doi                                keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3                               address_1          address_2            address_3
#      <char> <char> <char> <char>                              <char>                                  <char>   <char>                       <char>                                  <char>     <char>     <char>     <char>      <char>      <char>      <char>                                  <char>             <char>               <char>
# 1: 33320856   2021     01     07                        PLoS biology A sensitive and affordable multiplex...          10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ...     Reijns       <NA>       <NA>  Martin A M        <NA>        <NA> MRC Human Genetics Unit, MRC Institu...               <NA>                 <NA>
# 2: 33326729   2020     12     21 The New England journal of medicine Avelumab Maintenance for Urothelial ...     <NA>         10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ...     Powles      Huang  di Pietro      Thomas          Bo  Alessandra St. Bartholomew's Hospital, London, ... Pfizer, Groton, CT Pfizer, Milan, Italy
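
If a single column per author is preferred (closer to the Author-1, Author-2, Author-3 layout sketched in the question), one possible variation is to paste the name parts together first and put the column label into the helper column. This is only a sketch on made-up toy data; the names toy, author, slot and wide are illustrative assumptions, not part of the original answer:

```r
library(data.table)

# Toy data modelled on the question (only a few columns kept).
toy <- data.table(
  pmid      = c("33326729", "33326729", "33326729", "33320856"),
  year      = c("2020", "2020", "2020", "2021"),
  lastname  = c("Powles", "Huang", "di Pietro", "Reijns"),
  firstname = c("Thomas", "Bo", "Alessandra", "Martin A M")
)

# Combine the name parts into one string per author.
toy[, author := paste(firstname, lastname)]

# Helper column: label each author's position within its article.
toy[, slot := paste0("Author_", seq_len(.N)), by = pmid]

# One row per article, one column per author slot; missing slots become NA.
wide <- data.table::dcast(toy, pmid + year ~ slot, value.var = "author")
print(wide)
```

The new columns are ordered by sorting the labels as text, so with ten or more authors Author_10 would sort before Author_2. Zero-padding the counter (for example with sprintf("Author_%02d", ...)) is one way around that.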
Score: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/65687467
