I have a list of articles obtained from an API, and my data looks like this:
PMID Year Title Journal Author
33326729 2020 Avelumab Maintenance PLoS biology T., Powles
33326729 2020 Avelumab Maintenance PLoS biology B., Huang
33326729 2020 Avelumab Maintenance PLoS biology A., Di Pietro
I want to merge it into this:
PMID Year Title Journal Author-1 Author-2 Author-3
33326729 2020 Avelumab Maintenance PLoS biology T., Powles B., Huang A., Di Pietro
So basically, I need each article to have its authors combined into a single row. My idea was to collapse the rows by ID, as follows:
test <- setDT(PubMed_df)[, lapply(.SD, function(x) toString(na.omit(x))), by = "pmid"]
Outputs:
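33326729 2020,2020,2020 Avelumab Maintenance,Avelumab Maintenance,Avelumab Maintenance PLoS biology,PLoS biology,PLoS biology T., Powles,B., Huang,A., Di Pietro
However, this produces comma-separated values inside each cell rather than separate columns. Does anyone know a different function, or how to adjust the setDT call to get the result I want? Thanks in advance.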
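To make the problem concrete, here is a minimal sketch of why the attempt above yields comma-joined cells: `toString()` pastes every value in a group into one string, so each column collapses to a single cell per `pmid`. The `toy` table and its values are illustrative assumptions, not the real API data.

```r
library(data.table)

# Toy data (illustrative values only): one article, three author rows
toy <- data.table(
  pmid   = c("1", "1", "1"),
  year   = c("2020", "2020", "2020"),
  author = c("T., Powles", "B., Huang", "A., Di Pietro")
)

# Same pattern as in the question: collapse every column within each pmid
collapsed <- toy[, lapply(.SD, function(x) toString(na.omit(x))), by = "pmid"]

# Each non-grouping column is now one comma-joined string per article
print(collapsed)
```

The grouping itself works; the collapse step is what needs to be swapped for a reshape that spreads the authors over columns.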
Edit: as requested, the output of dput(head(PubMed_df)):
structure(list(pmid = c("33326729", "33326729", "33326729", "33320856",
"33320856", "33320856"), year = c("2020", "2020", "2020", "2021",
"2021", "2021"), month = c("12", "12", "12", "01", "01", "01"
), day = c("21", "21", "21", "07", "07", "07"), lastname = c("Powles",
"Huang", "di Pietro", "Reijns", "Thompson", "Acosta"), firstname = c("Thomas",
"Bo", "Alessandra", "Martin A M", "Louise", "Juan Carlos"), address = c("St. Bartholomew's Hospital, London, United Kingdom thomas.powles1@nhs",
"Pfizer, Groton, CT", "Pfizer, Milan, Italy", "MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom",
"The South East of Scotland Clinical Genetic Service, Western General Hospital, NHS Lothian, Edinburgh, United Kingdom",
"Cancer Research UK Edinburgh Centre, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom"
), journal = c("The New England journal of medicine", "The New England journal of medicine",
"The New England journal of medicine", "PLoS biology", "PLoS biology",
"PLoS biology"), title = c("Avelumab Maintenance for Urothelial Carcinoma. Reply.",
"Avelumab Maintenance for Urothelial Carcinoma. Reply.", "Avelumab Maintenance for Urothelial Carcinoma. Reply.",
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.",
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.",
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection."
), abstract = c(NA, NA, NA, "", "", ""), doi = c("10.1056/NEJMc2032018",
"10.1056/NEJMc2032018", "10.1056/NEJMc2032018", "10.1371/journal.pbio.3001030",
"10.1371/journal.pbio.3001030", "10.1371/journal.pbio.3001030"
), keywords = c("Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms",
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms",
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms",
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2",
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2",
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2"
)), row.names = c(NA, 6L
), class = c("data.table", "data.frame"))
Edit 2: a very detailed and specific request:
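I need to transform the data shown above so that each row has: PMID, date of publication, and, for each author, the author, the author's address, and the city and state (if in the US).
I will have to split the address apart, but that is something to deal with later. For now, my goal is just to get all the information for each author attached to the correct article, without using 3 rows for the same article.
Edit 2, regarding the answer from @r2evans: in my case, the provided answer works if you call dcast as data.table::dcast!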
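As a small illustration of the namespace note in Edit 2: one plausible reason the bare `dcast` call misbehaves is that another attached package (for example reshape2) also exports a function named `dcast` and masks the data.table one. That cause is an assumption, since the post does not say which packages were loaded. The sketch below uses a made-up `toy` table and the helper-column idea from the answer, and calls the function through its namespace.

```r
library(data.table)

# Toy data mirroring a few of the question's columns (illustrative values only)
toy <- data.table(
  pmid      = c("1", "1", "2"),
  year      = c("2020", "2020", "2021"),
  lastname  = c("Powles", "Huang", "Reijns"),
  firstname = c("Thomas", "Bo", "Martin")
)

# Helper column: position of each author within its article
toy[, row := seq_len(.N), by = pmid]

# The explicit data.table:: prefix ensures the data.table version is used,
# even if another attached package exports its own dcast
wide <- data.table::dcast(
  toy,
  pmid + year ~ row,
  value.var = c("lastname", "firstname")
)
print(wide)
```

With several `value.var` columns, the resulting names combine the variable and the helper index (`lastname_1`, `firstname_2`, and so on), matching the output shown in the answer.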
Answered 2021-01-12 16:54:27
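This is mostly from Rui's comment, but it helps to add a helper column to get there (I'll use row here). Since you started with data.table, I'll stay with it.
Edited to handle the updated data. (I'm assuming pmid uniquely defines the group.)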
library(data.table)
setDT(PubMed_df)
PubMed_df[, row := seq_len(.N), by = .(pmid)]
And then reshape it to wide format:
dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
pmid year month day journal title abstract doi keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3 address_1 address_2 address_3
<char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char>
1: 33320856 2021 01 07 PLoS biology A sensitive and affordable multiplex... 10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ... Reijns Thompson Acosta Martin A M Louise Juan Carlos MRC Human Genetics Unit, MRC Institu... The South East of Scotland Clinical ... Cancer Research UK Edinburgh Centre,...
2: 33326729 2020 12 21 The New England journal of medicine Avelumab Maintenance for Urothelial ... <NA> 10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ... Powles Huang di Pietro Thomas Bo Alessandra St. Bartholomew's Hospital, London, ... Pfizer, Groton, CT Pfizer, Milan, Italy
Note that when a paper has fewer authors than the maximum number of authors in the dataset, its extra author columns will be empty/NA. For example, if I remove rows 5-6 and do the same thing,
PubMed_df <- PubMed_df[1:4,]
dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
# pmid year month day journal title abstract doi keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3 address_1 address_2 address_3
# <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char>
# 1: 33320856 2021 01 07 PLoS biology A sensitive and affordable multiplex... 10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ... Reijns <NA> <NA> Martin A M <NA> <NA> MRC Human Genetics Unit, MRC Institu... <NA> <NA>
# 2: 33326729 2020 12 21 The New England journal of medicine Avelumab Maintenance for Urothelial ... <NA> 10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ... Powles Huang di Pietro Thomas Bo Alessandra St. Bartholomew's Hospital, London, ... Pfizer, Groton, CT Pfizer, Milan, Italy
https://stackoverflow.com/questions/65687467
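If the goal is the single `Author-1`, `Author-2`, ... layout sketched at the top of the question, one possible variation on the answer is to paste first and last names together before casting. This is only a sketch under assumptions: the `authors` table is a trimmed-down stand-in for `PubMed_df`, and the `author` and `col` columns are hypothetical helpers introduced here, not part of the original data.

```r
library(data.table)

# Trimmed-down stand-in for PubMed_df (subset of the columns in the dput output)
authors <- data.table(
  pmid      = c("33326729", "33326729", "33326729", "33320856"),
  journal   = c(rep("The New England journal of medicine", 3), "PLoS biology"),
  lastname  = c("Powles", "Huang", "di Pietro", "Reijns"),
  firstname = c("Thomas", "Bo", "Alessandra", "Martin A M")
)

# Hypothetical helper columns: full author name and its target column label
authors[, author := paste(firstname, lastname)]
authors[, col := paste0("Author-", seq_len(.N)), by = pmid]

# One row per article, one column per author position; missing authors become NA.
# Labels are ordered alphabetically, so with 10+ authors zero-padded labels
# (e.g. "Author-01") would be needed to keep the columns in numeric order.
wide <- data.table::dcast(authors, pmid + journal ~ col, value.var = "author")
print(wide)
```

Because the new column names contain a hyphen, they have to be quoted with backticks when referenced, for example wide[, `Author-1`].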