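I am studying the collaboration network among research institutes through the papers and articles they have published together. Every article has a unique code that identifies it. If the same ArticleCode appears in the databases of two (or more) institutes, those institutes collaborated on that article.
This is how the dataset is organized:
Institute name | ArticleName | ArticleCode | Area | Pages | NumberofCitations | ...
I have 90 institutes in total, so 90 csv files. The result I need is a single table with the following information:
Institute#1 | Institute#2 | TotArticles | TotArea#1 | TotArea#2 | TotArea#3 |...
So, for every pair of institutes that have collaborated, I need their names, the total number of articles they published together (TotArticles), and the breakdown of that total by area in the remaining columns (5 areas in total: Arts & Humanities; Life Sciences & Biomedicine; Physical Sciences; Social Sciences; Technology).
At first I thought I could merge the csv files by ArticleCode, but I soon realized that to analyze every possible combination of institutes I would have to repeat the merge step about 4000 times (90 × 89 / 2 = 4005 pairs). A huge waste of time.
Maybe I could do it faster by first combining the publications of all institutes into a single csv and then manipulating that to obtain the final table. Adding up all the rows of the 90 csv files gives roughly 1,300,000 rows, and I do not know whether it is technically feasible to work with that.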
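Stacking the files first is a workable route: a table of about 1.3 million rows is usually well within what data.table can hold in memory. The following is a minimal sketch of that step only, not a tested pipeline for the real data. The folder name `institutes` and the two tiny files written into it are hypothetical stand-ins for the 90 real csv files.

```r
library(data.table)

# Hypothetical folder holding one csv per institute; a temp folder with two
# tiny files stands in for the real 90 files here.
csv_dir <- file.path(tempdir(), "institutes")
dir.create(csv_dir, showWarnings = FALSE)
fwrite(data.table(Institute = "In.AAA",
                  ArticleCode = c("articleX", "articleP"),
                  Area = c("Arts & Humanities", "Technology")),
       file.path(csv_dir, "AAA.csv"))
fwrite(data.table(Institute = "In.BBB",
                  ArticleCode = "articleX",
                  Area = "Arts & Humanities"),
       file.path(csv_dir, "BBB.csv"))

# Read every csv in the folder and stack the rows into one long table.
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
df <- rbindlist(lapply(files, fread), use.names = TRUE, fill = TRUE)
```

With `use.names = TRUE` and `fill = TRUE`, files whose columns are ordered differently or are partly missing are still stacked, with missing values filled in as NA.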
Below is an example.
I hope I have explained the problem clearly enough; if not, just let me know.
Starting from something like this...
Institute | ArticleCode | Area | Pages | ...
In.AAA | articleX | Arts & Humanities| 90 | ...
In.AAA | articleP | Technology | 10 | ...
In.BBB | articleZ | ... | 907 | ...
In.BBB | articleX | Arts & Humanities| 90 | ...
In.CCC | articleF | Arts & Humanities| 89 | ...
In.DDD | articleP | Technology | 10 | ...
In.DDD | articleX | Arts & Humanities| 90 | ...

to this:
Institute#1 | Institute#2 | TotArticles |Arts & Humanities | TotTechnology
In. AAA | In.BBB | 1 | 1 | 0
In. AAA | In.DDD | 2 | 1 | 1
In. BBB | In.DDD | 1 | 1 | 0

Posted on 2022-06-02 20:01:11

You can convert the Institute column to a (numeric) factor and then use data.table to join the table to itself with a non-equi join.
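To make the pairing idea concrete, here is a small sketch that uses a different, more explicit technique than the non-equi join: an ordinary self-merge on ArticleCode followed by a filter. It is an illustration under stated assumptions, not the answer's method. The sample rows are a reduced copy of the question's example, and the name `pairs` is introduced only for this sketch.

```r
library(data.table)

# Reduced copy of the question's example (rows without a partner are dropped).
df <- data.table(
  Institute   = c("In.AAA", "In.AAA", "In.BBB", "In.DDD", "In.DDD"),
  ArticleCode = c("articleX", "articleP", "articleX", "articleP", "articleX"),
  Area        = c("Arts & Humanities", "Technology", "Arts & Humanities",
                  "Technology", "Arts & Humanities")
)

# Pair every row with every row that shares its ArticleCode.
pairs <- merge(df, df[, .(ArticleCode, Institute2 = Institute)],
               by = "ArticleCode", allow.cartesian = TRUE)

# Keep each unordered pair once and drop an institute paired with itself.
pairs <- pairs[Institute < Institute2]

# Number of shared articles per pair of institutes.
counts <- pairs[, .N, by = .(Institute, Institute2)]
```

The code that follows does the pairing and the filtering in a single non-equi join and then reshapes the result by Area.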
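library(data.table)

# Convert df to a data.table by reference and make Institute a factor
setDT(df)[, Institute := as.factor(Institute)]

result = dcast(
  # Self-join on ArticleCode; the inequality keeps each pair of institutes once
  df[df[, Institute2 := Institute], on = .(ArticleCode, Institute > Institute2), nomatch = 0],
  # One row per pair, one column per Area, counting the matched articles
  Institute + Institute2 ~ Area, fun.aggregate = length
)[, TotArticles := rowSums(.SD), .SDcols = -c(1, 2)]  # row total over the Area columns

Output: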
Institute Institute2 Arts & Humanities Technology TotArticles
1: In.AAA In.BBB 1 0 1
2: In.AAA In.DDD 1 1 2
3: In.BBB In.DDD 1 0 1

Input:
structure(list(Institute = c("In.AAA", "In.AAA", "In.BBB", "In.BBB",
"In.CCC", "In.DDD", "In.DDD"), ArticleCode = c("articleX", "articleP",
"articleZ", "articleX", "articleF", "articleP", "articleX"),
Area = c("Arts & Humanities", "Technology", "...", "Arts & Humanities",
"Arts & Humanities", "Technology", "Arts & Humanities"),
Pages = c("90", "10", "907", "90", "89", "10", "90")), class = "data.frame", row.names = c(NA,
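-7L))

Update (6/6/22)
The OP now wants to add an extra column, such as Year, that gives the year of the article, and to separate the rows by that value. Only two small changes are needed:
1. Add Year to the dcast formula.
2. Change c(1,2) to c(1:3) for the columns excluded from rowSums (i.e. exclude the first three columns instead of the first two).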
setDT(df)[, Institute := as.factor(Institute)]

result = dcast(
  df[df[, Institute2 := Institute], on = .(ArticleCode, Institute > Institute2), nomatch = 0],
  # Year is now part of the row key, so counts are split by year
  Year + Institute + Institute2 ~ Area, fun.aggregate = length
)[, TotArticles := rowSums(.SD), .SDcols = -c(1:3)]  # skip Year, Institute, Institute2

Output:
Key: <Year, Institute, Institute2>
Year Institute Institute2 Arts & Humanities Technology TotArticles
<num> <fctr> <fctr> <int> <int> <num>
1: 2005 In.AAA In.BBB 1 0 1
2: 2005 In.AAA In.DDD 1 0 1
3: 2005 In.BBB In.DDD 1 0 1
4: 2006 In.AAA In.DDD 0 1 1

New input:
structure(list(Institute = c("In.AAA", "In.AAA", "In.BBB", "In.BBB",
"In.CCC", "In.DDD", "In.DDD"), ArticleCode = c("articleX", "articleP",
"articleZ", "articleX", "articleF", "articleP", "articleX"),
Area = c("Arts & Humanities", "Technology", "...", "Arts & Humanities",
"Arts & Humanities", "Technology", "Arts & Humanities"),
Pages = c("90", "10", "907", "90", "89", "10", "90"), Year = c(2005,
2006, 2005, 2005, 2005, 2006, 2005)), row.names = c(NA, -7L
), class = "data.frame")

https://stackoverflow.com/questions/72480210