I have to analyze scientific papers published in more than 20,000 journals. My list has over 450,000 records, but some of them are duplicates (e.g., a paper with multiple authors from different institutions can appear more than once).
I need to count the number of papers per journal, but the problem is that different authors do not always record the information in the same way, so I get data like the table below:
JOURNAL PAPER
0001-1231 A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231 A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231 THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231 THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231 COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231 COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231 AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING.
0001-1231 DECYCLING WITH A MATCHING
0001-1231 ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH
0001-1231 ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH.
0001-1232 DECISION TREE CLASSIFICATION WITH BOUNDED NUMBER OF ERRORS
0001-1232 AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH
0001-1232 AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH
0001-1232 COOPERATIVE CAPACITATED FACILITY LOCATION GAMES
0001-1232 OPTIMAL SUFFIX SORTING AND LCP ARRAY CONSTRUCTION FOR ALPHABETS
0001-1232 FAST MODULAR REDUCTION AND SQUARING IN GF (2 M )
0001-1232 FAST MODULAR REDUCTION AND SQUARING IN GF (2 M)
0001-1232 ON THE GEODETIC NUMBER OF COMPLEMENTARY PRISMS
0001-1232 DESIGNING MICROTISSUE BIOASSEMBLIES FOR SKELETAL REGENERATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS - ILLEGAL ALLOCATION
My goal is to use something like this:
data %>%
  distinct(JOURNAL, PAPER) %>%
  group_by(JOURNAL) %>%
  mutate(papers_in_journal = n())
So I would get something like this:
JOURNAL papers_in_journal
0001-1231 6
0001-1232 7
The problem is that, as you can see, some of the paper titles contain errors. Some have a trailing period; some have extra spaces or substituted symbols; others have other small variations, such as W[1]-HARD vs. W-HARD. So if I run the code as-is, what I actually get is:
JOURNAL papers_in_journal
0001-1231 10
0001-1232 10
My question is: when using distinct() or a similar command, is there any way to allow for a margin of similarity, so that I could use something like distinct(JOURNAL, PAPER, 0.95)?
In that sense, I would like the command to treat the following as equal:
A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
=
A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS

THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
=
THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH

DECYCLING WITH A MATCHING
=
DECYCLING WITH A MATCHING.
etc.
I suppose there is no simple solution like this using distinct(), and I could not find any alternative command that does it. So, if it is not possible, I would also appreciate any disambiguation algorithm you can suggest.
Thank you.
Posted on 2020-04-06 14:18:46
One option is to use agrep with lapply to find, for each article, the indices of articles in the same journal that differ by no more than max.distance (agrep's default, which you can change via the max.distance argument). Then take the first match for each article and vectorise with sapply, take the unique indices, get the length of that vector, and wrap the whole thing in tapply to count the number of "distinct" articles per journal.
tapply(data$PAPER, data$JOURNAL, FUN = function(x) {
  length(unique(sapply(lapply(x, function(y) agrep(y, x)), "[", 1)))
})
# 0001-1231 0001-1232
#         6         8
For a dplyr version that returns the results in a nicer format, I put the above code into a function and then used group_by() followed by summarise().
dissimilar <- function(x, distance = 0.1) {
  length(unique(sapply(lapply(x, function(y)
    agrep(y, x, max.distance = distance)), "[", 1)))
}
Here "dissimilar" is defined according to the documentation of agrep.
library(dplyr)
data %>%
  group_by(JOURNAL) %>%
  summarise(n = dissimilar(PAPER))
# A tibble: 2 x 2
  JOURNAL       n
  <chr>     <int>
1 0001-1231     6
2 0001-1232     8
However, for a much larger dataset, such as one containing thousands of journals and 450,000+ articles, the above is quite slow (about 10-15 minutes on my 2.50 GHz Intel). I realised that the dissimilar function needlessly compares every row with every other row, which makes little sense. Ideally, each row should be compared only with itself and the rows that remain unmatched. For example, the first journal contains 5 very similar articles in rows 8-12; agrep on row 8 already returns all 5 indices, so there is no need to compare rows 9-12 with anything further. I therefore replaced the lapply with a for loop, and the process now takes only 2-3 minutes on a 450,000-row dataset.
dissimilar <- function(x, distance = 0.1) {
  lst <- list()                 # initialise the list of match indices
  k <- 1:length(x)              # k = indices of PAPERs to compare against
  for (i in k) {                # i = each PAPER; k = itself and all remaining
    lst[[i]] <- agrep(x[i], x[k], max.distance = distance) + i - 1
    # + i - 1 ensures that the original index in x is maintained
    k <- k[!k %in% lst[[i]]]    # remove elements which are similar
  }
  lst <- sapply(lst, "[", 1)    # take only the first match of each item
  length(na.omit(lst))          # count the number of remaining elements
}
Now extend the original example dataset so that it holds about 450,000 records across roughly 18,000 journals, each containing about 25 articles.
n <- 450000
data2 <- do.call("rbind", replicate(round(n/26), data, simplify = FALSE))[1:n, ]
data2$JOURNAL[27:n] <- rep(paste0("0002-", seq(1, n/25)), each = 25)[1:(n-26)]
data2 %>%
  group_by(JOURNAL) %>%
  summarise(n = dissimilar(PAPER))
# A tibble: 18,001 x 2
JOURNAL n
<chr> <int>
1 0001-1231 6 # <-- Same
2 0001-1232 8
3 0002-1 14
4 0002-10 14
5 0002-100 14
6 0002-1000 13
7 0002-10000 14
8 0002-10001 14
9 0002-10002 14
10 0002-10003 14
# ... with 17,991 more rows
The challenge is to find a way to speed this process up even further.
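One possible direction, sketched here as an idea rather than a benchmarked fix, is to normalise the titles before the fuzzy pass: upper-case them, strip punctuation, and collapse whitespace, then drop the now-exact duplicates with distinct() so that agrep only has to handle the genuinely fuzzy variants. PAPER_CLEAN is a helper column introduced purely for this illustration:
library(dplyr)
library(stringr)

data2 %>%
  mutate(PAPER_CLEAN = PAPER %>%
           str_to_upper() %>%                       # normalise case
           str_replace_all("[[:punct:]]", " ") %>%  # replace periods, brackets, dashes
           str_squish()) %>%                        # collapse repeated whitespace
  distinct(JOURNAL, PAPER_CLEAN) %>%                # cheap exact deduplication first
  group_by(JOURNAL) %>%
  summarise(n = dissimilar(PAPER_CLEAN))
This removes variants such as "DECYCLING WITH A MATCHING." before the expensive comparisons run, shrinking the work agrep has to do inside each journal.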
Posted on 2020-04-06 14:22:59
You will need to use a package designed for natural language processing; try one of those.
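For example, the stringdist package (my choice for illustration; the answer above does not name a specific package) computes pairwise string distances that can be fed into hierarchical clustering, so that titles closer than a chosen cutoff collapse into a single cluster per paper. A minimal sketch, with a cutoff that would need tuning on real data:
library(stringdist)

# Count titles, treating those within `cutoff` Jaro-Winkler distance as one paper.
count_distinct_titles <- function(titles, cutoff = 0.1) {
  if (length(titles) < 2) return(length(titles))
  d  <- stringdistmatrix(titles, method = "jw")  # pairwise distances as a "dist" object
  cl <- hclust(d, method = "single")             # single-linkage clustering
  length(unique(cutree(cl, h = cutoff)))         # clusters merged below the cutoff
}
Such a function could then take the place of dissimilar() in the summarise() call shown in the answer above.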
https://stackoverflow.com/questions/61060408