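I have a large number of Word files imported into R as text (each report sits in a single cell), and each subject has an ID.
I then used the distinct function from dplyr to remove the duplicate files.
However, some reports are identical apart from minor differences (e.g. a few extra or missing words, extra spaces, etc.), so dplyr does not count them as duplicates. Is there an efficient way to remove "highly similar" items in R?
This creates a sample dataset (much simplified compared with the original data I am working with):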
d = structure(list(ID = 1:8, text = c("The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.",
"Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.",
"All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.",
"Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.",
"All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.",
"all plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
)), class = "data.frame", row.names = c(NA, -8L))
This is the dplyr code that removes exact duplicates. Note, however, that items 3, 7 and 8 are almost identical:
library(dplyr)
d %>%
  distinct(text, .keep_all = TRUE) %>%
  View()
It looks like dplyr has a like function, but I could not work out how to apply it here (and it seems to work only for short strings, e.g. single words): dplyr filter() with SQL-like %wildcard%
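There is also a package, tidystringdist, that computes the similarity between two strings, but I could not work out how to apply it here to remove items that are similar but not identical. https://cran.r-project.org/web/packages/tidystringdist/vignettes/Getting_started.html
Any suggestions or guidance on this?
Update:
It looks like the stringdist package might solve this, as suggested by the users below.
This question on the RStudio Community site deals with a similar problem, although the desired output is slightly different. I applied their code to my data and was able to identify the similar items. https://community.rstudio.com/t/identifying-fuzzy-duplicates-from-a-column/35207/2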
library(tidystringdist)
library(tidyverse)
# First remove any exact duplicates
# (no View() at the end, otherwise d would be overwritten with NULL):
d <- d %>%
  distinct(text, .keep_all = TRUE)
# This identifies the similar ones and places them in one vector called match:
match <- d %>%
  tidy_comb_all(text) %>%
  tidy_stringdist() %>%
  filter(soundex == 0) %>% # Set a threshold
  gather(x, match, starts_with("V")) %>%
  .$match
# create negate function of %in%:
`%!in%` = Negate(`%in%`)
# this will remove those in the `match` out of `d` :
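d2 <- d %>%
  filter(text %!in% match) %>%
  arrange(text)
With the code above, d2 contains none of the duplicates or near-duplicates at all, but I would like to keep one copy of each.
Any ideas on how to keep one copy (e.g. only the first occurrence)?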
Answered 2020-11-01 10:03:35
library(stringdist)
# Keep the unique texts as a character vector (the column is 'text', not 'test')
dd <- d[ !duplicated( d[['text']] ), 'text' ]
print(dd)
# --------------
[1] "The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method."
[2] "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength."
[3] "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
[4] "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
[5] "all plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
unname( sapply(dd, stringdist, dd, method="dl") )
#------------------
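     [,1] [,2] [,3] [,4] [,5]
[1,]    0  105  231  235  235
[2,]  105    0  234  238  238
[3,]  231  234    0   10    5
[4,]  235  238   10    0   13
[5,]  235  238    5   13    0
The distances depend on the string lengths, so the shorter strings show the largest distances, but an upper limit of 20 looks sufficient for this case. A proper solution should use some ratio of the "distance" to the nchar of the elements of that vector.
This is not offered as a final solution, but as steps 1 and 2.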
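One possible way to carry these two steps through to "keep only the first occurrence" is sketched below. It is an illustration under stated assumptions, not the answerer's code: keep_first_similar is a hypothetical helper, the 0.15 cutoff is arbitrary, and base R's utils::adist (plain Levenshtein distance) is swapped in for stringdist's "dl" method so that the sketch runs without extra packages. The distance is divided by the length of the longer string of each pair, which is the kind of ratio suggested above.

```r
# Hypothetical helper (not part of stringdist or dplyr): returns a logical
# vector that is TRUE for the first text of each group of near-duplicates.
keep_first_similar <- function(x, max_ratio = 0.1) {
  dist_mat <- utils::adist(x)                  # pairwise Levenshtein distances
  len_mat  <- outer(nchar(x), nchar(x), pmax)  # length of the longer string in each pair
  ratio    <- dist_mat / pmax(len_mat, 1)      # 0 = identical; pmax(., 1) avoids 0/0
  keep <- logical(length(x))                   # nothing kept yet
  for (i in seq_along(x)) {
    # keep x[i] only if no already-kept text lies within max_ratio of it
    keep[i] <- !any(ratio[i, keep] <= max_ratio, na.rm = TRUE)
  }
  keep
}

# Toy texts (shortened, made up for this sketch)
texts <- c("Plastics are usually poor conductors of heat.",
           "All plastics are polymers but not all polymers are plastic.",
           "All plastics are polymers however not all polymers are plastic.",
           "all plastics are polymers but not all polymers are plastic.",
           "Plastics are usually poor conductors of heat.")

keep <- keep_first_similar(texts, max_ratio = 0.15)
texts[keep]  # the first two texts remain
```

On the question's data this would be used as d[keep_first_similar(d$text, max_ratio = 0.15), ], which keeps the ID column alongside the text. The full distance matrix grows quadratically with the number of reports, so for a very large collection some form of blocking or batching would probably be needed first.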
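Answered 2020-11-01 08:56:32
I believe this is the package you are looking for: fuzzyjoin.
It offers many fuzzy distance functions; essentially, two entries count as "similar" if the fuzzy distance between them is small.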
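A minimal sketch of how fuzzyjoin might be applied here, assuming a self-join on the text column with stringdist_inner_join. The toy data frame, the max_dist = 5 cutoff and the "lv" (Levenshtein) method are illustrative choices, not values from the answer. Each row that lies within the cutoff of an earlier row (lower ID) is dropped, so only the first occurrence survives.

```r
library(dplyr)
library(fuzzyjoin)

# Toy data (made up for this sketch)
d_toy <- data.frame(
  ID   = 1:4,
  text = c("the cat sat on the mat",
           "a dog barked",
           "the cat sat on a mat",
           "the cat sat on the mat"),
  stringsAsFactors = FALSE
)

# Self-join: every pair of rows whose texts are within max_dist edits.
# Columns shared by both sides come back with .x / .y suffixes.
pairs <- stringdist_inner_join(d_toy, d_toy, by = "text",
                               max_dist = 5, method = "lv")

# A row is dropped if it is similar to any row with a smaller ID
drop_ids <- pairs %>%
  filter(ID.x < ID.y) %>%
  pull(ID.y) %>%
  unique()

d_dedup <- d_toy %>%
  filter(!ID %in% drop_ids)

d_dedup$ID  # rows 1 and 2 remain
```

Note that a row is dropped whenever it resembles any earlier row, even one that was itself dropped, so chains of gradually drifting texts collapse onto their first member. An absolute max_dist also treats long and short reports alike; a cutoff relative to string length may suit reports of very different sizes better.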
https://stackoverflow.com/questions/64626818