
Removing rows with similar (not identical) strings in R
Stack Overflow user
Asked 2020-11-01 06:01:22
2 answers · 114 views · 0 following · Score 1

I have a large number of Word files imported into R as text (each report in one cell), with an ID for each subject.

I then use the distinct function from dplyr to remove duplicate files.

However, some reports are identical except for a minor difference (e.g., a few extra/missing words, extra spaces, etc.), so dplyr does not count them as duplicates. Is there an efficient way to remove "highly similar" items in R?

This creates a sample dataset (much simplified compared to the original data I am working with):

d = structure(list(ID = 1:8, text = c("The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.", 
                                      "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.", 
                                      "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.", 
                                      "The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method.", 
                                      "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength.", 
                                      "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.", 
                                      "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains.", 
                                      "all plastics are polymers   but not all polymers are plastic. Plastic polymers consist of chains of linked   subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."
)), class = "data.frame", row.names = c(NA, -8L))

Here is the dplyr code that removes exact duplicates. However, you will notice that items 3, 7, and 8 are nearly identical.

library(dplyr)

d %>% 
  distinct(text, .keep_all = T) %>% 
  View()

It looks like there is a like function in dplyr, but I could not work out how to apply it here (and it seems to work only on short strings, e.g., single words): dplyr filter() with SQL-like %wildcard%

There is also a package, tidystringdist, that can compute the similarity of two strings, but I could not work out how to apply it here to remove similar-but-not-identical items. https://cran.r-project.org/web/packages/tidystringdist/vignettes/Getting_started.html

Any suggestions or guidance at this point?

Update:

It looks like the stringdist package may solve this, as suggested by the user below.

This question on the RStudio Community site deals with a similar problem, though the desired output is slightly different. I applied their code to my data and was able to identify the similar items. https://community.rstudio.com/t/identifying-fuzzy-duplicates-from-a-column/35207/2

library(tidystringdist)
library(tidyverse)

# First remove any exact duplicates
# (do not pipe into View() here, or d would be overwritten with NULL):
d <- d %>% 
  distinct(text, .keep_all = TRUE)

# this will identify the similar ones and place them in one vector called match: 
match <- d %>% 
  tidy_comb_all(text) %>% 
  tidy_stringdist() %>% 
  filter(soundex == 0) %>% # Set a threshold
  gather(x, match, starts_with("V")) %>% 
  .$match

# create negate function of %in%:

 `%!in%` = Negate(`%in%`)

# this will remove those in the `match` out of `d` :
d2 = d %>% 
  filter(text %!in% match) %>% 
  arrange(text)

With the code above, d2 has no duplicates/near-duplicates at all, but I would like to keep one copy of each.

Any ideas on how to keep one copy (e.g., only the first occurrence)?
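One way to keep only the first occurrence is a greedy pass with stringdist: walk the texts in order and keep a row only if it is more than some distance away from everything already kept. This is a sketch, not part of the question or answers; `max_dist = 20` and the helper name `keep_first_similar` are assumptions to tune against your own data.

```r
library(stringdist)

# Greedy de-duplication: keep a text only if it is farther than `max_dist`
# from every text already kept, so the first occurrence of each group of
# near-duplicates survives. max_dist = 20 is an assumed threshold.
keep_first_similar <- function(texts, max_dist = 20, method = "dl") {
  kept <- integer(0)
  for (i in seq_along(texts)) {
    if (length(kept) == 0 ||
        min(stringdist(texts[i], texts[kept], method = method)) > max_dist) {
      kept <- c(kept, i)
    }
  }
  kept
}

# Applied to the question's data frame:
# d2 <- d[keep_first_similar(d$text), ]
```

On the example data this keeps rows 1, 2, and 3 and drops the exact and near duplicates that follow them.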


2 Answers

Stack Overflow user
Answered 2020-11-01 10:03:35

library(stringdist)


dd <- d[['text']][ !duplicated( d[['text']] ) ]
dd
# --------------
[1] "The properties of plastics depend on the chemical composition of the subunits, the arrangement of these subunits, and the processing method."                                                                                                                                                                              
[2] "Plastics are usually poor conductors of heat and electricity. Most are insulators with high dielectric strength."                                                                                                                                                                                                          
[3] "All plastics are polymers but not all polymers are plastic. Plastic polymers consist of chains of linked subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."    
[4] "All plastics are polymers however not all polymers are plastic. Plastic polymers consist of chains of linked subunits named monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains." 
[5] "all plastics are polymers   but not all polymers are plastic. Plastic polymers consist of chains of linked   subunits called monomers. If identical monomers are joined, it forms a homopolymer. Different monomers link to form copolymers. Homopolymers and copolymers may be either straight chains or branched chains."

unname( sapply(dd, stringdist, dd, method="dl") )
#------------------
     [,1] [,2] [,3] [,4] [,5]
[1,]    0  105  231  235  235
[2,]  105    0  234  238  238
[3,]  231  234    0   10    5
[4,]  235  238   10    0   13
[5,]  235  238    5   13    0

The distances scale with string length, so longer strings can have larger maximum distances, but for this case an upper bound of 20 looks sufficient. A proper solution would use some ratio of the distance to nchar of the vector elements.
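The ratio idea can be sketched like this: divide each pairwise distance by the length of the longer string, so near-duplicates sit close to 0 regardless of how long the reports are. The helper name `similarity_ratio` and the 0.1 cutoff are assumptions, not part of the original answer.

```r
library(stringdist)

# Length-normalised distance matrix: each pairwise distance divided by the
# length of the longer of the two strings. The 0.1 cutoff is an assumption.
similarity_ratio <- function(texts, method = "dl") {
  dist_mat <- stringdistmatrix(texts, texts, method = method)
  len_mat  <- outer(nchar(texts), nchar(texts), pmax)
  dist_mat / len_mat
}

# For the deduplicated texts `dd` above:
# ratio <- similarity_ratio(dd)
# which(ratio < 0.1 & upper.tri(ratio), arr.ind = TRUE)  # near-duplicate pairs
```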

Offered not as a final solution, but as steps 1 and 2.

Score 1

Stack Overflow user
Answered 2020-11-01 08:56:32

I believe this package is what you are looking for: fuzzyjoin

It provides many fuzzy distance functions, but in essence two entries are "similar" if the fuzzy distance between them is small.
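A sketch of how the fuzzyjoin approach could look: self-join the table against itself with stringdist_inner_join, so any pair of rows whose texts fall within `max_dist` of each other comes back as a match, then drop the later member of each pair to keep the first occurrence. The toy data and `max_dist = 3` are assumptions for illustration; the answer itself does not prescribe them.

```r
library(dplyr)
library(fuzzyjoin)

# Toy data: rows 1 and 2 differ only by an extra space (an assumption
# standing in for the question's report texts).
d <- data.frame(
  ID   = 1:3,
  text = c("plastics are polymers",
           "plastics are  polymers",
           "most plastics insulate well")
)

# Self-join: every pair of rows within max_dist of each other is a match.
near_dupes <- stringdist_inner_join(
  d, d,
  by = "text", method = "dl", max_dist = 3, distance_col = "dist"
) %>%
  filter(ID.x < ID.y)   # list each similar pair once, dropping self-matches

# Keep the first occurrence of each group by dropping the later members:
d_clean <- d %>% filter(!ID %in% near_dupes$ID.y)
```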

Score 0
Page content originally from Stack Overflow; translation supported by Tencent Cloud Xiaowei's IT-domain engine.
Original link:

https://stackoverflow.com/questions/64626818
