Q: Is there a better way to classify narrations by keyword in R?

Stack Overflow user

Asked 2020-04-15 03:42:10


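I am trying to classify narrations against a dictionary of specific keywords. My approach is to match each narration to the keyword with the smallest string distance. This works well in general, but I have hit a case where it does not seem to fit. Here is a small piece of the code: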

#a is the narration and b(s) are some keywords
a = "PRAJA GHUPTA UTAMA Trf Inw RTGS PT BANK NEGARA INDONESIA (PERSERO) TBKPRAJA GHUPTA UTAMA"
b1 = "tarik"
b2 = "pajak"
b3 = "trf inw rtgs"

library(stringdist)  # provides stringdist()
dis1 = stringdist(tolower(a),b1,method = "jw") 
dis2 = stringdist(tolower(a),b2,method = "jw")
dis3 = stringdist(tolower(a),b3,method = "jw")

#Output 
> dis1
[1] 0.3810606

> dis2
[1] 0.3143939

> dis3
[1] 0.4406566

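As far as I understand, the stringdist function first recycles the shorter string to match the length of the longer one, and then computes the distance from the number of iterations needed to match the two strings.

What I don't understand is that b3 is a substring of narration a, yet its distance is not noticeably smaller than that of the other keywords.

Is there a reason for this, and what alternative approaches could I try to get better matching?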

text-mining

text-classification

fuzzy-search

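1 Answer

Stack Overflow user

Answered 2020-04-15 04:37:26

The key point here is that stringdist() operates on characters, whereas the question seems to be after similarity between words, so consider the following: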

# Note: this does not attempt to explain every nuance, only the word-versus-character aspect.
library(stringdist)

a = "PRAJA GHUPTA UTAMA Trf Inw RTGS PT BANK NEGARA INDONESIA (PERSERO) TBKPRAJA GHUPTA UTAMA"
b1 = "tarik"
b2 = "pajak"
b3 = "trf inw rtgs"
b4 = "PRAJA GHUPTA"   # same char. seq. as the start of a, nchar = 12 - higher score (uppercase, compared against tolower(a))
b5 = "PRAJA G"        # same char. seq. as the start of a, nchar = 7  - lower score (uppercase, compared against tolower(a))
b6 = "PRAJA G"        # identical to b5, so stringdist(b5, b6, method = "jw") = 0 as expected
b7 = "paa gua uaa"    # lowercase letters that all occur early in a
#After loading stringdist library
dis1 = stringdist(tolower(a),b1,method = "jw") 
dis2 = stringdist(tolower(a),b2,method = "jw")
dis3 = stringdist(tolower(a),b3,method = "jw")
dis4 = stringdist(tolower(a),b4,method = "jw")
dis5 = stringdist(tolower(a),b5,method = "jw")
dis6 = stringdist(b5,b6,method = "jw")

# b7 has nchar = 11 (9 letters plus 2 spaces) but only 4 unique letters (p, a, g, u), all available early in a
dis7 = stringdist(tolower(a),b7,method = "jw")   # second-lowest score overall, after dis6 = 0: 0.2916667


dis1;dis2;dis3;dis4;dis5; dis6; dis7

> dis1;dis2;dis3;dis4;dis5; dis6; dis7
[1] 0.3810606
[1] 0.3143939
[1] 0.4406566
[1] 0.635101
[1] 0.6152597
[1] 0
[1] 0.2916667

# other aspects are explained in the vignette /help pages
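
To see why the substring b3 does not come out closest, it may help to redo the arithmetic behind method = "jw". With the default p = 0 this is the Jaro distance, 1 - (m/|a| + m/|b| + (m - t)/m)/3, where m is the number of matched characters and t the number of transpositions. The sketch below uses base R only. `jaro_from_counts` is a hypothetical helper, and the m and t values are back-solved from the distances printed above, not reported by the package.

```r
# Jaro distance from match/transposition counts (hypothetical helper, base R only).
jaro_from_counts <- function(m, t, len_a, len_b) {
  1 - (m / len_a + m / len_b + (m - t) / m) / 3
}

a <- "PRAJA GHUPTA UTAMA Trf Inw RTGS PT BANK NEGARA INDONESIA (PERSERO) TBKPRAJA GHUPTA UTAMA"
len_a <- nchar(a)  # 88

# m and t below are inferred from the printed output, not taken from stringdist itself.
d_b2 <- jaro_from_counts(m = 5,  t = 0,   len_a, nchar("pajak"))         # all 5 characters matched, in order
d_b3 <- jaro_from_counts(m = 12, t = 5.5, len_a, nchar("trf inw rtgs"))  # all 12 matched, many out of order
d_b4 <- jaro_from_counts(m = 1,  t = 0,   len_a, nchar("PRAJA GHUPTA"))  # only the space matches lowercase a

# Even a perfect in-order match of b3 cannot get near 0, because m / len_a stays small.
best_case_b3 <- jaro_from_counts(m = 12, t = 0, len_a, nchar("trf inw rtgs"))

round(c(b2 = d_b2, b3 = d_b3, b4 = d_b4, b3_best = best_case_b3), 7)
```

Read this way, the measure rewards keywords whose individual characters can be found in order somewhere in the long string, regardless of word boundaries, which is presumably how pajak ends up closer than the true substring.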

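
Since the goal is to find a keyword inside a longer narration, not to compare two whole strings, one alternative worth trying is a substring-style match. The following is a minimal sketch using base R only, and it is not the approach used in the answer above: `grepl(fixed = TRUE)` tests exact containment, and `adist(partial = TRUE)` gives the smallest edit distance between the keyword and any substring of the narration. The names `keywords`, `exact`, `partial_dist` and `best` are illustrative.

```r
a <- tolower("PRAJA GHUPTA UTAMA Trf Inw RTGS PT BANK NEGARA INDONESIA (PERSERO) TBKPRAJA GHUPTA UTAMA")
keywords <- c("tarik", "pajak", "trf inw rtgs")

# Exact containment: is the keyword literally present in the narration?
exact <- vapply(keywords, function(k) grepl(k, a, fixed = TRUE), logical(1))

# Approximate containment: edit distance to the best-matching substring of a.
# adist() returns a keywords-by-narrations matrix; drop() flattens the single column.
partial_dist <- drop(adist(keywords, a, partial = TRUE))
names(partial_dist) <- keywords

# Pick the keyword with the smallest partial distance.
best <- keywords[which.min(partial_dist)]

exact
partial_dist
best
```

A partial distance of 0 means the keyword occurs verbatim. Small positive values allow for typos, so a threshold relative to keyword length may be needed before treating a result as a match.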

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/61215872
