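I am trying to classify narrations against a dictionary of keywords. My approach is to pick the keyword with the smallest string distance to the narration. This generally works well, but I have run into a case where it does not seem to fit. Here is a small snippet of the code: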
#a is the narration and b(s) are some keywords
a = "PRAJA GHUPTA UTAMA Trf Inw RTGS PT BANK NEGARA INDONESIA (PERSERO) TBKPRAJA GHUPTA UTAMA"
b1 = "tarik"
b2 = "pajak"
b3 = "trf inw rtgs"
# Load the stringdist library
library(stringdist)
dis1 = stringdist(tolower(a),b1,method = "jw")
dis2 = stringdist(tolower(a),b2,method = "jw")
dis3 = stringdist(tolower(a),b3,method = "jw")
#Output
> dis1
[1] 0.3810606
> dis2
[1] 0.3143939
> dis3
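[1] 0.4406566
As far as I understand, the stringdist function first recycles the shorter string to match the length of the longer one, and then computes the difference from the number of iterations needed to match the two strings.
What I don't understand is that b3 is a substring of the narration a, yet its distance is not noticeably smaller than that of the other keywords (it is in fact the largest of the three).
Is there a reason behind this, and what alternatives could I try to get a better match?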
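Posted on 2020-04-15 04:37:26
The key point here is that stringdist() works on characters, whereas the question seems to be after similarity between words, so consider the following: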
# Note this does not attempt to explain all nuances, but only the word versus character aspect:
a = "PRAJA GHUPTA UTAMA Trf Inw RTGS PT BANK NEGARA INDONESIA (PERSERO) TBKPRAJA GHUPTA UTAMA"
b1 = "tarik"
b2 = "pajak"
b3 = "trf inw rtgs"
b4 = "PRAJA GHUPTA" # prefix of a, nchar = 12; upper case, so only the space matches tolower(a) - higher distance than b5
b5 = "PRAJA G" # prefix of a, nchar = 7; again only the space matches tolower(a) - lower distance than b4
b6 = "PRAJA G" # same as b5, so stringdist(b5,b6,method = "jw") = 0 as expected
b7 = "paa gua uaa" # dis7 = stringdist(tolower(a),b7,method = "jw")
library(stringdist)
library(stringi)
library(stringr)
#After loading stringdist library
dis1 = stringdist(tolower(a),b1,method = "jw")
dis2 = stringdist(tolower(a),b2,method = "jw")
dis3 = stringdist(tolower(a),b3,method = "jw")
dis4 = stringdist(tolower(a),b4,method = "jw")
dis5 = stringdist(tolower(a),b5,method = "jw")
dis6 = stringdist(b5,b6,method = "jw")
# This uses b7, where nchar(b7) = 11 but there are only 4 distinct letters (p, a, g, u) plus the space - all available early in a
dis7 = stringdist(tolower(a),b7,method = "jw") # 2nd lowest distance after dis6: 0.2916667
dis1;dis2;dis3;dis4;dis5; dis6; dis7
> dis1;dis2;dis3;dis4;dis5; dis6; dis7
[1] 0.3810606
[1] 0.3143939
[1] 0.4406566
[1] 0.635101
[1] 0.6152597
[1] 0
[1] 0.2916667
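The distances above can be checked by hand. With its default p = 0, method = "jw" is the plain Jaro distance, 1 - (m/|a| + m/|b| + (m - t)/m)/3, where m is the number of matched characters and t the number of transpositions. The following is a minimal base R sketch. `jaro_from_counts` is a hypothetical helper, not part of stringdist, and the match counts are worked out by inspection: b4 and b5 are upper case, so only their single space can match the lower-cased a (m = 1, t = 0), while all 11 characters of b7 find a match in order (m = 11, t = 0).

```r
a <- "PRAJA GHUPTA UTAMA Trf Inw RTGS PT BANK NEGARA INDONESIA (PERSERO) TBKPRAJA GHUPTA UTAMA"

# Hypothetical helper: Jaro distance from the number of matched characters (m),
# the number of transpositions (t) and the two string lengths.
jaro_from_counts <- function(m, t, len_a, len_b) {
  1 - (m / len_a + m / len_b + (m - t) / m) / 3
}

len_a <- nchar(a)  # 88

d4 <- jaro_from_counts(m = 1,  t = 0, len_a, nchar("PRAJA GHUPTA"))  # only the space matches
d5 <- jaro_from_counts(m = 1,  t = 0, len_a, nchar("PRAJA G"))       # only the space matches
d7 <- jaro_from_counts(m = 11, t = 0, len_a, nchar("paa gua uaa"))   # every character matches

print(round(c(d4 = d4, d5 = d5, d7 = d7), 7))
# d4 = 0.6351010, d5 = 0.6152597, d7 = 0.2916667
```

These agree with dis4, dis5 and dis7. The large values for b4 and b5 come from the case mismatch, and b7 scores well simply because its individual characters occur early in a, not because it is a meaningful keyword.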
# other aspects are explained in the vignette / help pages
https://stackoverflow.com/questions/61215872
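As for alternatives: Jaro compares individual characters of the two whole strings, so it does not reward a keyword for appearing as a contiguous substring. One possible approach, sketched here as a suggestion rather than as the method the answer above proposes, is base R's adist() with partial = TRUE, which measures the edit distance between each keyword and its best-matching substring of the narration. Dividing by the keyword length is an assumption made for this sketch so that short keywords are not favoured.

```r
a <- "PRAJA GHUPTA UTAMA Trf Inw RTGS PT BANK NEGARA INDONESIA (PERSERO) TBKPRAJA GHUPTA UTAMA"
keywords <- c("tarik", "pajak", "trf inw rtgs")

# partial = TRUE: edit distance from each keyword to its closest substring of a.
# ignore.case = TRUE removes the need for tolower().
d <- drop(adist(keywords, a, partial = TRUE, ignore.case = TRUE))
names(d) <- keywords

# Assumed normalisation: edits per keyword character.
d_norm <- d / nchar(keywords)

best <- names(which.min(d_norm))
print(d_norm)
print(best)  # "trf inw rtgs" (an exact substring, so its distance is 0)
```

An exact, case-insensitive containment test such as grepl(keyword, tolower(a), fixed = TRUE) is a simpler option when fuzzy matching is not needed.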