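I am writing a Rails application that fetches RSS feeds from news pages, applies part-of-speech tagging to the headlines, and extracts the noun phrases from them along with the number of times each phrase occurs. I need to filter out noun phrases that are part of other noun phrases, and I am using the following code to do so: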
filtered_noun_phrases = sorted_noun_phrases.select{|a|
sorted_noun_phrases.keys.any?{|b| b != a and a.index(b) } }.to_h
So this:
{"troops retake main government office"=>2,
"retake main government office"=>2, "main government office"=>2}
should become just:
{"troops retake main government office"=>2}
However, the actual sorted hash of noun phrases looks like this:
{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
"retake main government office"=>2, "mosul retake government base"=>2,
"toddler killer shot dead"=>2, "students fighting racism"=>2,
"retake government base"=>2, "main government office"=>2,
"white house tourists"=>2, "horn at french zoo"=>2, "government office"=>2,
"cia hacking tools"=>2, "killer shot dead"=>2, "government base"=>2,
"boko haram teen"=>2, "horn chainsawed"=>2, "fighting racism"=>2,
"silver surfers"=>2, "house tourists"=>2, "natural causes"=>2,
"george michael"=>2, "instagram fame"=>2, "hacking tools"=>2,
"iraqi forces"=>2, "mosul battle"=>2, "own wedding"=>2, "french zoo"=>2,
"haram teen"=>2, "hacked tvs"=>2, "shot dead"=>2}
Instead of being fully filtered, it is only partially filtered:
{"troops retake main government office"=>2, "chinese students fighting racism"=>2,
"retake main government office"=>2, "mosul retake government base"=>2,
"toddler killer shot dead"=>2, "students fighting racism"=>2,
"retake government base"=>2, "main government office"=>2,
"white house tourists"=>2, "horn at french zoo"=>2,
"cia hacking tools"=>2, "killer shot dead"=>2,
"boko haram teen"=>2}
So how do I filter the duplicate substrings out of the hash in a way that actually works?
Posted on 2017-03-08 03:39:52
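What you are currently doing is selecting every phrase that has some other phrase as a substring.
For "troops retake main government office" this condition is true, because we find "retake main government office" inside it.
However, for "retake main government office" the condition is also true, because we still find "main government office" inside it, so that phrase is not filtered out.
For example, you could use this instead: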
filtered_noun_phrases = sorted_noun_phrases.reject{|a| sorted_noun_phrases.keys.any?{|b| b != a and b.index(a) } }.to_h
Here you reject every phrase for which some other phrase exists that contains it.
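To make the difference between the two conditions concrete, here is a minimal sketch that runs both versions on the three-phrase example from the question. The block variable names (`phrase`, `other`, `_count`) and the result names `selected` and `filtered` are just illustrative, and `include?` is used in place of `index` as an equivalent substring check.

```ruby
# Sample data: the three-phrase example from the question.
sorted_noun_phrases = {
  "troops retake main government office" => 2,
  "retake main government office" => 2,
  "main government office" => 2
}

# Original approach: keeps every phrase that CONTAINS some other phrase.
selected = sorted_noun_phrases.select do |phrase, _count|
  sorted_noun_phrases.keys.any? { |other| other != phrase && phrase.include?(other) }
end

# Suggested approach: drops every phrase that IS CONTAINED in some other phrase.
filtered = sorted_noun_phrases.reject do |phrase, _count|
  sorted_noun_phrases.keys.any? { |other| other != phrase && other.include?(phrase) }
end

p selected.keys
# ["troops retake main government office", "retake main government office"]
p filtered.keys
# ["troops retake main government office"]
```

Note that this is plain substring matching, so a phrase is also dropped when it only matches part of a word in a longer phrase. If that matters for your data, comparing on word boundaries would be a separate refinement.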
Posted on 2017-03-08 03:36:50
filtered_noun_phrases = sorted_noun_phrases.reject{|a| sorted_noun_phrases.keys.any?{|b| b != a and b.index(a) } }.to_h
https://stackoverflow.com/questions/42656592