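I am learning text mining in R and have had good success so far, but I am stuck on how to deal with plurals. That is, I want "nation" and "nations" to be counted as the same word, and ideally "dictionary" and "dictionaries" to be counted as the same word.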
x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries" to be counted as the same word.'

Posted on 2016-01-22 10:49:45
One possible solution. Here I use the pacman package to make the solution self-contained:
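# Install pacman if it is missing, then load it
if (!require("pacman")) install.packages("pacman"); library(pacman)
# Install (if needed) and load pluralize from GitHub, and quanteda from CRAN
p_load_gh('hrbrmstr/pluralize')
p_load(quanteda)
x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'
# Tokenize, flatten to a character vector, and singularize each token.
# tokenize() is the quanteda API as of this answer; later releases use tokens().
singularize(unlist(tokenize(x)))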
## [1] "\"" "nation" "\"" "and" "\"" "nation" "\""
## [8] "to" "be" "counted" "a" "the" "same" "word"
## [15] "and" "ideally" "\"" "dictionary" "\"" "and" "\""
## [22] "dictionary" "\""

Posted on 2021-04-06 15:45:49
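The SemNetCleaner package has a singularize function. It is slower than the pluralize package, but I have found that it handles nouns better. For example, Mars is not converted to Mar.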
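Whichever package does the singularizing, the counting step the question asks about comes down to calling table() on the singularized tokens. Below is a minimal base-R sketch of that step. Note that naive_singularize() is a hypothetical stand-in written for this example, with only two crude suffix rules; it is not the function from pluralize or SemNetCleaner, and it will mangle many irregular words. Swap in either package's singularize function for real use.

```r
# Hypothetical stand-in for a real singularizer: two crude suffix rules only.
naive_singularize <- function(words) {
  words <- sub("ies$", "y", words)                     # dictionaries -> dictionary
  long <- nchar(words) > 3                             # leave short words such as "as" alone
  words[long] <- sub("([^s])s$", "\\1", words[long])   # nations -> nation
  words
}

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries"'

# Lowercase, split on anything that is not a letter, and drop empty strings.
tokens <- strsplit(tolower(x), "[^a-z]+")[[1]]
tokens <- tokens[nzchar(tokens)]

# Count each word after singular and plural forms have been merged.
counts <- table(naive_singularize(tokens))
counts[c("nation", "dictionary")]
```

With this input, "nation" and "dictionary" each end up with a count of 2, because the plural forms are folded into the singular before counting.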
https://stackoverflow.com/questions/34938023