
Q: Quanteda: how to remove my own list of words

Stack Overflow user

Asked 2017-07-26 12:51:37

1 answer, 5.6K views, 0 followers, 6 votes

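Since quanteda has no ready-made stopword list for Polish, I would like to use my own. I have it in a text file as a space-separated list. If needed, I can also prepare a list separated by new lines.

How can I remove a long, custom list of stopwords from my corpus? And how can I do that after stemming?

I have tried creating various formats and converting them to character vectors, for example: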

stopwordsPL <- as.character(readtext("polish.stopwords.txt",encoding = "UTF-8"))
stopwordsPL <- read.txt("polish.stopwords.txt",encoding = "UTF-8",stringsAsFactors = F))
stopwordsPL <- dictionary(stopwordsPL)

I have also tried using such word vectors directly in the calls:

myStemMat <-
  dfm(
    mycorpus,
    remove = as.vector(stopwordsPL),
    stem = FALSE,
    remove_punct = TRUE,
    ngrams=c(1,3)
  )

dfm_trim(myStemMat, sparsity = stopwordsPL)

or

myStemMat <- dfm_remove(myStemMat,features = as.data.frame(stopwordsPL))

Nothing works. My stopwords still appear in the corpus and in the analysis. What is the correct way/syntax to apply custom stopwords?

text-mining

quanteda

1 Answer

Stack Overflow user

Accepted answer

Answered 2017-07-26 13:37:09

Assuming your polish.stopwords.txt is similar to this one, you should be able to remove the stopwords from your corpus easily this way:

stopwordsPL <- readLines("polish.stopwords.txt", encoding = "UTF-8")

dfm(mycorpus,
    remove = stopwordsPL,
    stem = FALSE,
    remove_punct = TRUE,
    ngrams=c(1,3))

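Your solution using readtext does not work because it reads the whole file in as a single document. To get individual words, you would need to tokenize it and coerce the tokens to character. readLines() is probably easier.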

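There is also no need to create a dictionary from stopwordsPL, since remove should take a character vector. Also, I'm afraid no Polish stemmer is implemented yet.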
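If the list stays space-separated, as described in the question, readLines() would return each line as one long string rather than one word per element. A minimal sketch, assuming the words are separated only by whitespace, is to read the file with base R's scan(). The temporary file and the sample words below are purely illustrative stand-ins for polish.stopwords.txt:

```r
# Hypothetical sample file standing in for polish.stopwords.txt:
# a handful of words separated by spaces on a single line
path <- tempfile(fileext = ".txt")
writeLines("i oraz ale lub nie w na", path)

# scan() splits on any whitespace (spaces, tabs, newlines) when sep is
# left at its default; quote = "" stops apostrophes from being treated
# as quote characters
stopwordsPL <- scan(path, what = character(), quote = "",
                    quiet = TRUE, encoding = "UTF-8")

# One element per word, ready to pass as a character vector
print(stopwordsPL)
print(length(stopwordsPL))
```

The same call also handles a newline-separated file, so the list does not need to be reformatted either way.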

Currently (v0.9.9-65), feature removal in dfm() does not get rid of the stopwords that are part of the n-grams. To work around this, try:

# form the tokens, removing punctuation
mytoks <- tokens(mycorpus, remove_punct = TRUE)
# remove the Polish stopwords, leave pads
mytoks <- tokens_remove(mytoks, stopwordsPL, padding = TRUE)
## can't do this next one since no Polish stemmer in 
## SnowballC::getStemLanguages()
# mytoks <- tokens_wordstem(mytoks, language = "polish")
# form the ngrams
mytoks <- tokens_ngrams(mytoks, n = c(1, 3))
# construct the dfm
dfm(mytoks)
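To see why the pads are left in before forming the n-grams, here is a small sketch on a toy sentence. The text and the two-word stopword list are made up for illustration, and the behaviour described is how tokens_remove() and tokens_ngrams() work in recent quanteda releases as far as I know, so check it against your installed version:

```r
library(quanteda)

# Illustrative toy text and stopword list (not from the original question)
txt <- "kot i pies"
stopwordsPL <- c("i", "oraz")

toks <- tokens(txt, remove_punct = TRUE)

# Without padding, removing "i" makes "kot" and "pies" adjacent,
# so a bigram can be formed across the gap
ng_nopad <- tokens_ngrams(tokens_remove(toks, stopwordsPL), n = 2)

# With padding, an empty placeholder stays where "i" was, which keeps
# "kot" and "pies" from being joined into a bigram
ng_pad <- tokens_ngrams(tokens_remove(toks, stopwordsPL, padding = TRUE), n = 2)

print(as.character(ng_nopad))
print(as.character(ng_pad))
```

Whether you want n-grams that span a removed stopword depends on the analysis; padding = TRUE is the conservative choice because it only keeps sequences that were adjacent in the original text.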

10 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/45327556
