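Recently I have been writing text-mining code in R, but I am having trouble with the data preprocessing step. I have a string like this:
"I want to buy 3D printer, but it costs 3000 dollars."
I want to keep the word "3D" but remove "3000", so the result should look like this:
"I want to buy 3D printer, but it costs dollars."
I am using corpus <- tm_map(corpus, removeNumbers), but this removes every digit in the text, so I end up with the term "D printer" in the result when it should be "3D printer".
Is there any way to solve this? Thanks!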
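The problem can be reproduced in base R without building a corpus. The sketch below assumes that removeNumbers behaves like a blanket digit strip; the pattern "[[:digit:]]+" and the variable names str1 and stripped are illustrative choices, not taken from the tm call above.

```r
# The example sentence from the question.
str1 <- "I want to buy 3D printer, but it costs 3000 dollars."

# Assumption: removeNumbers acts like deleting every run of digits.
# This removes "3000", but it also turns "3D" into "D".
stripped <- gsub("[[:digit:]]+", "", str1)

print(stripped)
```

Because the pattern matches digits wherever they occur, it cannot tell a standalone number such as "3000" from a digit that is part of a word such as "3D".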
Answered 2015-12-09 06:43:09
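We can use gsub:
str1 <- "I want to buy 3D printer, but it costs 3000 dollars."
gsub('3\\d+\\s', '', str1)
If a more general pattern is needed, match any run of digits that starts at a word boundary and is followed by whitespace. "3D" survives because its digit is followed by a letter, not whitespace:
gsub('\\b\\d+\\s', '', str1)
#[1] "I want to buy 3D printer, but it costs dollars."
Answered 2015-12-10 10:41:36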
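You can also use a text analysis package such as quanteda, which removes only standalone numbers, not digits that are part of a word. The code below uses the quanteda API as it was in 2015; as far as I know, later releases replaced tokenize() with tokens() and the removeNumbers argument with remove_numbers. So in your case: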
require(quanteda)
tokenize("I want to buy 3D printer, but it costs 3000 dollars.", removeNumbers = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
## [1] "I" "want" "to" "buy" "3D" "printer" "," "but" "it" "costs" "dollars" "."
If you want the result returned as a single character object without tokenization (although tokenization may be your goal anyway), then:
paste(tokenize("I want to buy 3D printer, but it costs 3000 dollars.",
removeNumbers = TRUE, simplify = TRUE, removeSeparators = FALSE),
collapse = "")
## [1] "I want to buy 3D printer, but it costs dollars."
https://stackoverflow.com/questions/34172253