Q: Finding 2- and 3-word phrases using the R tm package

Stack Overflow user

Asked 2012-01-18 00:53:34

7 answers · 30.4K views · 0 followers · 24 votes

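I am trying to find code that can find the most frequently used two- and three-word phrases with the R text mining package (or perhaps with another package that I don't know of). I have been trying to use the tokenizer, but without success.

If you have dealt with a similar situation in the past, could you post code that is tested and actually works? Thank you very much!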

data-mining

text-mining

7 Answers

Stack Overflow user

Answered 2012-01-18 11:17:48

You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have the tau package installed, it is fairly straightforward.

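library(tm); library(tau)

# n is the number of words per phrase; as.character() extracts the text when x is a tm document
tokenize_ngrams <- function(x, n=3) return(rownames(as.data.frame(unclass(textcnt(as.character(x),method="string",n=n)))))

texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
# VCorpus rather than Corpus: in recent tm versions Corpus() on a VectorSource
# builds a SimpleCorpus, for which custom tokenizers are ignored
corpus <- VCorpus(VectorSource(texts))
matrix <- DocumentTermMatrix(corpus,control=list(tokenize=tokenize_ngrams))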

Here, n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in the RTextTools package, which simplifies things further.

library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts,ngramLength=3)

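This returns an object of class DocumentTermMatrix for use with the tm package.

Votes: 11

Stack Overflow user

Answered 2013-05-30 11:52:47

This is part 5 of the tm package FAQ:

5. Can I use bigrams instead of single tokens in a term-document matrix?

Yes. RWeka provides a tokenizer for arbitrary n-grams which can be passed directly to the term-document matrix constructor. For example: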

  library("RWeka")
  library("tm")

  data("crude")

  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

  inspect(tdm[340:345,1:10])
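
Since the question asks for the most frequent phrases, the term-document matrix still has to be ranked by count. The following is a minimal sketch of one way to do that, not part of the FAQ: it assumes the tm package (which attaches NLP) is installed, swaps RWeka for NLP's ngrams() so that no Java is needed, and uses a tokenizer name, PhraseTokenizer, that is made up for illustration.

```r
library(tm)   # depends on NLP, which provides ngrams() and words()
library(NLP)

data("crude")  # sample corpus shipped with tm

# Illustrative tokenizer (name is an assumption): emits every run of
# 2 and of 3 consecutive words in a document as one phrase each.
PhraseTokenizer <- function(x) {
  w <- words(x)
  grams <- c(ngrams(w, 2L), ngrams(w, 3L))
  unlist(lapply(grams, paste, collapse = " "), use.names = FALSE)
}

tdm <- TermDocumentMatrix(crude, control = list(tokenize = PhraseTokenizer))

# Sum each phrase's count over all documents, most frequent first.
phrase_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(phrase_freq, 10)
```

For a large corpus, as.matrix() may use a lot of memory; tm's findFreqTerms(tdm, lowfreq = ...) lists the terms above a count threshold without densifying the matrix.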

Votes: 8

Stack Overflow user

Answered 2012-01-18 06:37:04

This is something I put together for a different purpose, but I think it may be applicable to your needs too:

#User Defined Functions
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)

breaker <- function(x) unlist(strsplit(x, "[[:space:]]|(?=[.!?*-])", perl=TRUE))

strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){
    strp <- function(x, digit.remove, apostrophe.remove){
        x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
        x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2
        ifelse(digit.remove==TRUE, gsub("[[:digit:]]", "", x2), x2)
    }
    unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove,
        apostrophe.remove = apostrophe.remove))))
}

unblanker <- function(x)subset(x, nchar(x)>0)

#Fake Text Data
x <- "I like green eggs and ham.  They are delicious.  They taste so yummy.  I'm talking about ham and eggs of course"

#The code using Base R to Do what you want
breaker(x)
strip(x)
words <- unblanker(breaker(strip(x)))
textDF <- as.data.frame(table(words))
textDF$characters <- sapply(as.character(textDF$words), nchar)
textDF2 <- textDF[order(-textDF$characters, textDF$Freq), ]
rownames(textDF2) <- 1:nrow(textDF2)
textDF2
subset(textDF2, characters%in%2:3)
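
Note that the final subset() keeps single words that are 2 or 3 characters long, not phrases of 2 or 3 words. To get word phrases in the same base-R style, adjacent words can be pasted together before tabulating. The sketch below is one possible way to do this; the helper word_ngrams and the sample text are assumptions introduced for illustration, not part of the answer above.

```r
# Hypothetical helper: every run of n consecutive words in a character vector.
word_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Sample text made up for this sketch so that some phrases repeat.
x <- "I like green eggs and ham. I do not like green eggs. Green eggs and ham taste yummy."

# Split into sentences first so phrases do not span sentence boundaries,
# then lower-case and split each sentence into words.
sentences <- unlist(strsplit(x, "[.!?]+\\s*"))
tokens <- lapply(sentences, function(s) {
  w <- unlist(strsplit(tolower(s), "[^a-z']+"))
  w[nzchar(w)]
})

# Collect the 2- and 3-word phrases of every sentence and tabulate them.
phrases <- unlist(lapply(tokens, function(w) c(word_ngrams(w, 2), word_ngrams(w, 3))))
phrase_freq <- sort(table(phrases), decreasing = TRUE)
head(phrase_freq, 10)
```

Here "green eggs" comes out on top with a count of 3. Splitting on sentences first is a design choice: without it, the last word of one sentence and the first word of the next would be counted as a phrase.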

Votes: 4

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/8898521
