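I am trying to run naive Bayes in R to make predictions from text data (by building a document-term matrix).
I have read several posts warning about terms that may be missing from either the training set or the test set, so I decided to build everything from a single data frame and split it afterwards. The code I am using is: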
data <- read.csv(file="path",header=TRUE)
########## NAIVE BAYES
library(e1071)
library(SparseM)
library(tm)
# CREATE DATA FRAME AND TRAINING AND
# TEST INCLUDING 'Text' AND 'InfoType' (columns 8 and 27)
traindata <- as.data.frame(data[13000:13999,c(8,27)])
testdata <- as.data.frame(data[14000:14999,c(8,27)])
complete <- as.data.frame(data[13000:14999,c(8,27)])
# SEPARATE TEXT VECTOR TO CREATE Source(),
# Corpus() CONSTRUCTOR FOR DOCUMENT TERM
# MATRIX TAKES Source()
completevector <- as.vector(complete$Text)
# CREATE SOURCE FOR VECTORS
completesource <- VectorSource(completevector)
# CREATE CORPUS FOR DATA
completecorpus <- Corpus(completesource)
# STEM WORDS, REMOVE STOPWORDS, TRIM WHITESPACE
completecorpus <- tm_map(completecorpus,tolower)
completecorpus <- tm_map(completecorpus,PlainTextDocument)
completecorpus <- tm_map(completecorpus, stemDocument)
completecorpus <- tm_map(completecorpus, removeWords,stopwords("english"))
completecorpus <- tm_map(completecorpus,removePunctuation)
completecorpus <- tm_map(completecorpus,removeNumbers)
completecorpus <- tm_map(completecorpus,stripWhitespace)
# CREATE DOCUMENT TERM MATRIX
completematrix<-DocumentTermMatrix(completecorpus)
trainmatrix <- completematrix[1:1000,]
testmatrix <- completematrix[1001:2000,]
# TRAIN NAIVE BAYES MODEL USING trainmatrix DATA AND traindata$InfoType CLASS VECTOR
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$InfoType),laplace=1)
# PREDICTION
results <- predict(model,as.matrix(testmatrix))
conf.matrix<-table(results, testdata$InfoType,dnn=list('predicted','actual'))
conf.matrix
The problem is that I get strange results like this:
         actual
predicted   1   2   3
        1  60 833 107
        2   0   0   0
        3   0   0   0
Any idea why this happens?
The raw data looks like this:
head(complete)
Text
13000 Milkshakes, milkshakes, whats not to love? Really like the durability and weight of the cup. Something about it sure makes good milkshakes.Works beautifully with the Cuisinart smart stick.
13001 excellent. shipped on time, is excellent for protein shakes with a cuisine art mixer. easy to clean and the mixer fits in perfectly
13002 Great cup. Simple and stainless steel great size cup for use with my cuisinart mixer. I can do milkshakes really easy and fast. Recommended. No problems with the shipping.
13003 Wife Loves This. Stainless steel....attractive and the best part is---it won't break. We are considering purchasing another one because they are really nice.
13004 Great! Stainless steel cup is great for smoothies, milkshakes and even chopping small amounts of vegetables for salads!Wish it had a top but still love it!
13005 Great with my. Stick mixer...the plastic mixing container cracked and became unusable as a result....the only downside is you can't see if the stuff you are mixing is mixed well
InfoType
13000 2
13001 2
13002 2
13003 3
13004 2
13005 2
Posted on 2016-05-01 17:30:15
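The problem seems to be that the document-term matrix was far too sparse and needed its sparsest terms removed. So I added:
completematrix<-removeSparseTerms(completematrix, 0.95)
and it started working!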
         actual
predicted   1   2   3
        1  60 511   6
        2   0  86   2
        3   0 236  99
Thanks everyone for the ideas (thanks Chelsea Hill!!)
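To make the 0.95 argument above concrete, here is a minimal base-R sketch of the idea behind removeSparseTerms: a term is dropped when the share of documents it is absent from reaches the threshold. The toy matrix, the 0.6 threshold, and the variable names are made up for illustration, and this is not tm's internal code.

```r
# Toy document-term counts: 4 documents (rows) x 3 terms (columns).
# All values here are invented for illustration.
dtm <- matrix(c(1, 0, 0,
                2, 0, 1,
                1, 0, 0,
                3, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("cup", "rare", "mixer")))

# Fraction of documents in which each term does NOT occur.
term_sparsity <- colMeans(dtm == 0)

# Keep only the terms whose sparsity is strictly below the threshold.
sparse <- 0.6
kept <- dtm[, term_sparsity < sparse, drop = FALSE]

print(term_sparsity)
print(colnames(kept))
```

With a threshold of 0.95, as in the fix above, only terms that occur in more than 5% of the documents survive, which shrinks the matrix passed to naiveBayes considerably.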
https://stackoverflow.com/questions/36959387
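One plausible contributing factor, offered as an assumption rather than a diagnosis of this data set: e1071's naiveBayes models numeric predictors as Gaussian within each class, and its laplace argument only affects categorical predictors, so very sparse count columns can produce degenerate estimates. An alternative that is often used for text, and is not what the answer above did, is to convert counts to presence/absence factors before training. The sketch below uses invented toy data and a hypothetical helper named to_presence.

```r
library(e1071)

# Toy term counts for 4 documents and their classes (invented data).
counts <- matrix(c(2, 0,
                   1, 0,
                   0, 3,
                   0, 1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(NULL, c("great", "broke")))
labels <- factor(c("pos", "pos", "neg", "neg"))

# Hypothetical helper: turn every count column into a No/Yes factor,
# so naiveBayes treats it as categorical and Laplace smoothing applies.
to_presence <- function(m) {
  as.data.frame(lapply(as.data.frame(m), function(col) {
    factor(col > 0, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  }))
}

train <- to_presence(counts)
model <- naiveBayes(train, labels, laplace = 1)

# Predict on the training rows just to show the call shape.
pred <- predict(model, train)
print(table(predicted = pred, actual = labels))
```

In the original code this would mean applying the same conversion to as.matrix(trainmatrix) and as.matrix(testmatrix) before calling naiveBayes and predict; whether it helps on the real data is untested here.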