
Problem with Naive Bayes

Stack Overflow user
Asked 2016-04-30 19:39:16
1 answer · 340 views · 0 followers · 0 votes

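I am trying to run Naive Bayes in R to make predictions from text data (by building a document-term matrix).

I have read several posts warning about terms that may be missing from either the training set or the test set, so I decided to build a single data frame and split it afterwards. The code I am using is: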

data <- read.csv(file="path",header=TRUE)

########## NAIVE BAYES
library(e1071)
library(SparseM)
library(tm)

# CREATE DATA FRAME AND TRAINING AND
# TEST INCLUDING 'Text' AND 'InfoType' (columns 8 and 27)
traindata <- as.data.frame(data[13000:13999,c(8,27)])
testdata <- as.data.frame(data[14000:14999,c(8,27)])
complete <- as.data.frame(data[13000:14999,c(8,27)])

# SEPARATE TEXT VECTOR TO CREATE Source(),
# Corpus() CONSTRUCTOR FOR DOCUMENT TERM
# MATRIX TAKES Source()
completevector <- as.vector(complete$Text)

# CREATE SOURCE FOR VECTORS
completesource <- VectorSource(completevector)

# CREATE CORPUS FOR DATA
completecorpus <- Corpus(completesource)

# STEM WORDS, REMOVE STOPWORDS, TRIM WHITESPACE
completecorpus <- tm_map(completecorpus, tolower)
completecorpus <- tm_map(completecorpus, PlainTextDocument)
completecorpus <- tm_map(completecorpus, stemDocument)
completecorpus <- tm_map(completecorpus, removeWords, stopwords("english"))
completecorpus <- tm_map(completecorpus, removePunctuation)
completecorpus <- tm_map(completecorpus, removeNumbers)
completecorpus <- tm_map(completecorpus, stripWhitespace)

# CREATE DOCUMENT TERM MATRIX
completematrix<-DocumentTermMatrix(completecorpus)
trainmatrix <- completematrix[1:1000,]
testmatrix <- completematrix[1001:2000,]

# TRAIN NAIVE BAYES MODEL USING trainmatrix DATA AND traindata$InfoType CLASS VECTOR
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$InfoType),laplace=1)

# PREDICTION
results <- predict(model,as.matrix(testmatrix))
conf.matrix<-table(results, testdata$InfoType,dnn=list('predicted','actual'))

conf.matrix

The problem is that I get strange results like this:

               actual
predicted    1   2   3
         1  60 833 107
         2   0   0   0
         3   0   0   0

Any idea why this is happening?

The raw data looks like this:

head(complete)

      Text
13000 Milkshakes, milkshakes, whats not to love? Really like the durability and weight of the cup. Something about it sure makes good milkshakes.Works beautifully with the Cuisinart smart stick.
13001 excellent. shipped on time, is excellent for protein shakes with a cuisine art mixer.  easy to clean and the mixer fits in perfectly
13002 Great cup. Simple and stainless steel great size cup for use with my cuisinart mixer.  I can do milkshakes really easy and fast. Recommended. No problems with the shipping.
13003 Wife Loves This. Stainless steel....attractive and the best part is---it won't break. We are considering purchasing another one because they are really nice.
13004 Great! Stainless steel cup is great for smoothies, milkshakes and even chopping small amounts of vegetables for salads!Wish it had a top but still love it!
13005 Great with my. Stick mixer...the plastic mixing container cracked and became unusable as a result....the only downside is you can't see if the stuff you are mixing is mixed well 

      InfoType
13000        2
13001        2
13002        2
13003        3
13004        2
13005        2

1 Answer

Stack Overflow user

Accepted answer

Posted 2016-05-01 17:30:15

The problem seems to have been that the document-term matrix was far too sparse and needed its sparse terms removed. So I added:

completematrix<-removeSparseTerms(completematrix, 0.95)

And it started working!

             actual
predicted   1   2   3
        1  60 511   6
        2   0  86   2
        3   0 236  99
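
For readers unfamiliar with `removeSparseTerms`, here is a rough base-R sketch of what a 0.95 threshold appears to do, as I understand the tm documentation: a term survives only if it occurs in more than 5% of the documents. The matrix `toy` and the helper `keep_frequent_terms` are made up for illustration and are not part of tm.

```r
# Hypothetical helper (not part of tm): keep only the columns (terms) that
# occur in more than (1 - sparse) of the documents.
keep_frequent_terms <- function(counts, sparse) {
  docs_with_term <- colSums(counts > 0)
  counts[, docs_with_term > nrow(counts) * (1 - sparse), drop = FALSE]
}

# Toy document-term counts: 40 documents, 3 terms (made-up data).
toy <- cbind(
  great  = rep(c(1, 0), times = 20),  # appears in 20 of 40 documents
  cup    = c(rep(1, 5), rep(0, 35)),  # appears in 5 of 40 documents
  rarely = c(2, rep(0, 39))           # appears in 1 of 40 documents
)

filtered <- keep_frequent_terms(toy, sparse = 0.95)
colnames(filtered)  # "great" "cup"; "rarely" is in only 2.5% of documents
```

With the 2000 documents used in the question, the same rule would keep only terms that occur in more than 100 documents.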

Thanks, everyone, for your ideas (thanks, Chelsea Hill!!)
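
One possible explanation for why trimming sparse terms helps (this is my assumption, not something stated above): `naiveBayes` in e1071 models numeric predictors with a per-class Gaussian, and the `laplace` argument only affects categorical predictors. A term that never occurs in one class's training documents has a standard deviation of zero for that class, so its density becomes degenerate. The base-R sketch below uses `dnorm` with a tiny made-up standard deviation (`tiny_sd` is an illustrative stand-in) to show how extreme such densities get.

```r
tiny_sd <- 0.001  # illustrative stand-in for a near-zero standard deviation

# Density when a test document also has a zero count for the term:
dens_zero <- dnorm(0, mean = 0, sd = tiny_sd)
# Density when the term appears once in the test document:
dens_one <- dnorm(1, mean = 0, sd = tiny_sd)

dens_zero  # very large (about 399)
dens_one   # underflows to 0

# Summed over hundreds of such terms, the log-densities can swamp every
# other piece of evidence and push all predictions toward a single class.
n_terms <- 500
n_terms * log(dens_zero)
```

If that is what happened here, dropping the rarest terms removes most of the zero-variance columns, which would be consistent with the improved confusion matrix.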

Votes: 0
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/36959387
