nltk naivebayes classifier for text classification

Stack Overflow user

Asked 2016-09-06 22:38:23

1 answer · 816 views · 0 followers · 3 votes

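In the code below, I know my naive Bayes classifier works correctly because it gives the expected result on trainset1, so why does it not work on trainset2? I even tried two classifiers, one from TextBlob and the other directly from nltk.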

from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob
from nltk.tokenize import word_tokenize
import nltk

trainset1 = [('I love this sandwich.', 'pos'),
('This is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('This is my best work.', 'pos'),
("What an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('He is my sworn enemy!', 'neg'),
('My boss is horrible.', 'neg')]

trainset2 = [('hide all brazil and everything plan limps to anniversary inflation plan initiallyis limping its first anniversary amid soaring prices', 'class1'),
         ('hello i was there and no one came', 'class2'),
         ('all negative terms like sad angry etc', 'class2')]

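def nltk_naivebayes(trainset, test_sentence):
    # Vocabulary: every lower-cased token seen in the training texts.
    all_words = set(word.lower() for passage in trainset for word in word_tokenize(passage[0]))
    # One boolean "contains word" feature per vocabulary entry; the text is
    # lower-cased first so its tokens match the lower-cased vocabulary.
    t = [({word: (word in word_tokenize(x[0].lower())) for word in all_words}, x[1]) for x in trainset]
    classifier = nltk.NaiveBayesClassifier.train(t)
    # Encode the test sentence with the same features (all_words is already lower-case).
    test_tokens = word_tokenize(test_sentence.lower())
    test_sent_features = {word: (word in test_tokens) for word in all_words}
    return classifier.classify(test_sent_features)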

def textblob_naivebayes(trainset, test_sentence):
    cl = NaiveBayesClassifier(trainset)
    blob = TextBlob(test_sentence,classifier=cl)
    return blob.classify() 

test_sentence1 = "he is my horrible enemy"
test_sentence2 = "inflation soaring limps to anniversary"

print(nltk_naivebayes(trainset1, test_sentence1))
print(nltk_naivebayes(trainset2, test_sentence2))
print(textblob_naivebayes(trainset1, test_sentence1))
print(textblob_naivebayes(trainset2, test_sentence2))

Output:

neg
class2
neg
class2

Even though test_sentence2 clearly belongs to class1.

text-classification

document-classification

machine-learning

nlp

nltk

1 Answer

Stack Overflow user

Accepted answer

Answered 2016-09-06 23:11:51

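I assume you understand that you cannot expect a classifier to learn a good model from only 3 examples, and that your question is more about understanding why it behaves this way in this particular case.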

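The likely reason is that a naive Bayes classifier uses prior class probabilities, that is, the probability of each class regardless of the text. In your case, 2/3 of the examples are class2, so the prior is roughly 67% for class2 and 33% for class1. The class1 words in your single class1 instance, such as "anniversary" and "soaring", are unlikely to be enough to compensate for this prior class probability.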
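To make the prior concrete, here is a minimal sketch that computes plain relative-frequency priors from the labels of trainset2. The helper name `class_priors` is made up for this illustration, and the numbers are unsmoothed; the library's own estimate may differ slightly because it can apply smoothing to the label counts as well.

```python
from collections import Counter

# Labels of trainset2 from the question: one class1 example, two class2 examples.
labels = ["class1", "class2", "class2"]

def class_priors(labels):
    """Return the unsmoothed prior P(label) as relative frequencies.

    Hypothetical helper for illustration; not part of nltk or TextBlob.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

priors = class_priors(labels)
for label in sorted(priors):
    print(label, round(priors[label], 2))
```

Before any word is looked at, class2 already starts out about twice as likely as class1, so the words in the test sentence have to overcome that head start.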

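Note in particular that the calculation of word probabilities involves various "smoothing" functions (for example, log10(term frequency + 1) in each class rather than log10(term frequency)) to prevent low-frequency words from affecting the classification result too much, to avoid division by zero, and so on. So, unlike what you might expect, the probabilities for "anniversary" and "soaring" are not 0.0 for class2 and 1.0 for class1.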
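The effect of smoothing on a single word can be sketched as follows. This uses additive smoothing with a constant of 0.5 and two possible feature values (word present, word absent). Both choices are assumptions made for this illustration, and `smoothed_prob` is a hypothetical helper, so the exact numbers produced by nltk or TextBlob may differ.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Additive smoothing: (count + gamma) / (total + gamma * bins).

    gamma=0.5 and bins=2 (word present / word absent) are assumptions
    chosen for this illustration, not values read from the library.
    """
    return (count + gamma) / (total + gamma * bins)

# In trainset2, "soaring" occurs in the single class1 example
# and in neither of the two class2 examples.
p_class1_raw = 1 / 1            # unsmoothed P(soaring present | class1)
p_class2_raw = 0 / 2            # unsmoothed P(soaring present | class2)
p_class1 = smoothed_prob(1, 1)  # smoothed, no longer exactly 1
p_class2 = smoothed_prob(0, 2)  # smoothed, no longer exactly 0

print("unsmoothed:", p_class1_raw, p_class2_raw)
print("smoothed:  ", round(p_class1, 3), round(p_class2, 3))
```

Under these assumptions the word still favours class1, but only by a bounded factor rather than the all-or-nothing 1.0 versus 0.0 one might expect, which is why a handful of matching words may not outweigh the prior.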

Votes: 5

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/39351735
