Stop word removal with NLTK

Stack Overflow user
Asked 2016-09-13 15:29:58
1 answer, 1.2K views, 0 followers, 1 vote

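I have been working with NLTK and database classification, and I am having a problem with stop word removal. When I print the list of stop words, every word is listed with a "u" in front of it. For example: [u'all', u'just', u'being', u'over', u'both', u'through']. I am not sure whether this is normal or part of the problem.

When I print (1_feats), I get a list of words, and some of them are stop words listed in the corpus.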

import os
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import stopwords

stopset = list(set(stopwords.words('english')))
morewords = 'delivery', 'shipment', 'only', 'copy', 'attach', 'material'
stopset.append(morewords)

def word_feats(words):
    return dict([(word, True) for word in words.split() if word not in stopset])

ids_1 = {}
ids_2 = {}
ids_3 = {}
ids_4 = {}
ids_5 = {}
ids_6 = {}
ids_7 = {}
ids_8 = {}
ids_9 = {}

path1 = "/Users/myname/Documents/Data Classifier Files/1/"
for name in os.listdir(path1):
    if name[-4:] == '.txt':
        f = open(path1 + "/" + name, "r")
        ids_1[name] = f.read()
        f.close()    

path2 = "/Users/myname/Documents/Data Classifier Files/2/"
for name in os.listdir(path2):
    if name[-4:] == '.txt':
        f = open(path2 + "/" + name, "r")
        ids_2[name] = f.read()
        f.close()    

path3 = "/Users/myname/Documents/Data Classifier Files/3/"
for name in os.listdir(path3):
    if name[-4:] == '.txt':
        f = open(path3 + "/" + name, "r")
        ids_3[name] = f.read()
        f.close()    

path4 = "/Users/myname/Documents/Data Classifier Files/4/"
for name in os.listdir(path4):
    if name[-4:] == '.txt':
        f = open(path4 + "/" + name, "r")
        ids_4[name] = f.read()
        f.close()   

path5 = "/Users/myname/Documents/Data Classifier Files/5/"
for name in os.listdir(path5):
    if name[-4:] == '.txt':
        f = open(path5 + "/" + name, "r")
        ids_5[name] = f.read()
        f.close()     

path6 = "/Users/myname/Documents/Data Classifier Files/6/"
for name in os.listdir(path6):
    if name[-4:] == '.txt':
        f = open(path6 + "/" + name, "r")
        ids_6[name] = f.read()
        f.close()    

path7 = "/Users/myname/Documents/Data Classifier Files/7/"
for name in os.listdir(path7):
    if name[-4:] == '.txt':
        f = open(path7 + "/" + name, "r")
        ids_7[name] = f.read()
        f.close()    

path8 = "/Users/myname/Documents/Data Classifier Files/8/"
for name in os.listdir(path8):
    if name[-4:] == '.txt':
        f = open(path8 + "/" + name, "r")
        ids_8[name] = f.read()
        f.close()   

path9 = "/Users/myname/Documents/Data Classifier Files/9/"
for name in os.listdir(path9):
    if name[-4:] == '.txt':
        f = open(path9 + "/" + name, "r")
        ids_9[name] = f.read()
        f.close()         

feats_1 = [(word_feats(ids_1[f]), '1') for f in ids_1 ]
feats_2 = [(word_feats(ids_2[f]), "2") for f in ids_2 ]
feats_3 = [(word_feats(ids_3[f]), '3') for f in ids_3 ]
feats_4 = [(word_feats(ids_4[f]), '4') for f in ids_4 ]
feats_5 = [(word_feats(ids_5[f]), '5') for f in ids_5 ]
feats_6 = [(word_feats(ids_6[f]), '6') for f in ids_6 ]
feats_7 = [(word_feats(ids_7[f]), '7') for f in ids_7 ]
feats_8 = [(word_feats(ids_8[f]), '8') for f in ids_8 ]
feats_9 = [(word_feats(ids_9[f]), '9') for f in ids_9 ]



trainfeats = feats_1 + feats_2 + feats_3 + feats_4 + feats_5 + feats_6 + feats_7 + feats_8 + feats_9
classifier = NaiveBayesClassifier.train(trainfeats)

1 Answer

Stack Overflow user

Accepted answer

Posted 2016-09-13 20:08:32

After executing these three lines,

stopset = list(set(stopwords.words('english')))
morewords = 'delivery', 'shipment', 'only', 'copy', 'attach', 'material'
stopset.append(morewords)

take a look at stopset (output shortened):

>>> stopset
[u'all',
 u'just',
 u'being',
 ...
 u'having',
 u'once',
 ('delivery', 'shipment', 'only', 'copy', 'attach', 'material')]
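
On the "u" prefix the question asks about: it is only how Python 2 displays unicode strings. It is not part of the word and does not affect membership tests. A minimal sketch, assuming Python 3.3 or later (where the u'' literal is still accepted) and a made-up three-word list standing in for the real NLTK output:

```python
# Hypothetical stand-in for a few entries of stopwords.words('english');
# this is not the real NLTK list.
sample_stopwords = [u'all', u'just', u'being']

# The u'' literal and the plain literal denote the same string.
same_string = (u'all' == 'all')

# Membership tests therefore work with plain string literals.
found = 'just' in sample_stopwords

print(same_string, found)
```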

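The additional entries from morewords are not on the same level as the words before them. Instead, the whole tuple is treated as a single stop word, which makes no sense.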

The reason is simple: list.append() adds one element, while list.extend() adds many elements.

So change stopset.append(morewords) to stopset.extend(morewords).
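
The difference can be seen in a small sketch. The two-word base list and the shortened morewords tuple below are illustrative stand-ins, not the real NLTK data:

```python
base = ['all', 'just']                 # stand-in for the NLTK stop word list
morewords = 'delivery', 'shipment'     # a tuple, built the same way as in the question

appended = list(base)
appended.append(morewords)   # adds the tuple itself as ONE element

extended = list(base)
extended.extend(morewords)   # adds each word in the tuple separately

print(appended)   # ['all', 'just', ('delivery', 'shipment')]
print(extended)   # ['all', 'just', 'delivery', 'shipment']

# Only the extended list matches the individual words.
print('delivery' in appended, 'delivery' in extended)   # False True
```

With append(), a test such as `word not in stopset` never matches 'delivery', which is why the extra words still showed up in the features.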

Or, better yet, keep the stop words as a set for faster lookup. The correct method for adding multiple elements is set.update():

stopset = set(stopwords.words('english'))
morewords = ['delivery', 'shipment', 'only', 'copy', 'attach', 'material']
stopset.update(morewords)
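
To show how the set fits into the question's word_feats function, here is a minimal sketch. The five-word base set is a hypothetical stand-in for set(stopwords.words('english')), so the block runs without NLTK installed, and the sample sentence is invented:

```python
# Hypothetical stand-in for set(stopwords.words('english')).
stopset = set(['all', 'just', 'being', 'the', 'is'])
stopset.update(['delivery', 'shipment', 'only', 'copy', 'attach', 'material'])

def word_feats(text):
    """Map every non-stop word in text to True, as in the question."""
    return {word: True for word in text.split() if word not in stopset}

feats = word_feats("the shipment is late and the delivery failed")
print(sorted(feats))   # ['and', 'failed', 'late']
```

Note that the comparison is case-sensitive and str.split() leaves punctuation attached, so "Delivery" or "delivery," would pass through this filter. Lowercasing the text first, for example with text.lower().split(), may be worth considering if that matters for the data.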
Votes: 3
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/39473824
