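I am using the NLTK tagger with Python 2.7 to tag a simple English text, in order to extract each word's frequency and its named-entity category. The following program is used for this purpose: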

    import re
    from collections import Counter
    from nltk.tag.stanford import NERTagger
    from nltk.corpus import stopwords

    stops = set(stopwords.words("english"))
    WORD = re.compile(r'\w+')

    def main():
        text = "title Optimal Play against Best Defence: Complexity and Heuristics"
        print text
        words = WORD.findall(text)
        print words
        # Frequency of each token as it appears in the original text
        word_frqc = Counter(words)
        tagger = NERTagger("./classifiers/english.all.3class.distsim.crf.ser.gz",
                           "stanford-ner.jar")
        terms = []
        answer = tagger.tag(words)
        print answer
        for i, word_pos in enumerate(answer):
            word, pos = word_pos
            # Map the named-entity tag to a numeric category id
            if pos == 'PERSON':
                cat_Id = 1
            elif pos == 'ORGANIZATION':
                cat_Id = 2
            elif pos == 'LOCATION':
                cat_Id = 3
            else:
                cat_Id = 4
            # Looks up the word as returned by the tagger, not the original token
            frqc = word_frqc.get(word)
            terms.append((i, word, cat_Id, frqc))
        print terms

    if __name__ == '__main__':
        main()

The output of the program is as follows:
text = "title Optimal Play against Best **Defence:** Complexity and Heuristics"
[(u'title', u'O'), (u'Optimal', u'O'), (u'Play', u'O'), (u'against', u'O'),
(u'Best', u'O'), (u'Defense', u'O'), (u'Complexity', u'O'), (u'and', u'O'),
(u'Heuristics', u'O')]
[(0, u'title', 4, 1), (1, u'Optimal', 4, 1), (2, u'Play', 4, 1), (3,
u'against', 4, 1), (4, u'Best', 4, 1), (5, u'**Defense**', 4, None), (6,
u'Complexity', 4, 1), (7, u'and', 4, 1), (8, u'Heuristics', 4, 1)]

There is a problem caused by the tagger.tag() method: it changes the word "Defence" from the original text to "Defense". As a result, the program cannot find "Defense" in word_frqc, so word_frqc.get() returns None and that word's frequency is reported as None.
Is there any way (in Python) to stop the method from changing the words?
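
To make the goal concrete, here is a minimal sketch of one possible workaround. It does not stop the tagger from rewriting tokens. Instead it looks up the frequency using the original token at the same position. It assumes the tagger returns exactly one (token, tag) pair per input token, in the same order, which matches the output shown above but is not something I have confirmed in general. `build_terms` and `CATEGORY_IDS` are hypothetical names, and `tagged` is a hand-copied stand-in for `tagger.tag(words)` so the snippet runs without Stanford NER.

```python
from collections import Counter

# Hypothetical helper table: the same tag-to-id mapping as the if/elif chain above
CATEGORY_IDS = {'PERSON': 1, 'ORGANIZATION': 2, 'LOCATION': 3}

def build_terms(words, tagged):
    """Pair each original word with the tag found at the same position."""
    # Assumption: one output pair per input token, in input order
    if len(words) != len(tagged):
        raise ValueError("tagger output is not aligned with the input tokens")
    word_frqc = Counter(words)
    terms = []
    for i, (word, (_, tag)) in enumerate(zip(words, tagged)):
        # Use the original word, ignoring the token text returned by the tagger
        terms.append((i, word, CATEGORY_IDS.get(tag, 4), word_frqc[word]))
    return terms

words = ['title', 'Optimal', 'Play', 'against', 'Best', 'Defence',
         'Complexity', 'and', 'Heuristics']
# Stand-in for tagger.tag(words), copied from the output in the question
tagged = [(u'title', u'O'), (u'Optimal', u'O'), (u'Play', u'O'),
          (u'against', u'O'), (u'Best', u'O'), (u'Defense', u'O'),
          (u'Complexity', u'O'), (u'and', u'O'), (u'Heuristics', u'O')]

terms = build_terms(words, tagged)
```

With this approach the entry at index 5 keeps the original spelling "Defence" and gets a frequency of 1 rather than None. If the tagger ever splits or merges tokens, the length check raises an error instead of silently misaligning words and tags.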
Posted on 2016-01-22 11:53:49
https://stackoverflow.com/questions/28360085