I have a .txt file (example):

A professional is a person who is engaged in a certain activity, or occupation, for gain or compensation as means of livelihood; such as a permanent career, not as an amateur or pastime. Due to the personal and confidential nature of many professional services, and thus the necessity to place a great deal of trust in them, most professionals are subject to strict codes of conduct enshrining rigorous ethical and moral obligations.

How can I count how many times the word "professional" occurs? (Is using NLTK the best option for this?)

text_file = open("text.txt", "r+b")

Posted on 2013-05-28 09:06:52
To better reflect what you want, I have changed my answer:
from nltk import word_tokenize

with open('file_path') as f:
    content = f.read()

# we will use your text example instead:
content = "A professional is a person who is engaged in a certain activity, or occupation, for gain or compensation as means of livelihood; such as a permanent career, not as an amateur or pastime. Due to the personal and confidential nature of many professional services, and thus the necessity to place a great deal of trust in them, most professionals are subject to strict codes of conduct enshrining rigorous ethical and moral obligations."

def Count_Word(word, data):
    c = 0
    tokens = word_tokenize(data)
    for token in tokens:
        token = token.lower()
        # this plural check is dangerous if the word you are looking for itself ends with an 's'
        token = token[:-1] if token[-1] == 's' else token
        if token == word:
            c += 1
    return c

print(Count_Word('professional', content))
>>>
3

Here is a modified version of the method:
def Count_Word(word, data, leading=[], trailing=["'s", "s"]):
    c = 0
    tokens = word_tokenize(data)
    for token in tokens:
        token = token.lower()
        for lead in leading:
            if token.startswith(lead):
                token = token.partition(lead)[2]
        for trail in trailing:
            if token.endswith(trail):
                token = token.rpartition(trail)[0]
        if token == word:
            c += 1
    return c

I have added optional arguments: lists of leading or trailing parts of the word that should be stripped off before comparing. For now I have only used the defaults 's and s, but if you find that others suit you, you can always add them. If the lists start getting long, you could make them constants.
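To see what the stripping loops do to a single token, here is a small standalone sketch of just that part of the function, with no NLTK needed ("professionals" as the sample token is my own choice for illustration):

```python
leading = []
trailing = ["'s", "s"]

token = "professionals"
# strip any matching leading part off the front of the token
for lead in leading:
    if token.startswith(lead):
        token = token.partition(lead)[2]
# strip any matching trailing part off the end of the token
for trail in trailing:
    if token.endswith(trail):
        token = token.rpartition(trail)[0]
print(token)  # professional
```

Note that `rpartition(trail)[0]` removes everything from the last occurrence of the trailing string onward, which for a one-character suffix like "s" is exactly the suffix itself.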
Posted on 2013-05-28 09:11:32
This can be solved in one line (plus the import):
>>> from collections import Counter
>>> Counter(w.lower() for w in open("text.txt").read().split())['professional']
2

Posted on 2013-05-28 09:04:16
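A caveat about the one-liner above: `split()` leaves punctuation attached to words and does not strip the plural "s", which is why it reports 2 rather than 3. If you also want to count "professionals", one hedged alternative is a regex with an optional plural suffix (the pattern below is an assumption about which forms you want to match):

```python
import re

content = "A professional is a person who is engaged in a certain activity, or occupation, for gain or compensation as means of livelihood; such as a permanent career, not as an amateur or pastime. Due to the personal and confidential nature of many professional services, and thus the necessity to place a great deal of trust in them, most professionals are subject to strict codes of conduct enshrining rigorous ethical and moral obligations."

# \bprofessionals?\b matches "professional" or "professionals" as a whole word
matches = re.findall(r"\bprofessionals?\b", content.lower())
print(len(matches))  # 3
```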
You could simply tokenize the string and then search through all the tokens... but that is only one way. There are many others...
import nltk

s = text_file.read()  # text_file as opened in the question
tokens = nltk.word_tokenize(s)
counter = 0
for token in tokens:
    toke = token
    if token[-1] == "s":
        toke = token[0:-1]
    if toke.lower() == "professional":
        counter += 1
print(counter)

https://stackoverflow.com/questions/16787911
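One of those other ways needs no NLTK at all: split on whitespace, strip surrounding punctuation with str.strip, and tally with collections.Counter. This is a sketch, assuming simple whitespace tokenization is acceptable for your text:

```python
import string
from collections import Counter

text = ("A professional is a person who is engaged in a certain activity, "
        "or occupation, for gain or compensation as means of livelihood; "
        "such as a permanent career, not as an amateur or pastime. Due to "
        "the personal and confidential nature of many professional services, "
        "and thus the necessity to place a great deal of trust in them, most "
        "professionals are subject to strict codes of conduct enshrining "
        "rigorous ethical and moral obligations.")

# split on whitespace, strip leading/trailing punctuation, lowercase
words = [w.strip(string.punctuation).lower() for w in text.split()]
counts = Counter(words)

print(counts["professional"])                            # 2
print(counts["professional"] + counts["professionals"])  # 3
```

Unlike the bare `split()` one-liner, this handles words followed by punctuation (e.g. "livelihood;"), and you can decide explicitly whether to add the plural form into the count.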