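I want to write a program that uses a point system to classify spam email.
The email contains a few sentences, and I want the program to score every word in it that my program classifies as a "junk word". I have assigned different points to different words, so each junk word is worth some number of points.
My pseudocode:
- for each word that comes up, give the points the word is worth.
Example (text file):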
Hello!
Do you have trouble sleeping?
Do you need to rest?
Then dont hesitate call us for the absolute solution- without charge!
So when the program runs and analyzes the text above, the output should look like this:
SPAM 14p
trouble 6p
charge 3p
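solution 5p
So I was planning to write it this way:
class junk(object):
    fil = open("filnamne.txt","r")
    junkwords = {"trouble":"6p","solution":"3p","virus":"4p"}
    words = junkwords
    if words in fil:
        print("SPAM")
    else:
        print("The file doesn't contain any junk")
So my question now is: how do I give points to each word from my list that appears in the file?
And how do I add up the total points, so that if total_points > 10 the program prints "SPAM",
followed by a list of the "junk words" found in the file and the total points for each word?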
Posted on 2013-03-04 13:31:14
Here is a quick script that should get you close:
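MAXPOINTS = 10
JUNKWORDS = {"trouble": 6, "solution": 5, "charge": 3, "virus": 7}

foundwords = {}  # junk word -> number of occurrences in the file
points = 0
with open("filnamne.txt", "r") as fil:
    for word in fil.read().split():
        if word in JUNKWORDS:
            if word not in foundwords:
                foundwords[word] = 0
            points += JUNKWORDS[word]
            foundwords[word] += 1

if points > MAXPOINTS:
    print("SPAM")
    # Print each junk word with the total points it contributed
    for word in foundwords:
        print(word, foundwords[word] * JUNKWORDS[word])
else:
    print("The file doesn't contain any junk")
You may want to call .lower() on the words and keep all dictionary keys lowercase. You could also strip all non-alphanumeric characters, since split() leaves punctuation attached to the words.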
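As a rough sketch of the normalization suggested above, each token can be lowercased and stripped of non-alphanumeric characters before the dictionary lookup. The helper names `normalize` and `score` and the sample string are illustrative assumptions, not part of the original script, and the pattern only keeps ASCII letters and digits.

```python
import re

JUNKWORDS = {"trouble": 6, "solution": 5, "charge": 3, "virus": 7}


def normalize(token):
    """Lowercase a token and drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^a-z0-9]", "", token.lower())


def score(text):
    """Return (total points, {junk word: occurrences}) for the given text."""
    found = {}
    points = 0
    for token in text.split():
        word = normalize(token)
        if word in JUNKWORDS:
            found[word] = found.get(word, 0) + 1
            points += JUNKWORDS[word]
    return points, found


# Hypothetical sample text, adapted from the question's example file
sample = "Do you have Trouble sleeping? Call us for the absolute solution- without charge!"
points, found = score(sample)
print(points, found)
```

With this cleanup, tokens such as "solution-" and "charge!" match their dictionary keys, which a plain split() would miss.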
Posted on 2013-03-04 13:38:06
Here is another approach:
from collections import Counter

word_points = {'trouble': 6, 'solution': 5, 'charge': 3, 'virus': 7}

words = []
with open('ham.txt') as f:
    for line in f:
        if line.strip():  # weed out empty lines
            for word in line.split():
                words.append(word)

count_of_words = Counter(words)

total_points = {}
for word in word_points:
    if word in count_of_words:
        total_points[word] = word_points[word] * count_of_words[word]

# Sum the point values, not the dictionary keys (which are the words)
total = sum(total_points.values())
if total > 10:
    print('SPAM {}'.format(total))
    for word, pts in total_points.items():
        print('Word: {} Points: {}'.format(word, pts))
There is room for optimization, but it should give you an idea of the general logic. Counter is available in Python 2.7 and later.
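One possible tightening, sketched here on an in-memory string instead of a file so that it runs on its own: pass the words straight to `Counter` and build the per-word totals with a dict comprehension. The function name `score_words` and the sample text are hypothetical.

```python
from collections import Counter

word_points = {'trouble': 6, 'solution': 5, 'charge': 3, 'virus': 7}


def score_words(text):
    """Return {junk word: points * occurrences} for junk words present in text."""
    counts = Counter(text.split())  # split() with no arguments also skips blank lines
    return {w: word_points[w] * counts[w] for w in word_points if w in counts}


totals = score_words("no trouble at all\n\ntrouble with a virus")
print(sum(totals.values()), totals)
```

For a file, the same function could be called as `score_words(f.read())` inside the `with open(...)` block.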
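Posted on 2013-03-04 13:46:57
I assume each word has a different point value, so I used a dictionary.
You need to find how many times each word occurs in the file.
You should store each word's points as an integer, not as '6p' or '4p'.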
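If the scores already exist as strings such as '6p', they can be converted once up front. A minimal sketch, assuming every value is a whole number followed by a single 'p' suffix, as in the question's dictionary:

```python
junkwords = {"trouble": "6p", "solution": "3p", "virus": "4p"}

# Strip the trailing "p" and convert what is left to an integer.
word_points = {word: int(value.rstrip("p")) for word, value in junkwords.items()}

print(word_points)
# Integers can be summed directly; the "p" suffix is only needed when printing.
print("{}p".format(sum(word_points.values())))
```

Keeping the values numeric means the suffix is formatting only, added at print time.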
So, try this:
def find_junk(filename):
    word_points = {"trouble": 6, "solution": 3, "charge": 2, "virus": 4}
    word_count = {word: 0 for word in word_points}
    count = 0
    found = []
    with open(filename) as f:
        for line in f:
            line = line.lower()
            for word in word_points:
                # Substring count, so "charged" also counts as "charge"
                c = line.count(word)
                if c > 0:
                    count += c * word_points[word]
                    if word not in found:  # list each word only once
                        found.append(word)
                    word_count[word] += c
    if count >= 10:
        print(' SPAM' * 4)
        # Columns: word, points per occurrence, number of occurrences
        for word in found:
            print('%10s%3s%3s' % (word, word_points[word], word_count[word]))
    else:
        print("Not spam")

find_junk('spam.txt')
https://stackoverflow.com/questions/15202457