I am using the code below to tokenize a string read from stdin.
import sys
d = []
cur = ''
for i in sys.stdin.readline():
    if i in ' .':
        if cur not in d and (cur != ''):
            d.append(cur)
            cur = ''
    else:
        cur = cur + i.lower()
This gives me an array of unique words. However, in my output some words are not split apart.
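My input is
Dan went to the north pole to lead an expedition during summer.
The output array d is
['dan', 'went', 'to', 'the', 'north', 'pole', 'tolead', 'an', 'expedition', 'during', 'summer']
Why are "to" and "lead" joined together as tolead?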
Posted on 2013-07-29 17:50:20
Try this:
import sys
d = []
cur = ''
for i in sys.stdin.readline():
    if i in ' .':
        if cur not in d and (cur != ''):
            d.append(cur)
        cur = ''  # note the different indentation
    else:
        cur = cur + i.lower()
Posted on 2013-07-29 17:55:33
Try this:
import sys
# iterate over the lines of stdin, not over the characters of a single line
for line in sys.stdin:
    res = set(word.lower() for word in line[:-1].split(" "))
    print res
Example:
line = "Dan went to the north pole to lead an expedition during summer."
res = set(word.lower() for word in line[:-1].split(" "))
print res
set(['north', 'lead', 'expedition', 'dan', 'an', 'to', 'pole', 'during', 'went', 'summer', 'the'])
Edit, following the comments: this solution preserves the input order and filters out the separators.
import re
from collections import OrderedDict
line = "Dan went to the north pole to lead an expedition during summer."
list(OrderedDict.fromkeys(re.findall(r"[\w']+", line)))
# ['Dan', 'went', 'to', 'the', 'north', 'pole', 'lead', 'an', 'expedition', 'during', 'summer']
Posted on 2013-07-29 18:01:32
"to" is already in d. Your loop therefore skips past the space between "to" and "lead" without resetting cur, and keeps concatenating. When it reaches the next space, it sees that "tolead" is not in d, so it appends it.
An easier solution (Python 2 syntax), which also strips all forms of punctuation:
>>> import string
>>> set("Dan went to the north pole to lead an expedition during summer.".translate(None, string.punctuation).lower().split())
set(['summer', 'north', 'lead', 'expedition', 'dan', 'an', 'to', 'pole', 'during', 'went', 'the'])
https://stackoverflow.com/questions/17930734
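The two-argument `translate(None, string.punctuation)` call in the last answer works only on Python 2 strings. A rough Python 3 equivalent, offered as a sketch rather than as part of the original answer, builds a deletion table with `str.maketrans`. The helper name `unique_words` is hypothetical.

```python
import string


def unique_words(text):
    """Return the set of lowercased words with ASCII punctuation removed."""
    # The third argument of str.maketrans lists the characters to delete.
    table = str.maketrans('', '', string.punctuation)
    return set(text.translate(table).lower().split())


line = "Dan went to the north pole to lead an expedition during summer."
print(unique_words(line))
```

As in the answer, the result is a set, so the input order is lost. If order matters on Python 3.7 or later, passing the split words through `list(dict.fromkeys(...))` keeps the first occurrence of each word in order.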