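I have downloaded the following dictionary from Project Gutenberg: http://www.gutenberg.org/cache/epub/29765/pg29765.txt (it is 25 MB, so avoid clicking the link if you are on a slow connection).
In this file, the keywords I am looking for are in upper case, for instance HALLUCINATION. Each keyword is followed by a few lines about pronunciation, which are of no use to me.
What I want to extract is the definition, indicated by "Defn:", and then print those lines. I have come up with this rather ugly "solution":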

    def lookup(search):
        find = search.upper()                  # convert the search term to upper case
        output = []                            # empty dummy list
        infile = open('webster.txt', 'r')      # open the Webster file for reading
        for line in infile:
            for part in line.split():
                if (find == part):
                    for line in infile:
                        if (line.find("Defn:") == 0):  # ugly I know, but my only guess so far
                            output.append(line[6:])
                            print(output)              # uncertain about how to proceed
                            break

Of course, this only prints the first line that comes right after "Defn:". I am new to manipulating .txt files in Python and therefore clueless about how to proceed. I did read the lines into a tuple and noticed that there are special newline characters.
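So I would like to tell Python to keep reading until it runs out of newline characters, but that does not account for the last line that has to be read.
Could someone point me to useful functions I could use to solve this problem (ideally with a minimal example)?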
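Example of desired output:
lookup("hallucinate")
out: To wander; to go astray; to err; to blunder; -- used of mental processes. [R.] Byron.
lookup("hallucination")
out: The perception of objects which have no reality, or of sensations which have no corresponding external cause, arising from disorder or the nervous system, as in delirium tremens; delusion. Hallucinations are always evidence of cerebral derangement and are common phenomena of insanity. W. A. Hammond.
The text looks like this: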
HALLUCINATE
Hal*lu"ci*nate, v. i. Etym: [L. hallucinatus, alucinatus, p. p. of
hallucinari, alucinari, to wander in mind, talk idly, dream.]
Defn: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.
HALLUCINATION
Hal*lu`ci*na"tion, n. Etym: [L. hallucinatio cf. F. hallucination.]
1. The act of hallucinating; a wandering of the mind; error; mistake;
a blunder.
This must have been the hallucination of the transcriber. Addison.
2. (Med.)
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.
HALLUCINATOR
Hal*lu"ci*na`tor, n. Etym: [L.]
Posted on 2014-10-20 17:43:38
Here is a function that returns the first definition:
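
    def lookup(word):
        word_upper = word.upper()
        found_word = False   # True once the headword line has been seen
        found_def = False    # True once the "Defn:" line has been seen
        defn = ''
        with open('dict.txt', 'r') as file:
            for line in file:
                l = line.strip()
                if not found_word and l == word_upper:
                    found_word = True
                elif found_word and not found_def and l.startswith("Defn:"):
                    found_def = True
                    defn = l[6:]   # skip "Defn: " (six characters)
                elif found_def and l != '':
                    defn += ' ' + l
                elif found_def and l == '':
                    return defn
        # the file ended: return what was collected, or False if nothing was found
        return defn if found_def else False

    print(lookup('hallucination'))

Explanation: we have to consider four different cases.
1. We have not found the word yet. Compare the current line with the upper-cased search word; if they are equal, we have found the word.
2. We have found the word but not yet the start of its definition. Look for a line that begins with Defn:. If we find it, we add that line to the definition (without the six characters of "Defn: ").
3. We have found the start of the definition and the current line is not empty. Append the line to the definition.
4. We have found the start of the definition and the current line is empty. The definition is complete, so we return it.
If we find nothing, we return False.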
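Note: certain entries, e.g. CRANE, have more than one definition. The code above does not handle that; it only returns the first definition. Given the format of the file, though, writing a perfect solution is far from easy.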
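One possible way to collect every definition of an entry is sketched below. It rests on assumptions about the file layout that this answer does not confirm: a headword stands alone on its line right after a blank line, a blank line ends each "Defn:" paragraph, and any other all-upper-case line after a blank line starts the next entry. The name `lookup_all` and the upper-case heuristic are hypothetical, and the heuristic can misfire on unusual entries.

```python
def lookup_all(word, lines):
    """Return a list with every 'Defn:' paragraph found under `word`.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    headword = word.upper()
    definitions = []    # finished definitions, in file order
    current = None      # pieces of the definition being collected, if any
    in_entry = False    # True once the headword line has been seen
    prev_blank = True   # was the previous line blank (or the start of input)?

    for raw in lines:
        line = raw.strip()
        if not in_entry:
            # Assumption: a headword is alone on its line after a blank line.
            if prev_blank and line == headword:
                in_entry = True
        elif line == '':
            # A blank line closes the definition being collected.
            if current is not None:
                definitions.append(' '.join(current))
                current = None
        elif prev_blank and line != headword and line.isupper():
            # Heuristic: a different all-upper-case line starts the next entry.
            break
        elif line.startswith('Defn:'):
            current = [line[len('Defn:'):].strip()]
        elif current is not None:
            current.append(line)
        prev_blank = (line == '')

    # The input may end without a trailing blank line.
    if current is not None:
        definitions.append(' '.join(current))
    return definitions
```

It could be called as `with open('dict.txt') as f: print(lookup_all('crane', f))`. A repeated headword (the same word listed again as another part of speech) is treated as part of the same lookup, so its definitions are appended to the list as well.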
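Posted on 2014-10-20 17:42:51
From here I learned a simple way to work with memory-mapped files and use them as if they were strings. You can then use something like this to get the first definition of a term.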
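
    import mmap

    def lookup(search):
        term = search.upper().encode('ascii')   # mmap searches bytes, not str
        with open('webster.txt', 'rb') as f:
            s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            # a headword sits on its own line, right after a blank line
            index = s.find(b'\r\n\r\n' + term + b'\r\n')
            if index == -1:
                return None
            marker = s.find(b'Defn:', index)
            if marker == -1:
                return None
            definition = marker + len(b'Defn:') + 1    # skip "Defn:" and the space after it
            endline = s.find(b'\r\n\r\n', definition)  # a blank line ends the definition
            if endline == -1:
                endline = len(s)
            # decoding as UTF-8 is an assumption about the file's encoding
            return s[definition:endline].decode('utf-8', 'replace')

    print(lookup('hallucination'))
    print(lookup('hallucinate'))

Assumptions (as implied by the code): the file uses \r\n line endings, each headword stands alone on a line preceded by a blank line, only the first Defn: after the headword is returned, and the definition ends at the next blank line.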
Posted on 2014-10-20 18:05:17
You can split the text into paragraphs, use the index of the search word, and find the first Defn paragraph after it:

    import re

    def find_def(path, word):
        # newline='' keeps the file's \r\n line endings (Python 3 would otherwise turn them into \n)
        with open(path, newline='') as f:
            text = f.read()
        try:
            # find where our search word is, on a line of its own
            start = text.index("\r\n{}\r\n".format(word))
        except ValueError:
            return "Cannot find search term"
        # split into paragraphs; maxsplit=10 because no definition groups more paragraphs than that
        paras = re.split(r"\s+\r\n", text[start:], 10)
        for para in paras:
            if para.startswith("Defn:"):  # the first paragraph starting with Defn: is what we need
                return para

    print(find_def("in.txt", "HALLUCINATION"))

Run against the whole file, this returns:
In [5]: print find_def("gutt.txt","VACCINATOR")
Defn: One who, or that which, vaccinates.
In [6]: print find_def("gutt.txt","HALLUCINATION")
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.
A slightly shorter version:

    import re

    def find_def(path, word):
        with open(path, newline='') as f:   # newline='' preserves the \r\n line endings
            text = f.read()
        try:
            start = text.index("\r\n{}\r\n".format(word))
        except ValueError:
            return "Cannot find search term"
        defn = text.index("Defn:", start)   # position of the first Defn: after the word
        return re.split(r"\s+\r\n", text[defn:], 1)[0]

https://stackoverflow.com/questions/26471111