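I have a folder that contains other folders, each holding many text files, about 32,214 files in total. I want to print the 5 words before and after a specific word. My code, shown below, reads all the files, but reading them and extracting the sentences takes about 8 hours. How can I change the code so that it reads the files and prints the sentences within a few minutes? (The language is Persian.)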
.
.
.
def extact_sentence ():
    f= open ("پاکت", "w", encoding = "utf-8")
    y = "پاکت"
    text= normal_text(folder_path) # the first function to normalize the files
    for i in text:
        for line in i:
            split_line = line.split()
            if y in split_line:
                index = split_line.index(y)
                d = (' '.join(split_line[max(0,index-5):min(index+6,len(split_line))]))
                f.write(d + "\n")
    f.close()

Posted on 2016-12-15 18:25:40
Use os.walk to visit all the files. Then run a rolling window over each file and check the middle word of each window:
import os
from itertools import islice

def getRollingWindow(seq, w):
    # Yield successive w-word windows over the iterator seq.
    win = list(islice(seq, w))
    if len(win) < w:
        return  # fewer than w words: there is no full window to yield
    yield win
    for e in seq:
        win[:-1] = win[1:]
        win[-1] = e
        yield win

def extractSentences(rootDir, searchWord):
    with open("پاکت", "w", encoding="utf-8") as outfile:
        for root, _dirs, fnames in os.walk(rootDir):
            for fname in fnames:
                print("Looking in", os.path.join(root, fname))
                with open(os.path.join(root, fname), encoding="utf-8") as infile:
                    # Stream the words lazily; the generator expression needs its own parentheses.
                    words = (word for line in infile for word in line.split())
                    for window in getRollingWindow(words, 11):
                        if window[5] != searchWord: continue
                        outfile.write(' '.join(window) + "\n")

https://stackoverflow.com/questions/41170573
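One limitation of checking only the middle word of a full 11-word window is that a match within the first or last five words of a file never lands in the middle, so it is skipped. The following is a minimal sketch of one way around that, not part of the original answer: it pads the word stream on both sides and keeps the window in a `collections.deque`. The names `context_windows`, `iter_words`, `extract_contexts`, `radius` and `out_path` are hypothetical, and the sketch assumes the files are UTF-8 and that `out_path` lies outside `root_dir`, so the output file is not read back as input.

```python
import os
from collections import deque
from itertools import chain


def context_windows(words, target, radius=5):
    """Yield up to `radius` words on each side of every occurrence of `target`."""
    pad = [None] * radius  # padding so that matches near the edges are still centred
    window = deque(maxlen=2 * radius + 1)
    for word in chain(pad, words, pad):
        window.append(word)  # a full deque drops its oldest item automatically
        if len(window) == window.maxlen and window[radius] == target:
            yield [w for w in window if w is not None]


def iter_words(path):
    """Stream the whitespace-separated words of one file (assumed UTF-8)."""
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            yield from line.split()


def extract_contexts(root_dir, target, out_path, radius=5):
    """Write one context line per match found in any file under root_dir."""
    with open(out_path, "w", encoding="utf-8") as outfile:
        for root, _dirs, fnames in os.walk(root_dir):
            for fname in fnames:
                words = iter_words(os.path.join(root, fname))
                for context in context_windows(words, target, radius):
                    outfile.write(" ".join(context) + "\n")
```

As with the rolling window above, the context here can span line breaks, whereas the code in the question only looks within a single line. If line-bound context is needed, `context_windows` could be called once per line instead of once per file.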