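I want to use Python to extract certain information from a large file. I have three input files. The first input file (input_file) is the data file, a 3-column tab-separated file that looks like this: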
engineer-n imposition-n 2.82169386609e-05
motor-n imposition-n 0.000102011705117
creature-n imposition-n 0.000121321951973
bomb-n imposition-n 0.000680302090112
sedation-n oppression-n 0.000397074586994
roadblock-n oppression-n 5.96190620847e-05
liability-n oppression-n 0.012845281978
currency-n oppression-n 0.000793989880202
The second input file (colA_file) is a 1-column list that looks like this:
bomb-n
sedation-n
roadblock-n
surrender-n
The third input file (colB_file) is also a 1-column list (same format as colA_file, but with different information) that looks like this:
adjective-n
homeless-n
imposition-n
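oppression-n
I want to extract the rows of the input file whose first column is found in colA and whose second column is found in colB. With the sample data I provided, this means filtering out everything except the following lines: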
bomb-n imposition-n 0.000680302090112
sedation-n oppression-n 0.000397074586994
roadblock-n oppression-n 5.96190620847e-05
To solve this, I wrote the following Python code:
def test_fnc(input_file, colA_file, colB_file, output_file):
    nounA = []
    with open(colA_file, "rb") as opened_colA:
        for aLine in opened_colA:
            nounA.append(aLine.strip())
    #print nounA
    nounB = []
    with open(colB_file, "rb") as opened_colB:
        for bLine in opened_colB:
            nounB.append(bLine.strip())
    #print nounB
    with open(output_file, "wb") as outfile:
        with open(input_file, "rb") as opened_input:
            for cLine in opened_input:
                splitted_cLine = cLine.split()
                #print splitted_cLine
                if splitted_cLine[0] in nounA and splitted_cLine[1] in nounB:
                    outstring = "\t".join(splitted_cLine)
                    outfile.write(outstring + "\n")
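test_fnc(input_file, colA_file, colB_file, output_file)
However, it outputs only 1 line, as if it were not iterating over the lists I supplied. My lists also seem to be appended to each other, starting with one item and growing with each append. So I also tried assigning the list like this: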
for bLine in opened_colB:
    nounB = bLine
The result is the same as above.
Posted 2014-05-08 13:32:15
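If you don't mind the dependency, I would use pandas or numpy. With a pandas.DataFrame you can run an isin check on its columns. Otherwise, I suggest using sets, since regex should be much slower. Something like this: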
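With pandas, a minimal sketch could look like the following. It assumes the data file is tab-separated with no header row; the helper name filter_pairs and the column labels a, b and score are made up for illustration, and the sample rows are taken from the question.

```python
import io

import pandas as pd


def filter_pairs(data, col_a_words, col_b_words):
    """Keep rows whose 1st column is in col_a_words and 2nd in col_b_words."""
    # dtype=str keeps the scores as text, so they are not reformatted.
    df = pd.read_csv(data, sep="\t", header=None,
                     names=["a", "b", "score"], dtype=str)
    mask = df["a"].isin(col_a_words) & df["b"].isin(col_b_words)
    return df[mask]


# In-memory stand-in for input_file; a file path works the same way.
sample = io.StringIO(
    "engineer-n\timposition-n\t2.82169386609e-05\n"
    "bomb-n\timposition-n\t0.000680302090112\n"
    "sedation-n\toppression-n\t0.000397074586994\n"
    "liability-n\toppression-n\t0.012845281978\n"
)
col_a = {"bomb-n", "sedation-n", "roadblock-n", "surrender-n"}
col_b = {"adjective-n", "homeless-n", "imposition-n", "oppression-n"}

kept = filter_pairs(sample, col_a, col_b)
# Write the surviving rows back out in the original tab-separated layout.
print(kept.to_csv(sep="\t", header=False, index=False), end="")
```

And with plain sets, no dependency needed: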
# Sets give constant-time membership tests, unlike lists.
with open(colA_file, "rb") as file_h:
    noun_a = set(line.strip() for line in file_h)
with open(colB_file, "rb") as file_h:
    noun_b = set(line.strip() for line in file_h)
with open(output_file, "wb") as outfile:
    with open(input_file, "rb") as opened_input:
        for line in opened_input:
            split_line = line.split()
            # Skip blank or short lines, then keep the row only if both
            # columns are in the lookup sets.
            if (len(split_line) >= 2 and split_line[0] in noun_a
                    and split_line[1] in noun_b):
                outfile.write(line)
Posted 2014-05-08 11:32:24
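import re
# Build one alternation regex from the words in the first list.
# Files are opened in text mode so the str patterns can match the lines.
nounA = []
with open('col1.txt', "r") as opened_colA:
    for aLine in opened_colA:
        nounA.append(aLine.strip())
patterns = [r'\b%s\b' % re.escape(s.strip()) for s in nounA]
col1 = re.compile('|'.join(patterns))
# Do the same for the second list.
nounB = []
with open('col2.txt', "r") as opened_colB:
    for bLine in opened_colB:
        nounB.append(bLine.strip())
patterns = [r'\b%s\b' % re.escape(s.strip()) for s in nounB]
col2 = re.compile('|'.join(patterns))
# Print every line of the data file that matches both regexes.
with open('test1.txt', "r") as opened_input:
    for aLine in opened_input:
        if col1.search(aLine):
            if col2.search(aLine):
                print(aLine)
                # just write aLine to your output file.
Explanation: First, I take all the words in colA and build a regular expression from them; col2 is handled the same way. Then I search each line of the input file with both regexes and print the lines that match.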
'\b' is a word boundary. If you search for the word 'cat', a plain pattern would also match inside 'catch'; '\b' is useful for matching only the whole word 'cat'.
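As a small illustration of the word-boundary point, here is a sketch comparing a plain alternation with a '\b'-wrapped one. The helper names plain_pattern and word_pattern are hypothetical. Note that '-' is not a word character, so '\b' also sees a boundary inside hyphenated tokens such as 'bomb-n'.

```python
import re


def plain_pattern(words):
    """Alternation of the escaped words, with no boundary checks."""
    return re.compile("|".join(re.escape(w) for w in words))


def word_pattern(words):
    """Alternation of the escaped words, each wrapped in \\b...\\b."""
    return re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words))


plain = plain_pattern(["cat"])
bounded = word_pattern(["cat"])

print(bool(plain.search("catch the ball")))    # True: 'cat' inside 'catch'
print(bool(bounded.search("catch the ball")))  # False: no whole word 'cat'
print(bool(bounded.search("the cat sat")))     # True: whole word 'cat'

# '-' is a non-word character, so 'bomb' counts as a whole word in 'bomb-n'.
print(bool(word_pattern(["bomb"]).search("bomb-n imposition-n")))  # True
```

Because the search runs over the whole line, a word from one list can also match in the other column; splitting the line and testing each column separately avoids that.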
https://stackoverflow.com/questions/23540309