I wrote a function to extract a specific block of text from a large text file. The sample text looks like this:
ATP(1):C39(3) - A:TYR(58):CD2(67)
ATP(1):C39(3) - A:TYR(58):CE2(69)
ATP(1):C59(6) - A:ILE(61):CD1(100)
ATP(1):C59(6) - A:LYS(87):CE(344)
Hydrogen bonds:
Location of Donor | Sidechain/Backbone | Secondary Structure | Count
-------------------|--------------------|---------------------|-------
LIGAND | SIDECHAIN | OTHER | 1
RECEPTOR | BACKBONE | BETA | 1
Raw data:
ATP(1):O2A(9) - A:ILE(61):HN(93) - A:ILE(61):N(92)
Hydrophobic contacts (C-C):
Sidechain/Backbone | Secondary Structure | Count
--------------------|---------------------|-------
SIDECHAIN | OTHER | 2
SIDECHAIN | BETA | 23
Raw data:
ATP(1):C39(3) - A:TYR(58):CD2(67)
ATP(1):C39(3) - A:TYR(58):CE2(69)
ATP(1):C59(6) - A:ILE(61):CD1(100)
ATP(1):C59(6) - A:LYS(87):CE(344)
ATP(1):C4(23) - A:PHE(209):CD1(1562)
ATP(1):C4(23) - A:PHE(209):CE1(1564)
ATP(1):C2(26) - A:PHE(209):CD2(1563)
ATP(1):C6(28) - A:PHE(209):CB(1560)
ATP(1):C6(28) - A:PHE(209):CG(1561)
ATP(1):C6(28) - A:PHE(209):CD1(1562)
ATP(1):C6(28) - A:VAL(286):CG2(2266)
pi-pi stacking interactions:
ATP(1):C8(30) - A:LYS(87):CG(342)
ATP(1):C8(30) - A:GLU(159):CD(1066)
ATP(1):C8(30) - A:PHE(209):CE1(1564)
I wrote this function to extract the block:
from itertools import islice

def start_end_points(file_name):
    f = open(file_name)
    lines = f.readlines()
    for s, line in enumerate(lines):
        if "Hydrogen bonds:" in line:
            print(s)
    for e, line in enumerate(lines):
        if "pi-pi stacking interactions:" in line:
            print(e)
    print(islice(lines, s, e))
start_end_points("foo.txt")
Is there a more efficient way to write this code? I want to use it as part of a web tool, so efficiency is very important.
Thanks.
Posted on 2017-03-01 10:59:57
There is no reason to load the entire file into memory!
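def start_end_points(file_name):
    with open(file_name) as f:
        found = False
        for line in f:
            if found or ("Hydrogen bonds:" in line):
                found = True
                # each line already ends with a newline, so strip it before printing
                print(line.rstrip("\n"))
                # the end-marker line is printed too, then reading stops
                if "pi-pi stacking interactions:" in line:
                    break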
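start_end_points("foo.txt")
This way only one buffer is held in memory, each line is processed once, and reading stops as soon as the "pi-pi stacking interactions:" line is reached.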
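For a web tool it may be handier to get the lines back instead of printing them. The following is a minimal sketch of the same single-pass idea written as a generator. The name iter_block and its two marker parameters are assumptions made for illustration, and unlike the code above it stops before the end-marker line rather than including it.

```python
def iter_block(file_name,
               start_marker="Hydrogen bonds:",
               end_marker="pi-pi stacking interactions:"):
    """Yield lines from the start marker up to, but not including, the end marker.

    iter_block and its parameters are hypothetical names, not part of the
    original answer.
    """
    with open(file_name) as f:
        found = False
        for line in f:
            if not found:
                if start_marker in line:
                    found = True      # start collecting from this line
                else:
                    continue          # still before the block, skip
            elif end_marker in line:
                return                # end of block: stop reading the file
            yield line.rstrip("\n")
```

Because it is a generator, the caller decides what to do with the lines, for example list(iter_block("foo.txt")) to collect them or a for loop to stream them into a response.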
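Posted on 2017-03-01 10:47:48
You don't even need to store all the lines!
The with statement closes the file automatically, even if an error occurs, so it is both safe and convenient.
Note the two options below: if efficiency is what matters, choose the first one.
I suggest returning the lines rather than printing them. You may have other uses for them, and you can still print the result later without running the whole function again.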
def start_end_points(file_name):
    # Option 1: stream the file line by line (memory-efficient)
    wanted_text = ""
    with open(file_name, "r") as f:
        found = False
        for line in f:
            if found:
                if "pi-pi stacking interactions:" in line:
                    break
                wanted_text += line
            elif "Hydrogen bonds:" in line:
                wanted_text += line
                found = True
    return wanted_text

def start_end_points_in_memory(file_name):
    # Option 2: read everything first (uses more memory, but more compact).
    # The function name is only illustrative, to keep the two options apart.
    with open(file_name, "r") as f:
        all_lines = f.read().split('\n')
    numbers = [i for i, line in enumerate(all_lines)
               if "Hydrogen bonds:" in line or "pi-pi stacking interactions:" in line]
    # Returns a list of lines, from the first marker up to (excluding) the second
    return all_lines[numbers[0]:numbers[1]]
data = start_end_points("foo.txt")
Posted on 2017-03-01 10:57:48
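I think this is more efficient because you can iterate over f directly, which saves the list conversion lines = f.readlines(). In addition, this code passes over the data only once (using two while loops), whereas your code uses two for loops, both of which run to the end of the file.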
from pprint import pprint

def start_end_points(file_name):
    # next(f) raises StopIteration if the file ends before a marker is found
    with open(file_name) as f:
        single_line = next(f)
        # skip everything before the start marker
        while "Hydrogen bonds:" not in single_line:
            single_line = next(f)
        result = []
        # collect lines until the end marker
        while "pi-pi stacking interactions:" not in single_line:
            result.append(single_line.rstrip())
            single_line = next(f)
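    pprint(result)
One caveat: the file can still be modified after you have opened it, so the lines read inside the while loops may not be the ones you expected when you opened f.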
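As a side note, the two while loops above can also be expressed with itertools.dropwhile and itertools.takewhile. This is only a sketch of an alternative, not the code from this answer: the name extract_block and its parameters are assumptions for illustration. One difference is that a missing marker does not raise StopIteration here; a missing start marker gives an empty list, and a missing end marker collects lines to the end of the file.

```python
from itertools import dropwhile, takewhile


def extract_block(file_name,
                  start_marker="Hydrogen bonds:",
                  end_marker="pi-pi stacking interactions:"):
    """Return the stripped lines from start_marker up to (excluding) end_marker.

    extract_block and its parameters are hypothetical names used for this sketch.
    """
    with open(file_name) as f:
        # Skip lines until the first one that contains the start marker.
        from_start = dropwhile(lambda line: start_marker not in line, f)
        # Keep lines until the first one that contains the end marker.
        block = takewhile(lambda line: end_marker not in line, from_start)
        # Consume the lazy iterators while the file is still open.
        return [line.rstrip() for line in block]
```

On the sample file it collects the same lines as the while-loop version.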
Output:
['Hydrogen bonds:',
' Location of Donor | Sidechain/Backbone | Secondary Structure | Count',
' -------------------|--------------------|---------------------|-------',
' LIGAND | SIDECHAIN | OTHER | 1',
'',
' RECEPTOR | BACKBONE | BETA | 1',
'',
'Raw data:',
' ATP(1):O2A(9) - A:ILE(61):HN(93) - A:ILE(61):N(92)',
'',
'Hydrophobic contacts (C-C):',
' Sidechain/Backbone | Secondary Structure | Count',
' --------------------|---------------------|-------',
' SIDECHAIN | OTHER | 2',
' SIDECHAIN | BETA | 23',
'',
'Raw data:',
' ATP(1):C39(3) - A:TYR(58):CD2(67)',
' ATP(1):C39(3) - A:TYR(58):CE2(69)',
' ATP(1):C59(6) - A:ILE(61):CD1(100)',
' ATP(1):C59(6) - A:LYS(87):CE(344)',
' ATP(1):C4(23) - A:PHE(209):CD1(1562)',
' ATP(1):C4(23) - A:PHE(209):CE1(1564)',
' ATP(1):C2(26) - A:PHE(209):CD2(1563)',
' ATP(1):C6(28) - A:PHE(209):CB(1560)',
' ATP(1):C6(28) - A:PHE(209):CG(1561)',
' ATP(1):C6(28) - A:PHE(209):CD1(1562)',
' ATP(1):C6(28) - A:VAL(286):CG2(2266)',
'']
https://stackoverflow.com/questions/42528774