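I have an Excel file with 3 columns of data. To make the data easier to parse, I copied it into Notepad and saved it as a tab-delimited .txt instead of .xls. I want to sort all rows by the first number of the numeric interval, and by the word "complement" when it appears before the interval. My data looks like this (some rows have an empty 3rd column: rows 1, 2, 3, 5, 7, 8, 9, 15):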
81228..81500 Gene 22
81500..81856 Gene 17
complement(82256..84292) Gene 75
84307..85275 Gene 23 2.7.4.8
complement(97435..98067) Gene 91
complement(85325..86527) Gene 34 3.5.1.32
86756..87025 Gene 36
complement(92373..93341) Gene 45
88076..90292 Gene 34
complement(90289..92415) Gene 89 3.6.1.-
93492..94931 Gene 92 2.2.1.1
complement(97087..97347) Gene 12 2.7.7.6
complement(94928..97060) Gene 58 2.5.6.3, 3.1.8.2
79951..81162 Gene 3 1.7.1.2
complement(87022..87837) Gene 77
10..1008 Gene 500
100059..100178 Gene 501
100470..104660 Gene 502 6.3.5.5
100715..100861 Gene 503
101721..103604 Gene 504
101782..103698 Gene 505 1.1.1.49
1018..1137 Gene 506
10230..11636 Gene 507 6.3.1.2
102328..104049 Gene 508
10321..12651 Gene 509 2.3.1.16, 2.3.1.9
103256..104290 Gene 510
103607..104647 Gene 511
103658..104662 Gene 512 4.1.3.16, 4.1.2.14
103732..106095 Gene 513
104045..106027 Gene 514
104057..104305 Gene 515
10416..14087 Gene 516
104237..105262 Gene 517 3.1.11.2
My expected output:
79951..81162 Gene 3 1.7.1.2
81228..81500 Gene 22
81500..81856 Gene 17
84307..85275 Gene 23 2.7.4.8
86756..87025 Gene 36
88076..90292 Gene 34
93492..94931 Gene 92 2.2.1.1
complement(82256..84292) Gene 75
complement(85325..86527) Gene 34 3.5.1.32
complement(87022..87837) Gene 77
complement(90289..92415) Gene 89 3.6.1.-
complement(92373..93341) Gene 45
complement(94928..97060) Gene 58 2.5.6.3, 3.1.8.2
complement(97087..97347) Gene 12 2.7.7.6
complement(97435..98067) Gene 91
My attempt in Python so far:
import re
import sys
#import csv
pattern = '^complement\(\d+\.{2}\d+\)$'
#pattern = '^complement'
regexp = re.compile(pattern)
input_file = open('infile.txt', 'r')
output_file = open('outfile.txt','w')
for line in input_file:
    item = line[0]
    match = regexp.search(item)
    if match:
        output_file.writerow([line[0],\t,line1[1],\t, line[2]])
        #output_file.writerow(line[0])
#del line[0], line[1], line[2], item
del output_file
I don't know where to go from here. Can anyone help?
Answered on 2014-06-15 15:00:41
Something like this works:
txt='''\
81228..81500 Gene 22
81500..81856 Gene 17
complement(82256..84292) Gene 75
84307..85275 Gene 23 2.7.4.8
complement(97435..98067) Gene 91
complement(85325..86527) Gene 34 3.5.1.32
86756..87025 Gene 36
complement(92373..93341) Gene 45
88076..90292 Gene 34
complement(90289..92415) Gene 89 3.6.1.-
93492..94931 Gene 92 2.2.1.1
complement(97087..97347) Gene 12 2.7.7.6
complement(94928..97060) Gene 58 2.5.6.3, 3.1.8.2
79951..81162 Gene 3 1.7.1.2
complement(87022..87837) Gene 77 '''
import re
lines=txt.splitlines()
print('\n'.join(sorted(lines, key=lambda s: re.search(r'^((?:\d+)|(?:complement\(\d+))', s).group(1))))
Or, if the file really is tab-separated, you can drop the regex and do:
txt='''\
81228..81500\tGene 22
81500..81856\tGene 17
complement(82256..84292)\tGene 75
84307..85275\tGene 23\t2.7.4.8
complement(97435..98067)\tGene 91
complement(85325..86527)\tGene 34\t3.5.1.32
86756..87025\tGene 36
complement(92373..93341)\tGene 45
88076..90292\tGene 34
complement(90289..92415)\tGene 89\t3.6.1.-
93492..94931\tGene 92\t2.2.1.1
complement(97087..97347)\tGene 12\t2.7.7.6
complement(94928..97060)\tGene 58\t2.5.6.3, 3.1.8.2
79951..81162\tGene 3\t1.7.1.2
complement(87022..87837)\tGene 77 '''
lines=txt.splitlines()
print('\n'.join(sorted(lines, key=lambda s: s.split('\t',1)[0])))
But since you are sorting on the first element anyway, there is no need to split at all:
print('\n'.join(sorted(lines)))
Any of the above prints:
79951..81162 Gene 3 1.7.1.2
81228..81500 Gene 22
81500..81856 Gene 17
84307..85275 Gene 23 2.7.4.8
86756..87025 Gene 36
88076..90292 Gene 34
93492..94931 Gene 92 2.2.1.1
complement(82256..84292) Gene 75
complement(85325..86527) Gene 34 3.5.1.32
complement(87022..87837) Gene 77
complement(90289..92415) Gene 89 3.6.1.-
complement(92373..93341) Gene 45
complement(94928..97060) Gene 58 2.5.6.3, 3.1.8.2
complement(97087..97347) Gene 12 2.7.7.6
complement(97435..98067) Gene 91
One of your comments says that you want the magnitude of the number to count, not just lexicographic order.
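To illustrate the difference, here is a small sketch using three first-column values from the question's data. The names `starts` and `start_of` are made up for this example.

```python
# Three first-column values taken from the question's data.
starts = ['81228..81500', '10..1008', '100059..100178']

# Plain string sorting compares character by character, so the
# six-digit '100059' lands before the five-digit '81228'.
lexicographic = sorted(starts)

def start_of(interval):
    # Integer value of the number before the '..' separator.
    return int(interval.split('..')[0])

# Sorting on the integer start puts the intervals in numeric order.
numeric = sorted(starts, key=start_of)

print(lexicographic)  # ['10..1008', '100059..100178', '81228..81500']
print(numeric)        # ['10..1008', '81228..81500', '100059..100178']
```

The second result is the ordering by magnitude.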
You can use natural sort order to achieve this:
import re
data='''\
81728..81500 Gene 22
81500..81856 Gene 17
complement(82256..84292) Gene 75
812..815 Gene 3 num
complement(822..842) Gene compliment 3 num75
811..815 Gene 3 num
'''
def alpha_num_sort(li):
    def convert(s):
        return int(s) if s.isdigit() else s
    def key_func(key):
        # split into alternating text and digit runs; digit runs compare as integers
        return tuple(convert(c) for c in re.split('([0-9]+)', key))
    return sorted(li, key = key_func)
print('\n'.join(alpha_num_sort(data.splitlines())))
Prints:
811..815 Gene 3 num
812..815 Gene 3 num
81500..81856 Gene 17
81728..81500 Gene 22
complement(822..842) Gene compliment 3 num75
complement(82256..84292) Gene 75
Answered on 2014-06-15 15:01:47
This should work fine, with no need to create a txt file. Just import csv:
In [23]: reader = csv.reader(open('tets.xls','rb'),delimiter='\t')
In [24]: f = list(reader)
In [25]: f #original file
Out[25]:
[['81228..81500', 'Gene', '22 '],
['81500..81856', 'Gene', '17 '],
['complement(82256..84292)', 'Gene', '75 '],
['84307..85275', 'Genne', '23', '2.7.4.8'],
['complement(97435..98067)', 'Gene', '91 '],
['complement(85325..86527)', 'Gene', '34', '3.5.1.32'],
['86756..87025', 'Gene', '36 '],
['complement(92373..93341)', 'Gene', '45 '],
['88076..90292', 'Gene', '34 '],
['complement(90289..92415)', 'Gene 89', '3.6.1.-'],
['93492..94931', 'Genne', '92', '2.2.1.1'],
['complement(97087..97347)', 'Gene', '12', '2.7.7.6'],
['complement(94928..97060)', 'Gene', '58', '2.5.6.3, 3.1.8.2'],
['79951..81162', 'Gene', '3', '1.7.1.2'],
['complement(87022..87837)', 'Gene', '77 ']]
In [26]: f.sort(key=lambda x: x[0])
In [27]: f #after sorting
Out[27]:
[['79951..81162', 'Gene', '3', '1.7.1.2'],
['81228..81500', 'Gene', '22 '],
['81500..81856', 'Gene', '17 '],
['84307..85275', 'Genne', '23', '2.7.4.8'],
['86756..87025', 'Gene', '36 '],
['88076..90292', 'Gene', '34 '],
['93492..94931', 'Genne', '92', '2.2.1.1'],
['complement(82256..84292)', 'Gene', '75 '],
['complement(85325..86527)', 'Gene', '34', '3.5.1.32'],
['complement(87022..87837)', 'Gene', '77 '],
['complement(90289..92415)', 'Gene 89', '3.6.1.-'],
['complement(92373..93341)', 'Gene', '45 '],
['complement(94928..97060)', 'Gene', '58', '2.5.6.3, 3.1.8.2'],
['complement(97087..97347)', 'Gene', '12', '2.7.7.
Answered on 2014-06-15 15:13:50
My two cents:
import csv
# get a list of lines in the file
with open('in.txt') as in_file:
    reader = csv.reader(in_file, delimiter = '\t')
    lines = [line for line in reader]
lines.sort()
# join the sorted rows back into tab-separated text
lines = ('\n'.join(('\t'.join(line) for line in lines)))
with open('out.txt', 'w') as out_file:
    out_file.write(lines)
To sort on the relevant values instead, use a key function that extracts them, and replace lines.sort() above with:
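def key_func(line):
    '''extract sorting parameters
    line --> list
    returns tuple
    '''
    key = line[0]
    complement = 'complement' in key
    # extract the interval from inside the parentheses, if there are any
    # (slicing unconditionally would cut the last digit off plain intervals)
    if complement:
        key = key[key.find('(') + 1 : key.find(')')]
    # turn the interval into a list of 2 integers
    # (list() is needed on Python 3, where map returns an iterator)
    key = list(map(int, key.split('..')))
    # return a tuple to sort on
    return complement, key
lines.sort(key = key_func)
key_func should return a tuple of the values you want to sort on, in order of importance. If your actual data exactly matches the data you posted, this will sort it the way you need. If it does not, you will need to modify key_func.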
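Putting the pieces together, a self-contained sketch for Python 3 might look like the following. It assumes the first column is always either `start..end` or `complement(start..end)`. The names `INTERVAL` and `sort_key` and the regular expression are illustrative and not part of the code above. The rows are a handful taken from the question's data.

```python
import re

# Assumed shape of the first column: 'start..end', optionally
# wrapped as 'complement(start..end)'.
INTERVAL = re.compile(r'^(complement\()?(\d+)\.\.(\d+)\)?$')

def sort_key(row):
    """Return (is_complement, start, end) for a row given as a list of columns."""
    match = INTERVAL.match(row[0].strip())
    if match is None:
        raise ValueError('unexpected interval: %r' % row[0])
    is_complement = match.group(1) is not None
    return (is_complement, int(match.group(2)), int(match.group(3)))

rows = [
    ['complement(97435..98067)', 'Gene 91'],
    ['81228..81500', 'Gene 22'],
    ['100059..100178', 'Gene 501'],
    ['complement(82256..84292)', 'Gene 75'],
    ['10..1008', 'Gene 500'],
    ['complement(85325..86527)', 'Gene 34', '3.5.1.32'],
]

# False sorts before True, so plain intervals come first; within each
# group the rows are ordered by integer start, then by integer end.
rows.sort(key=sort_key)

for row in rows:
    print('\t'.join(row))
```

Raising on an unexpected first column is a deliberate choice in this sketch, so that malformed rows are noticed instead of being sorted into an arbitrary position.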
https://stackoverflow.com/questions/24230536