Sort all lines based on the content of one line

Stack Overflow user
Asked 2014-06-15 14:25:22
Answers 3 · Views 99 · Followers 0 · Votes 2

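I have an Excel file with 3 columns of data. To make parsing easier, I copied the data into TextPad and saved it as .txt (tab-delimited) instead of .xls. I want to sort all lines by the first number of the numeric interval, and by the word "complement" if it appears before the interval. My data looks like this (some rows have an empty 3rd column: rows 1, 2, 3, 5, 7, 8, 9, 15):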

81228..81500    Gene 22 
81500..81856    Gene 17 
complement(82256..84292)    Gene 75 
84307..85275    Gene 23  2.7.4.8
complement(97435..98067)    Gene 91 
complement(85325..86527)    Gene 34 3.5.1.32
86756..87025    Gene 36 
complement(92373..93341)    Gene 45 
88076..90292    Gene 34 
complement(90289..92415)    Gene 89  3.6.1.-
93492..94931    Gene 92  2.2.1.1
complement(97087..97347)    Gene 12  2.7.7.6
complement(94928..97060)    Gene 58  2.5.6.3, 3.1.8.2
79951..81162    Gene 3   1.7.1.2
complement(87022..87837)    Gene 77
10..1008    Gene 500    
100059..100178  Gene 501    
100470..104660  Gene 502     6.3.5.5
100715..100861  Gene 503    
101721..103604  Gene 504    
101782..103698  Gene 505     1.1.1.49
1018..1137  Gene 506    
10230..11636    Gene 507     6.3.1.2
102328..104049  Gene 508    
10321..12651    Gene 509     2.3.1.16, 2.3.1.9
103256..104290  Gene 510    
103607..104647  Gene 511    
103658..104662  Gene 512     4.1.3.16, 4.1.2.14
103732..106095  Gene 513    
104045..106027  Gene 514    
104057..104305  Gene 515    
10416..14087    Gene 516    
104237..105262  Gene 517     3.1.11.2   

My expected output:

79951..81162    Gene 3   1.7.1.2
81228..81500    Gene 22 
81500..81856    Gene 17 
84307..85275    Gene 23  2.7.4.8
86756..87025    Gene 36 
88076..90292    Gene 34 
93492..94931    Gene 92  2.2.1.1
complement(82256..84292)    Gene 75 
complement(85325..86527)    Gene 34 3.5.1.32
complement(87022..87837)    Gene 77 
complement(90289..92415)    Gene 89  3.6.1.-
complement(92373..93341)    Gene 45 
complement(94928..97060)    Gene 58  2.5.6.3, 3.1.8.2
complement(97087..97347)    Gene 12  2.7.7.6
complement(97435..98067)    Gene 91 

My attempt in Python is as follows:

import re
import sys
#import csv


pattern = '^complement\(\d+\.{2}\d+\)$'

#pattern = '^complement'

regexp = re.compile(pattern)

input_file = open('infile.txt', 'r')

output_file = open('outfile.txt','w')

for line in input_file:
    item = line[0]
    match = regexp.search(item)
    if match:
              output_file.writerow([line[0],\t,line1[1],\t, line[2]])
      #output_file.writerow(line[0])

#del line[0], line[1], line[2], item

del output_file

I don't know how to proceed from here. Can anyone help?

3 Answers

Stack Overflow user

Accepted answer

Posted 2014-06-15 15:00:41

Something like this works:

txt='''\
81228..81500    Gene 22 
81500..81856    Gene 17 
complement(82256..84292)    Gene 75 
84307..85275    Gene 23  2.7.4.8
complement(97435..98067)    Gene 91 
complement(85325..86527)    Gene 34 3.5.1.32
86756..87025    Gene 36 
complement(92373..93341)    Gene 45 
88076..90292    Gene 34 
complement(90289..92415)    Gene 89  3.6.1.-
93492..94931    Gene 92  2.2.1.1
complement(97087..97347)    Gene 12  2.7.7.6
complement(94928..97060)    Gene 58  2.5.6.3, 3.1.8.2
79951..81162    Gene 3   1.7.1.2
complement(87022..87837)    Gene 77 '''

import re

lines=txt.splitlines()

print('\n'.join(sorted(lines, key=lambda s: re.search(r'^((?:\d+)|(?:complement\(\d+))', s).group(1))))

Or, if the data really is tab-separated, you can drop the regex and do this:

txt='''\
81228..81500\tGene 22 
81500..81856\tGene 17 
complement(82256..84292)\tGene 75 
84307..85275\tGene 23\t2.7.4.8
complement(97435..98067)\tGene 91 
complement(85325..86527)\tGene 34\t3.5.1.32
86756..87025\tGene 36 
complement(92373..93341)\tGene 45 
88076..90292\tGene 34 
complement(90289..92415)\tGene 89\t3.6.1.-
93492..94931\tGene 92\t2.2.1.1
complement(97087..97347)\tGene 12\t2.7.7.6
complement(94928..97060)\tGene 58\t2.5.6.3, 3.1.8.2
79951..81162\tGene 3\t1.7.1.2
complement(87022..87837)\tGene 77 '''

lines=txt.splitlines()
print('\n'.join(sorted(lines, key=lambda s: s.split('\t',1)[0])))

But since you are sorting on the first element anyway, there is no need to split at all:

print('\n'.join(sorted(lines)))

Any of the above prints:

79951..81162    Gene 3   1.7.1.2
81228..81500    Gene 22 
81500..81856    Gene 17 
84307..85275    Gene 23  2.7.4.8
86756..87025    Gene 36 
88076..90292    Gene 34 
93492..94931    Gene 92  2.2.1.1
complement(82256..84292)    Gene 75 
complement(85325..86527)    Gene 34 3.5.1.32
complement(87022..87837)    Gene 77 
complement(90289..92415)    Gene 89  3.6.1.-
complement(92373..93341)    Gene 45 
complement(94928..97060)    Gene 58  2.5.6.3, 3.1.8.2
complement(97087..97347)    Gene 12  2.7.7.6
complement(97435..98067)    Gene 91 
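
One caveat: plain string sorting gives the expected order here only because every start coordinate in this sample has five digits. A minimal sketch of what happens when the digit counts differ, using a few intervals from the question's longer sample (the variable names are only illustrative):

```python
# Four first-column values taken from the question's data.
rows = ["79951..81162", "10..1008", "100059..100178", "1018..1137"]

# Plain string comparison goes character by character,
# so "100059" sorts before "1018" ('0' < '1' at the third character).
lexicographic = sorted(rows)

# Comparing the integer start coordinate instead.
by_start = sorted(rows, key=lambda s: int(s.split("..")[0]))

print(lexicographic)
print(by_start)
```

With these rows the two orders differ: the string sort places `100059..100178` ahead of `1018..1137`, while the numeric key places it last.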

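One of your comments says that you want the magnitude of the number to count, not just lexicographic order.

You can use natural sort order to achieve that: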

import re

data='''\
81728..81500    Gene 22 
81500..81856    Gene 17 
complement(82256..84292)    Gene 75 
812..815    Gene 3 num
complement(822..842)    Gene compliment 3 num75  
811..815    Gene 3 num
'''

def alpha_num_sort(li): 
    def convert(s):
        return int(s) if s.isdigit() else s

    def key_func(key):
        return tuple(convert(c) for c in re.split('([0-9]+)', key))

    return sorted(li, key = key_func)

print('\n'.join(alpha_num_sort(data.splitlines())))

Prints:

811..815    Gene 3 num
812..815    Gene 3 num
81500..81856    Gene 17 
81728..81500    Gene 22 
complement(822..842)    Gene compliment 3 num75  
complement(82256..84292)    Gene 75 
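
If the goal is exactly the ordering described in the question (plain intervals first, then complement intervals, each group ordered by numeric start), an explicit key is an alternative to a general natural sort. The following is only a sketch under that assumption; `INTERVAL` and `interval_key` are hypothetical names, and the sample lines are taken from the question's data with a tab assumed as the separator:

```python
import re

# Optional "complement(" prefix followed by "start..end" at the start of a line.
INTERVAL = re.compile(r"^(complement\()?(\d+)\.\.(\d+)")

def interval_key(line):
    """Return (is_complement, start, end) for one line (hypothetical helper)."""
    m = INTERVAL.match(line)
    if m is None:
        raise ValueError("unrecognised interval: %r" % line)
    # False sorts before True, so plain intervals come first.
    return (m.group(1) is not None, int(m.group(2)), int(m.group(3)))

sample = [
    "complement(97435..98067)\tGene 91",
    "81228..81500\tGene 22",
    "10..1008\tGene 500",
    "complement(82256..84292)\tGene 75",
    "100059..100178\tGene 501",
]

ordered = sorted(sample, key=interval_key)
print("\n".join(ordered))
```

Raising on an unrecognised line is a design choice for the sketch; real data with headers or blank lines would need those filtered out first.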
Votes 1

Stack Overflow user

Posted 2014-06-15 15:01:47

This should work fine without creating a txt file. Just import csv:

In [23]: reader = csv.reader(open('tets.xls','rb'),delimiter='\t')

In [24]: f = list(reader)

In [25]: f #original file
Out[25]: 
[['81228..81500', 'Gene', '22 '],
 ['81500..81856', 'Gene', '17 '],
 ['complement(82256..84292)', 'Gene', '75 '],
 ['84307..85275', 'Genne', '23', '2.7.4.8'],
 ['complement(97435..98067)', 'Gene', '91 '],
 ['complement(85325..86527)', 'Gene', '34', '3.5.1.32'],
 ['86756..87025', 'Gene', '36 '],
 ['complement(92373..93341)', 'Gene', '45 '],
 ['88076..90292', 'Gene', '34 '],
 ['complement(90289..92415)', 'Gene 89', '3.6.1.-'],
 ['93492..94931', 'Genne', '92', '2.2.1.1'],
 ['complement(97087..97347)', 'Gene', '12', '2.7.7.6'],
 ['complement(94928..97060)', 'Gene', '58', '2.5.6.3, 3.1.8.2'],
 ['79951..81162', 'Gene', '3', '1.7.1.2'],
 ['complement(87022..87837)', 'Gene', '77 ']]

In [26]: f.sort(key=lambda x: x[0])

In [27]: f #after sorting
Out[27]: 
[['79951..81162', 'Gene', '3', '1.7.1.2'],
 ['81228..81500', 'Gene', '22 '],
 ['81500..81856', 'Gene', '17 '],
 ['84307..85275', 'Genne', '23', '2.7.4.8'],
 ['86756..87025', 'Gene', '36 '],
 ['88076..90292', 'Gene', '34 '],
 ['93492..94931', 'Genne', '92', '2.2.1.1'],
 ['complement(82256..84292)', 'Gene', '75 '],
 ['complement(85325..86527)', 'Gene', '34', '3.5.1.32'],
 ['complement(87022..87837)', 'Gene', '77 '],
 ['complement(90289..92415)', 'Gene 89', '3.6.1.-'],
 ['complement(92373..93341)', 'Gene', '45 '],
 ['complement(94928..97060)', 'Gene', '58', '2.5.6.3, 3.1.8.2'],
 ['complement(97087..97347)', 'Gene', '12', '2.7.7.
Votes 2

Stack Overflow user

Posted 2014-06-15 15:13:50

My two cents:

import csv
# get a list of lines in the file
with open('in.txt') as in_file:
    reader = csv.reader(in_file, delimiter = '\t')
    lines = [line for line in reader]

lines.sort()
# build the output text: fields joined by tabs, rows joined by newlines
lines = ('\n'.join(('\t'.join(line) for line in lines)))

with open('out.txt', 'w') as out_file:
    out_file.write(lines)

Use a key function to extract the relevant sort parameters. Replace lines.sort() above with:

def key_func(line):
    '''extract sorting parameters

    line --> list
    returns tuple
    '''
    key = line[0]
    complement = 'complement' in key
    # extract the interval; plain intervals have no parentheses to strip
    if complement:
        key = key[key.find('(') + 1 : key.find(')')]
    # turn the interval into a tuple of 2 integers
    # (a bare map object cannot be compared in Python 3)
    key = tuple(map(int, key.split('..')))
    # return a tuple to sort on
    return complement, key


lines.sort(key = key_func)

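key_func should return a tuple of the values you want to sort by, in order of importance. If your actual data exactly matches the data you posted, this will sort it the way you need. If the actual data does not exactly match the posted data, you will need to modify key_func.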
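
To illustrate what "in order of importance" means: Python compares tuples element by element, so the first element decides and later elements only break ties. A small sketch with hand-written keys shaped like the ones key_func returns (the values are copied from a few of the sample intervals; the variable names are only illustrative):

```python
# (is_complement, (start, end)) pairs for four of the sample rows.
keys = [
    (True, (82256, 84292)),
    (False, (84307, 85275)),
    (False, (79951, 81162)),
    (True, (85325, 86527)),
]

# Complement flag first, interval second: all plain intervals come first.
by_flag_then_interval = sorted(keys)

# Interval first, complement flag second: the two groups interleave.
by_interval_then_flag = sorted(keys, key=lambda k: (k[1], k[0]))

print(by_flag_then_interval)
print(by_interval_then_flag)
```

Swapping the positions inside the returned tuple is therefore all it takes to change which criterion wins.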

Votes 1
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/24230536