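I have a two-column file, input.txt. I want to split the second column on ";", transpose the unique entries, and then count and list the matching entries from column 1.
This is my tab-delimited input.txt file: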
Gene Biological_Process
BALF2 metabolic process
CHD4 cell organization and biogenesis;metabolic process;regulation of biological process
TCOF1 cell organization and biogenesis;regulation of biological process;transport
TOP1 cell death;cell division;cell organization and biogenesis;metabolic process;regulation of biological process;response to stimulus
BcLF1 0
BALF5 metabolic process
MTA2 cell organization and biogenesis;metabolic process;regulation of biological process
MSH6 cell organization and biogenesis;metabolic process;regulation of biological process;response to stimulus
My expected output1:
Biological_Process Gene
metabolic process BALF2 CHD4 TOP1 BALF5 MTA2 MSH6
cell organization and biogenesis CHD4 TCOF1 TOP1 MTA2 MSH6
regulation of biological process CHD4 TCOF1 TOP1 MTA2 MSH6
transport TCOF1
cell death TOP1
cell division TOP1
response to stimulus TOP1 MSH6
Posted on 2022-11-11 21:06:17
$ cat script.awk
#! /usr/bin/awk -f
BEGIN {
    FS = "[\t;]";   # the separator can be a regex: split on tab or ";"
    OFS = "\t"
}
NR>1 && /^[A-Z]/ {              # skip header & blank lines
    for (i = NF; i > 1; i--)
        if ($i)                 # skip empty or "0" bio-proc entries
            a[$i] = a[$i] OFS $1
}
END {
    print "Biological_Process", "Gene(s)"
    for (x in a)                # note: iteration order is unspecified
        print x a[x]
}
$ ./script.awk input.dat
Biological_Process Gene(s)
cell death TOP1
regulation of biological process CHD4 TCOF1 TOP1 MTA2 MSH6
transport TCOF1
cell division TOP1
metabolic process BALF2 CHD4 TOP1 BALF5 MTA2 MSH6
response to stimulus TOP1 MSH6
cell organization and biogenesis CHD4 TCOF1 TOP1 MTA2 MSH6
Posted on 2022-11-09 05:29:06
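You need to parse all the data first. Start with an empty dictionary, then open your file and iterate over each line (the header can be handled like any other row). For each entry in the columns after the first, use string methods such as split and strip to turn the entry into a dictionary key whose value is the list of column-0 entries (dict[process] = [gene, ...]). Then print or write out each pair from the dict's .items():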
input.txt
gene process
A cell org bio
B cell bio
C 0
D org
script.py
#!/usr/bin/env python
def main():
    pros = {}  # process -> list of genes
    with open("input.txt", "r") as ifile:
        for line in ifile:
            cols = line.strip().split()  # whitespace-separated columns
            if len(cols) >= 1:
                # every column after the first is a process for this gene
                for pro in cols[1:]:
                    if pro not in pros:
                        pros[pro] = []
                    pros[pro] += [cols[0]]
    with open("output.txt", "w") as ofile:
        for key, val in pros.items():
            ofile.write(f'{key}\t' + '\t'.join(val) + '\n')

if __name__ == "__main__":
    main()
Run:
$ chmod +x ./script.py
$ ./script.py
$ cat ./output.txt
output.txt
process gene
cell A B
org A D
bio A B
0 C
https://stackoverflow.com/questions/74369833
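The Python answer above splits on whitespace and uses a simplified input, so it would not work unchanged on the question's file, where process names contain spaces and are separated by ";". The following is only a sketch of how the same dictionary-of-lists idea might be adapted to that format, with a count added per process. The function name `group_genes` and the inline `sample` rows are hypothetical, and treating "0" as "no annotation" is an assumption based on the expected output in the question.

```python
#!/usr/bin/env python3
from collections import defaultdict


def group_genes(lines):
    """Map each biological process to the genes annotated with it.

    Assumes tab-separated rows: gene, then ";"-separated processes.
    """
    groups = defaultdict(list)
    rows = iter(lines)
    next(rows, None)  # skip the header row
    for line in rows:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        gene, _, processes = line.partition("\t")
        for process in processes.split(";"):
            process = process.strip()
            # Assumption: "0" marks a gene with no annotated process.
            if process and process != "0" and gene not in groups[process]:
                groups[process].append(gene)
    return groups


# Hypothetical sample rows in the same layout as the question's input.txt.
sample = [
    "Gene\tBiological_Process\n",
    "BALF2\tmetabolic process\n",
    "CHD4\tcell organization and biogenesis;metabolic process\n",
    "BcLF1\t0\n",
]

# Print: process, number of genes, then the genes, all tab-separated.
for process, genes in group_genes(sample).items():
    print(process, len(genes), *genes, sep="\t")
```

To run it on the real data, the list could be replaced by an open file handle, for example `with open("input.txt") as f: groups = group_genes(f)`. Because a plain dict keeps insertion order, processes come out in the order they are first seen, unlike the unordered `for (x in a)` loop in the awk answer.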