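To prepare for a principal component analysis (PCA), I want to convert a two-column file into a table of 0s and 1s. The input file has bacteria names in the first column and bacteria descriptors in the second column.
Possible approach: store the input file in a hash, then run some kind of "uniq" command on each column and add the results to the output file as row and column labels. Then, for each combination in the output file, write a 1 if the bacteria name and the descriptor occur together in the input-file hash, and a 0 otherwise.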
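The approach described above can be sketched in Python as follows. This is only a minimal illustration, assuming tab-separated input with exactly two fields per line and Python 3.7+ (dicts are used to keep the unique values in order of first appearance). The function name `presence_table` and the `SAMPLE` list are hypothetical names introduced for this sketch.

```python
def presence_table(lines):
    """Build a tab-separated 0/1 table from 'name<TAB>descriptor' lines."""
    names = {}        # unique bacteria names, in order of first appearance
    descriptors = {}  # unique descriptors, in order of first appearance
    pairs = set()     # (name, descriptor) combinations present in the input
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        name, descriptor = line.split("\t", 1)
        names.setdefault(name, None)
        descriptors.setdefault(descriptor, None)
        pairs.add((name, descriptor))
    # Header row: an empty first cell, then one column per descriptor.
    rows = ["\t" + "\t".join(descriptors)]
    for name in names:
        cells = ["1" if (name, d) in pairs else "0" for d in descriptors]
        rows.append(name + "\t" + "\t".join(cells))
    return "\n".join(rows)


# Hypothetical sample data mirroring the input file in the question.
SAMPLE = [
    "bacteria_1\tprotein:plasmid:149679\n",
    "bacteria_1\tprotein:proph:183386\n",
    "bacteria_2\tprotein:proph:183386\n",
    "bacteria_3\tprotein:plasmid:147856\n",
    "bacteria_3\tprotein:proph:183386\n",
]

if __name__ == "__main__":
    print(presence_table(SAMPLE))
```

Because columns and rows are kept in order of first appearance rather than sorted, the column order follows the input file instead of being alphabetical.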
Input file (tab-separated):
bacteria_1 protein:plasmid:149679
bacteria_1 protein:proph:183386
bacteria_2 protein:proph:183386
bacteria_3 protein:plasmid:147856
bacteria_3 protein:proph:183386
Expected output (tab-separated):
protein:plasmid:149679 protein:proph:183386 protein:plasmid:147856
bacteria_1 1 1 0
bacteria_2 0 1 0
bacteria_3 0 1 1

Answered 2014-02-24 15:33:57
Here is one way using GNU awk:
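awk '{
    header[$2]++;      # collect unique descriptors
    bacteria[$1]++;    # collect unique bacteria names
    map[$1,$2]++       # record each (name, descriptor) pair
}
END {
    x=asorti(header,header_s);       # sorted descriptors (asorti is gawk-only)
    for(i=1;i<=x;i++) {
        printf "\t%s\t", header_s[i]
    }
    print ""
    y=asorti(bacteria,bacteria_s);   # sorted bacteria names
    for(j=1;j<=y;j++) {
        printf "%s\t\t", bacteria_s[j];
        for (z=1;z<=x;z++) {
            printf "%s\t\t\t\t", (map[bacteria_s[j],header_s[z]])?"1":"0"
        }
        print ""
    }
}' file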
protein:plasmid:147856 protein:plasmid:149679 protein:proph:183386
bacteria_1 0 1 1
bacteria_2 0 0 1
bacteria_3 1 0 1
Here is a solution using regular awk:
awk '
!is_present[$1]++ {bacteria[++x] = $1}   # first time this name is seen
!is_present[$2]++ {protein[++y] = $2}    # first time this descriptor is seen
{map[$1,$2]++}
END {
    for(i=1; i<=y; i++) {
        printf "\t%s\t", protein[i]
    }
    print "";
    for(j=1; j<=x; j++) {
        printf "%s\t\t", bacteria[j];
        for(a=1; a<=y; a++) {
            printf "%s\t\t\t\t", (map[bacteria[j], protein[a]])?"1":"0"
        }
        print ""
    }
}' file

Answered 2014-02-24 14:24:40
A quick Python script:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 syntax (print statements with trailing commas)
import fileinput
from collections import defaultdict

output = defaultdict(list)
proteins = set()
for line in fileinput.input():
    bacteria, protein = line.strip().split()
    proteins.update([protein])
    output[bacteria].append(protein)

# Print header
print ' '*12,
for header in sorted(proteins):
    print '{:25}'.format(header),
print

# Print table
for key in output:
    print '{:12}'.format(key),
    for header in sorted(proteins):
        if header in output[key]:
            print '{:22}'.format(1),
        else:
            print '{:22}'.format(0),
    print
Output:
$ python table.py inputfile
protein:plasmid:147856 protein:plasmid:149679 protein:proph:183386
bacteria_2 0 0 1
bacteria_3 1 0 1
bacteria_1 0 1 1

https://stackoverflow.com/questions/21989763