I have two files. File A looks like this:
>MA0003.1_TFAP2A
5.4052885343e-06 5.4052885343e-06 0.999983784134 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
0.118921753043 0.383780891224 0.248648677866 0.248648677866
0.10270588744 0.308106851744 0.329728005881 0.259459254935
0.0486530020973 0.421617910964 0.427023199498 0.10270588744
>MA0004.1_Arnt
0.200009998 0.799890021996 4.99900019996e-05 4.99900019996e-05
0.949860027994 4.99900019996e-05 0.0500399920016 4.99900019996e-05
4.99900019996e-05 4.99900019996e-05 4.99900019996e-05 0.999850029994
4.99900019996e-05 4.99900019996e-05 0.999850029994 4.99900019996e-05
>MA0006.1_Arnt::Ahr
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
>MA0006.1_Arntr
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
>MA0006.1_ArntAh
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
>MA0006.1_Arnt::A
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
File B looks like this (note that fileB also contains spaces, and the last word on each line is the important one):
AP-2 TFAP2A
AXUD class 1 Arnt
AXU 2 Arnt::Ahr
AXU Arntr
AXU ArntAh
AXU Arnt::A
I want a third file that combines files A and B, so that the header names taken from file A are replaced, like this:
>AP-2
5.4052885343e-06 5.4052885343e-06 0.999983784134 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
0.118921753043 0.383780891224 0.248648677866 0.248648677866
0.10270588744 0.308106851744 0.329728005881 0.259459254935
0.0486530020973 0.421617910964 0.427023199498 0.10270588744
>AXUD class 1
0.200009998 0.799890021996 4.99900019996e-05 4.99900019996e-05
0.949860027994 4.99900019996e-05 0.0500399920016 4.99900019996e-05
4.99900019996e-05 4.99900019996e-05 4.99900019996e-05 0.999850029994
4.99900019996e-05 4.99900019996e-05 0.999850029994 4.99900019996e-05
>AXU 2
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
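4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
What I have done: I take file A and extract the second part of each header name, which is separated from the first by an underscore (_), like this:
awk '/>/' <input_for_clustering.pwm | tr '_' '\t' | awk '{print $2}' > temp
Then I check whether each of these names exists in file B and extract the matching entry, like this: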
for i in `cat temp`
do
cat fileB | awk '{ if (($2=="'$i'")) {print $1 }}'>>data_res
done
The question now is how to edit file A.
Please help.
I hope I have shown the effort and thought I put in.
Posted on 2014-09-09 07:42:24
Try this:
awk 'NR==FNR{z=$NF;$NF="";a[z]=$0;next}
/^>/{split($0,b,"_");if (b[2] in a){print ">"a[b[2]]}next}1' fileB fileA
Result:
>AP-2
5.4052885343e-06 5.4052885343e-06 0.999983784134 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
0.118921753043 0.383780891224 0.248648677866 0.248648677866
0.10270588744 0.308106851744 0.329728005881 0.259459254935
0.0486530020973 0.421617910964 0.427023199498 0.10270588744
>AXUD class 1
0.200009998 0.799890021996 4.99900019996e-05 4.99900019996e-05
0.949860027994 4.99900019996e-05 0.0500399920016 4.99900019996e-05
4.99900019996e-05 4.99900019996e-05 4.99900019996e-05 0.999850029994
4.99900019996e-05 4.99900019996e-05 0.999850029994 4.99900019996e-05
>AXU 2
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
Posted on 2014-09-09 07:44:29
I think this does what you want:
BEGIN { FS = "\t" }
NR==FNR { a[$2] = $1; next }
/^>/ { for (i in a) if ($0 ~ i "$") $0 = ">" a[i] }
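{ print $0 }
When the total record number equals the current file's record number (i.e. we are in the first file), build an array a containing the replacements. next skips the rest of the script and moves on to the next line.
For lines in the second file that start with ">", loop through the keys of a, find the one that matches, and replace the line. I have added the anchor $, so the pattern must match at the end of the line. { print $0 } prints the whole line (this can be shortened to 1).
Testing it: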
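For comparison, the array-lookup idea described above can also be written with an exact key match instead of a loop over regex patterns. This is only a sketch under stated assumptions: the function name rename_headers is made up for illustration, the mapping file is assumed to be whitespace-separated with the key in its last field (as the question describes), and the key in each header is assumed to be everything after the first underscore. Headers with no match are left unchanged.

```shell
# Hypothetical helper (not from the original answers):
#   rename_headers MAPPING_FILE MATRIX_FILE
# Assumes: MAPPING_FILE is whitespace-separated and the last field of each
# line is the lookup key; header lines in MATRIX_FILE start with ">" and
# carry the key after the first "_".
rename_headers() {
    awk '
        # First file: key = last field, value = everything before it.
        NR == FNR {
            key = $NF
            sub(/[ \t]+[^ \t]+[ \t]*$/, "")   # strip the last field and any trailing blanks
            name[key] = $0
            next
        }
        # Header lines of the second file: exact lookup of the text after the first "_".
        /^>/ {
            key = substr($0, index($0, "_") + 1)
            if (key in name) $0 = ">" name[key]
        }
        # Print every line of the second file, renamed or not.
        { print }
    ' "$1" "$2"
}
```

It could be called as rename_headers fileB fileA > fileC, where fileC is a placeholder output name. An exact lookup avoids the case where one key happens to be a suffix of another header, at the cost of assuming a fixed header layout.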
$ awk -f swap.awk replace file
>AP-2
5.4052885343e-06 5.4052885343e-06 0.999983784134 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
0.118921753043 0.383780891224 0.248648677866 0.248648677866
0.10270588744 0.308106851744 0.329728005881 0.259459254935
0.0486530020973 0.421617910964 0.427023199498 0.10270588744
>AXUD class 1
0.200009998 0.799890021996 4.99900019996e-05 4.99900019996e-05
0.949860027994 4.99900019996e-05 0.0500399920016 4.99900019996e-05
4.99900019996e-05 4.99900019996e-05 4.99900019996e-05 0.999850029994
4.99900019996e-05 4.99900019996e-05 0.999850029994 4.99900019996e-05
>AXU 2
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
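4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
Posted on 2014-09-09 10:12:26
Here is a Perl solution. It looks a little cryptic because it relies on a couple of regular expressions.
The strategy is to process FileB first and build a hash that translates the strings in FileA.
All output is sent to STDOUT.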
use strict;
use warnings;
use 5.010;
use autodie;
my %fb = do {
open my ($fh), '<', 'FileB.txt';
reverse map / ( \S+ (?: \s+ \S+ )* ) \s+ (\S+) /x, <$fh>;
};
open my ($fh), '<', 'FileA.txt';
while ( <$fh> ) {
s/^>\K[^_]*_(\S+).*/$fb{$1}/;
print;
}
Output
>AP-2
5.4052885343e-06 5.4052885343e-06 0.999983784134 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
5.4052885343e-06 0.999983784134 5.4052885343e-06 5.4052885343e-06
0.118921753043 0.383780891224 0.248648677866 0.248648677866
0.10270588744 0.308106851744 0.329728005881 0.259459254935
0.0486530020973 0.421617910964 0.427023199498 0.10270588744
>AXUD class 1
0.200009998 0.799890021996 4.99900019996e-05 4.99900019996e-05
0.949860027994 4.99900019996e-05 0.0500399920016 4.99900019996e-05
4.99900019996e-05 4.99900019996e-05 4.99900019996e-05 0.999850029994
4.99900019996e-05 4.99900019996e-05 0.999850029994 4.99900019996e-05
>AXU 2
0.125020829862 0.333319446759 0.0833611064823 0.458298616897
4.16597233794e-05 4.16597233794e-05 0.95821529745 0.0417013831028
4.16597233794e-05 0.95821529745 4.16597233794e-05 0.0417013831028
https://stackoverflow.com/questions/25738841