I have the following two files:
BC.txt
"PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
"PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
PB.txt
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10";
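c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10";
I am trying to compare column 1 of BC.txt with column 12 of PB.txt and print the matching lines from both files together. BC.txt has different values in columns 2 and 3 for the same value in column 1, so when I compare I only get output for one of the BC.txt entries. But I want all of them.
awk 'BEGIN {OFS=FS} NR==FNR {a[$1]=($2" "$3);next} $12 in a {print $0,a[$12]}' BC.txt PB.txt
Expected output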
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
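c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
I want to compare every entry of BC.txt against the entries of PB.txt, but because several BC.txt entries share the same column 1 value, my code does not work.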
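A minimal sketch of the symptom described above, assuming only the two sample BC.txt rows: an awk array holds one value per index, so an assignment like `a[$1]=...` replaces the earlier row whenever column 1 repeats. The scratch file name `overwritten.txt` is made up for this illustration.

```shell
# Recreate the two sample BC.txt rows from the question.
cat > BC.txt <<'EOF'
"PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
"PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
EOF

# Same kind of assignment as in the one-liner: one slot per key, so the
# second row replaces the first before PB.txt would ever be read.
awk '{ a[$1] = $2 " " $3 } END { for (k in a) print k, a[k] }' BC.txt > overwritten.txt
cat overwritten.txt
```

Only the row with UMI=AAGCGGCC is left in the array, which is why each PB.txt line is printed with just one of the two BC.txt entries.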
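Posted on 2019-09-08 21:08:55
If you don't care about the order of the output lines compared with the expected output in the question, just read BC.txt into memory, since it is the shorter file: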
$ cat tst.awk
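NR==FNR {
    # First file (BC.txt): keep every row per key, numbered 1..cnt[key]
    map[$1,++cnt[$1]] = $2 OFS $3
    next
}
{
    # Second file (PB.txt): print the line once per stored BC.txt row
    for (c=1; c<=cnt[$12]; c++) {
        print $0, map[$12,c]
    }
}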
$ awk -f tst.awk BC.txt PB.txt
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
But if you do care about the order:
$ cat tst.awk
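NR==FNR {
    # First file (PB.txt): keep every line per transcript ID, in input order
    map[$12,++cnt[$12]] = $0
    next
}
{
    # Second file (BC.txt): print all stored PB.txt lines for this key
    for (c=1; c<=cnt[$1]; c++) {
        print map[$1,c], $2, $3
    }
}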
$ awk -f tst.awk PB.txt BC.txt
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
Posted on 2019-09-08 21:21:32
Could you use join for this? (Provided the join columns are sorted, or you sort them on the fly with <(sort ...).)
$ join BC.txt <(awk '{print $12,$0}' PB.txt) | cut -d' ' -f 4-
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10";
Then cut whichever columns you want out of the join output?
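If the goal is the layout from the question (each PB.txt line followed by the UMI and BC fields), one possible variation, offered as a sketch rather than a recipe tested on real data, is to pass the keyed PB.txt as the first file so that join emits the PB.txt fields before the BC.txt fields, and then drop the leading key. It assumes fields separated by single spaces and inputs already sorted on the join field (trivially true for the sample, where every key is identical). The file names `PB.keyed.txt` and `joined.txt` are made up here, and a temporary file stands in for the process substitution so the sketch also runs in plain sh.

```shell
# Recreate the sample files from the question.
cat > BC.txt <<'EOF'
"PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
"PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
EOF
cat > PB.txt <<'EOF'
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10";
EOF

# Prefix each PB.txt line with its 12th field so join can use field 1 as the key.
awk '{ print $12, $0 }' PB.txt > PB.keyed.txt

# join prints: key, rest of file 1 (the PB.txt line), rest of file 2 (UMI, BC).
# cut then drops the duplicated key in front.
join PB.keyed.txt BC.txt | cut -d' ' -f 2- > joined.txt
cat joined.txt
```

For the sample this yields eight lines, with the two BC.txt entries alternating under each PB.txt line. Unsorted real data would need a sort on the key column for both inputs before the join.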
Posted on 2019-09-08 18:09:21
Could you please try the following (tested only with the samples you provided).
awk '
FNR==NR{
  a[++count]=$0
  b[count]=$12
  next
}
{
  for(i=1;i<=count;i++){
    split(a[i],array," ")
    if($1==array[12]){
      print a[i],$2,$3
    }
  }
}' PB.txt BC.txt
Explanation: here is the same code with an explanation of each step.
awk '                          ##Starting the awk program here.
FNR==NR{                       ##The condition FNR==NR is TRUE while the first file, PB.txt, is being read.
  a[++count]=$0                ##Creating an array named a, indexed by the variable count (incremented by 1 each time), whose value is the current line.
  b[count]=$12                 ##Creating an array named b, indexed by the variable count, whose value is the 12th column.
  next                         ##next skips all further statements for this line.
}
{
  for(i=1;i<=count;i++){       ##Starting a for loop from i=1 up to the value of count.
    split(a[i],array," ")      ##Splitting the value of a[i] on spaces into an array named array.
    if($1==array[12]){         ##Checking whether $1 equals array[12]; if so, do the following.
      print a[i],$2,$3         ##Printing the stored line from array a along with the 2nd and 3rd column values.
    }
  }
}' PB.txt BC.txt               ##Mentioning the input file names here.
https://stackoverflow.com/questions/57843928