Comparing two files with duplicate key values

Stack Overflow user
Asked 2019-09-08 16:54:24
3 answers, 82 views, 0 followers, score 1

I have the following two files:

BC.txt

"PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
"PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;

PB.txt

c4  PB  tr  41258945    41270445    .   +   .   g_i "PB.50262"; t_i "PB.50262.10";
c4  PB  Ex  41258945    41259026    .   +   .   g_i "PB.50262"; t_i "PB.50262.10";
c4  PB  Ex  41259626    41259754    .   +   .   g_i "PB.50262"; t_i "PB.50262.10";
c4  PB  Ex  41262664    41262814    .   +   .   g_i "PB.50262"; t_i "PB.50262.10";

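I am trying to compare column 1 of BC.txt with column 12 of PB.txt and print each matching PB.txt line together with its BC.txt entry. BC.txt has several lines with the same value in column 1 but different values in columns 2 and 3, so the comparison gives me output for only one BC.txt entry. I want all of them.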

    awk 'BEGIN {OFS=FS} NR==FNR {a[$1]=($2" "$3);next} $12 in a {print $0,a[$12]}' BC.txt PB.txt

Expected output

c4  PB  tr  41258945    41270445    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4  PB  Ex  41258945    41259026    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4  PB  Ex  41259626    41259754    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4  PB  Ex  41262664    41262814    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4  PB  tr  41258945    41270445    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4  PB  Ex  41258945    41259026    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4  PB  Ex  41259626    41259754    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4  PB  Ex  41262664    41262814    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;

I want every BC.txt entry to be matched against the PB.txt entries, but because the key values in BC.txt are identical, my code does not work.

3 answers

Stack Overflow user

Accepted answer

Posted 2019-09-08 21:08:55

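If you don't care about the order of the output lines compared with the expected output in the question, read BC.txt into memory, since it is the shorter file: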

$ cat tst.awk
NR==FNR {
    map[$1,++cnt[$1]] = $2 OFS $3
    next
}
{
    for (c=1; c<=cnt[$12]; c++) {
        print $0, map[$12,c]
    }
}

$ awk -f tst.awk BC.txt PB.txt
c4  PB  tr  41258945    41270445    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4  PB  tr  41258945    41270445    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4  PB  Ex  41258945    41259026    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4  PB  Ex  41258945    41259026    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4  PB  Ex  41259626    41259754    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4  PB  Ex  41259626    41259754    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4  PB  Ex  41262664    41262814    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4  PB  Ex  41262664    41262814    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;

But if you do care about the order, read PB.txt into memory instead:

$ cat tst.awk
NR==FNR {
    map[$12,++cnt[$12]] = $0
    next
}
{
    for (c=1; c<=cnt[$1]; c++) {
        print map[$1,c], $2, $3
    }
}

$ awk -f tst.awk PB.txt BC.txt
c4  PB  tr  41258945    41270445    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4  PB  Ex  41258945    41259026    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4  PB  Ex  41259626    41259754    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4  PB  Ex  41262664    41262814    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
c4  PB  tr  41258945    41270445    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4  PB  Ex  41258945    41259026    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4  PB  Ex  41259626    41259754    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
c4  PB  Ex  41262664    41262814    .   +   .   g_i "PB.50262"; t_i "PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;
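
For context, the one-liner in the question stores `a[$1]=...`, so each later BC.txt line with the same key replaces the earlier one, whereas the `map[$1,++cnt[$1]]` form above gives every duplicate its own slot. A minimal sketch of that difference, assuming a two-line inline sample (the shell variables `bc`, `overwritten` and `kept` are illustrative names, not part of the original answer):

```shell
# Hypothetical miniature of BC.txt: two entries sharing the same key.
bc='"PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;
"PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;'

# Plain assignment: the second line overwrites the first, so one entry survives.
overwritten=$(printf '%s\n' "$bc" |
  awk '{a[$1] = $2 OFS $3} END {n = 0; for (k in a) n++; print n}')

# Counter-based index: each duplicate is stored under (key,1), (key,2), ...
kept=$(printf '%s\n' "$bc" |
  awk '{map[$1, ++cnt[$1]] = $2 OFS $3} END {n = 0; for (k in map) n++; print n}')

echo "plain assignment stores $overwritten entry; counter-based map stores $kept entries"
```

The entries are counted with a `for (k in ...)` loop rather than `length(array)`, since not every awk implementation accepts an array argument to `length`.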
Score: 1

Stack Overflow user

Posted 2019-09-08 21:21:32

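Could you use join for this? (It works if both inputs are sorted on the join column; otherwise sort them first, e.g. with sort inside <( ).)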

$ join BC.txt <(awk '{print $12,$0}' PB.txt) | cut -d' ' -f 4-
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41259626 41259754 . + . g_i "PB.50262"; t_i "PB.50262.10";
c4 PB Ex 41262664 41262814 . + . g_i "PB.50262"; t_i "PB.50262.10";

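Then cut whichever columns you want out of the joined output? Note that with -f 4-, as above, the key and the UMI and BC fields from BC.txt are dropped.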
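
One possible way to keep the UMI and BC fields is to move them to the end of each joined line rather than cutting them off. This is only a sketch under assumptions: it uses a shortened two-line copy of PB.txt in a scratch directory, an intermediate file `PB.keyed` (a hypothetical name) instead of process substitution, and it produces single-space-separated output because join re-delimits the fields.

```shell
# Recreate small sample inputs in a scratch directory (shortened copies of the question's files).
tmp=$(mktemp -d)
printf '%s\n' \
  '"PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;' \
  '"PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;' > "$tmp/BC.txt"
printf '%s\n' \
  'c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";' \
  'c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";' > "$tmp/PB.txt"

# Prefix each PB.txt line with its key (column 12) so join can match on field 1.
awk '{print $12, $0}' "$tmp/PB.txt" > "$tmp/PB.keyed"

# join emits: key, UMI, BC, then the PB.txt fields.
# Rebuild each line from field 4 onward and append UMI and BC at the end.
joined=$(join "$tmp/BC.txt" "$tmp/PB.keyed" |
  awk '{u = $2; b = $3; out = $4
        for (i = 5; i <= NF; i++) out = out " " $i
        print out, u, b}')

printf '%s\n' "$joined"
rm -rf "$tmp"
```

With real data both files would need to be sorted on the key first; the sample here has a single key, so it is trivially in order.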

Score: 1

Stack Overflow user

Posted 2019-09-08 18:09:21

Could you please try the following (tested only with the samples you provided).

awk '
FNR==NR{
  a[++count]=$0
  b[count]=$12
  next
}
{
  for(i=1;i<=count;i++){
    split(a[i],array," ")
    if($1==array[12]){
       print a[i],$2,$3
    }
  }
}'  PB.txt BC.txt

Explanation: the same code again, with explanatory comments added.

awk '                         ##Start of the awk program.
FNR==NR{                      ##FNR==NR is true only while the first file, PB.txt, is being read.
  a[++count]=$0               ##Increment count and store the current line in array a at that index.
  b[count]=$12                ##Store the 12th column in array b at the same index (b is not used later in this script).
  next                        ##Skip the remaining statements for this line.
}
{
  for(i=1;i<=count;i++){      ##Loop over every stored PB.txt line, from i=1 to count.
    split(a[i],array," ")     ##Split a[i] on whitespace into the array named array.
    if($1==array[12]){        ##If column 1 of the current BC.txt line equals field 12 of the stored line:
       print a[i],$2,$3       ##print the stored line followed by columns 2 and 3 of the current line.
    }
  }
}'  PB.txt BC.txt             ##Input file names: PB.txt first, then BC.txt.
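
Since the 12th column of each PB.txt line is already saved while the file is read, the lookup could compare against that saved key directly instead of re-splitting every stored line for each BC.txt record. A possible variant, offered as a sketch rather than the answerer's code: it runs on shortened sample files in a scratch directory, and the array names `line` and `key` are illustrative.

```shell
# Recreate small sample inputs in a scratch directory (shortened copies of the question's files).
tmp=$(mktemp -d)
printf '%s\n' \
  'c4 PB tr 41258945 41270445 . + . g_i "PB.50262"; t_i "PB.50262.10";' \
  'c4 PB Ex 41258945 41259026 . + . g_i "PB.50262"; t_i "PB.50262.10";' > "$tmp/PB.txt"
printf '%s\n' \
  '"PB.50262.10"; UMI=AGCGGCCT; BC=TTTCAGCGCCGA;' \
  '"PB.50262.10"; UMI=AAGCGGCC; BC=TTTCAGCGCCGA;' > "$tmp/BC.txt"

result=$(awk '
FNR==NR {                 # first file (PB.txt): remember each line and its key
  line[++count] = $0
  key[count] = $12
  next
}
{                         # second file (BC.txt): compare against the saved keys
  for (i = 1; i <= count; i++)
    if ($1 == key[i])
      print line[i], $2, $3
}' "$tmp/PB.txt" "$tmp/BC.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Like the original, this still scans every stored PB.txt line for each BC.txt record, so it keeps the question's expected output order but costs time proportional to the product of the two file sizes.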
Score: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/57843928
