I have a tab-delimited file like this:
CNV_chr1_12623251_12632176 8925 3 RR123 XX
CNV_chr1_13398757_13402091 3334 4 RR123 YY
CNV_chr1_13398757_13402091 3334 4 RR224 YY
CNV_chr1_14001365_14004064 2699 1 RR123 YX
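CNV_chr1_14001365_14004064 2699 1 RR224 YX
Columns $1 and $2 stay the same. In this case I need to remove the duplicate rows by merging the values in column 4 into a single comma-separated entry, and add an extra $5 holding the number of comma-separated strings in $4. Example output looks like this: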
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
Any working solution would be appreciated.
Posted on 2016-04-27 17:28:49
Try this:
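awk '# Key already seen: append column 4 to its comma-separated list.
($1 in ar){br[$1]=br[$1]","$4; next}
# First occurrence: remember column 4, then store the row with a placeholder in its place.
{br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0}
# Swap the placeholder for the list and its count; for-in order is unspecified, so row order may differ from the input.
END{for(key in ar){c=split(br[key],s,",")
gsub("REPLACE_ME", br[key] FS c, ar[key])
print ar[key]}}' test.txt
Output: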
CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
For tab-delimited input, add -F"\t" to awk. Assigning to $4 rebuilds the row using OFS, so also set OFS to a tab to keep the output tab-separated:
awk -F"\t" -v OFS="\t" '($1 in ar){br[$1]=br[$1]","$4; next}
{br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0}
END{for(key in ar){c=split(br[key],s,",")
gsub("REPLACE_ME", br[key] FS c, ar[key])
print ar[key]}}' test.txt
and you get:
CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
https://stackoverflow.com/questions/36896390
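The answer's output rows come back in a different order than the input because awk's for (key in array) loop visits keys in an unspecified order. If input order matters, one possible variation is sketched below. It is not the original answer's code: the function name collapse_by_key and the array names are hypothetical, and it assumes exactly five tab-separated columns where rows sharing column 1 also agree on columns 2, 3 and 5.

```shell
# collapse_by_key is a hypothetical helper name, not part of the original answer.
# Assumes five tab-separated columns; rows with the same column 1 are merged.
collapse_by_key() {
  awk 'BEGIN { FS = OFS = "\t" }
  !($1 in ids) {                  # first time this key appears
    order[++n] = $1               # remember first-seen order
    lead[$1] = $1 OFS $2 OFS $3   # columns before the merged list
    rest[$1] = $5                 # column after the count
    ids[$1] = $4
    cnt[$1] = 1
    next
  }
  {                               # repeated key: extend the list
    ids[$1] = ids[$1] "," $4
    cnt[$1]++
  }
  END {
    for (i = 1; i <= n; i++) {    # replay keys in input order
      k = order[i]
      print lead[k], ids[k], cnt[k], rest[k]
    }
  }' "$@"
}

# Usage with the first three rows from the question, written with real tabs.
printf '%s\t%s\t%s\t%s\t%s\n' \
  CNV_chr1_12623251_12632176 8925 3 RR123 XX \
  CNV_chr1_13398757_13402091 3334 4 RR123 YY \
  CNV_chr1_13398757_13402091 3334 4 RR224 YY |
  collapse_by_key
```

Counting rows as they arrive avoids the placeholder substitution and the split call, and the helper also accepts file names as arguments, for example collapse_by_key test.txt.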