I have a genomic coordinates file with the following structure:
chromosome1|25000|35000_chromosome1|400|600
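chromosome4|78000|80000_chromosome2|43000|45000
I want to sort the two entries on each line: by the lower genomic coordinate first if both are on the same chromosome (e.g. line 1), or by the lower chromosome number first if they are on different chromosomes. Expected output: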
chromosome1|400|600_chromosome1|25000|35000
chromosome2|43000|45000_chromosome4|78000|80000
I tried the following, but strangely it does not always work correctly!
cat file | awk 'BEGIN{OFS="\t"}{split($1,a,"_chr"); a[2]="chr" a[2]; str=$1; if(a[1]>a[2]) str=a[2]"_"a[1]; print str,$2}'
Can anyone help? Thanks in advance!
Posted on 2020-11-25 04:01:37
Would you please try the following:
awk 'BEGIN {FS = OFS = "_"}                 # use "_" as a delimiter
{
    split($1, a, "\\|")                     # split left genomic coordinates with "|" and assign array "a"
    split($2, b, "\\|")                     # split right genomic coordinates with "|" and assign array "b"
    if (a[1] == b[1]) {                     # if they belong to the same chromosome
        if (a[2] < b[2]) print $1, $2       # then compare lower genomic coordinates
        else print $2, $1
    } else {                                # they belong to different chromosomes
        sub(/^[^0-9]+/, "", a[1])           # extract chromosome number and overwrite a[1]
        sub(/^[^0-9]+/, "", b[1])           # extract chromosome number and overwrite b[1]
        if (a[1]+0 < b[1]+0) print $1, $2   # then compare the numbers
        else print $2, $1
    }
}' file
Output for the given sample file:
chromosome1|400|600_chromosome1|25000|35000
chromosome2|43000|45000_chromosome4|78000|80000
https://stackoverflow.com/questions/64998080
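A side note on why the attempt in the question only works some of the time: `a[1]>a[2]` compares the two entries as whole strings, so the result depends on character order, not on numeric value. The sketch below is only an illustration of that difference, using the two entries from the first sample line. The function name `compare_demo` and the variable names are made up for this example.

```shell
# compare_demo is a hypothetical name used only for this illustration
compare_demo() {
    awk 'BEGIN {
        left  = "chromosome1|25000|35000"
        right = "chromosome1|400|600"

        # Whole-entry comparison is lexicographic: after the shared prefix,
        # "2" sorts before "4", so left > right is false (no swap would happen).
        as_string = (left > right)

        # Comparing only the start coordinates as numbers gives the intended answer.
        split(left, a, "\\|")
        split(right, b, "\\|")
        as_number = (a[2] + 0 > b[2] + 0)

        print "string comparison: " as_string
        print "numeric comparison: " as_number
    }'
}

compare_demo
```

This prints `string comparison: 0` followed by `numeric comparison: 1`, which is why line 1 of the sample is left unswapped by the string-based test even though 25000 is larger than 400.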