I'm new to bash/awk programming, and my file looks like this:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . 1
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . 1,4
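1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . 8,64,512
With awk, I want to replace the numbers in the last column ($10) with their descriptions. I stored the numbers and their definitions in two separate arrays. My idea was to change the numbers by iterating over both arrays. Here, 0 is "unknown", 1 is "germline", 4 is "somatic", and so on.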
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
number=$(IFS=,; echo "${z[*]}")
def=$(IFS=,; echo "${t[*]}")
awk -v a="$number" -v b="${def}" 'BEGIN { OFS="\t" } /#/ {next}
{
x=split(a, e, /,/)
y=split(b, f, /,/)
delete c
m=split($10, c, /,/)
for (i=1; i<=m; i++) {
for (j=1; j<=x; j++) {
if (c[i]==e[j]) {
c[i]=f[j]
}
}
$10+=sprintf("%s, ",c[i])
}
print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10
}' input.vcf > output.vcf
The output should look like this:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,paternal
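1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive
I'd be glad if you could help!
Best wishes
Posted on 2021-01-08 15:38:07
Assuming you don't actually need the lists of numbers and names defined as 2 shell arrays for some other reason: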
$ cat tst.awk
BEGIN {
    split("0 1 2 4 8 16 32 64 128 256 512 1024 1073741824",nrsArr)
    split("unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other",namesArr)
    for (i in nrsArr) {
        nr2name[nrsArr[i]] = namesArr[i]
    }
}
!/#/ {
    n = split($NF,nrs,/,/)
    sub(/[^[:space:]]+$/,"")
    printf "%s", $0
    for (i=1; i<=n; i++) {
        printf "%s%s", nr2name[nrs[i]], (i<n ? "," : ORS)
    }
}
$ awk -f tst.awk input.vcf
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,inherited
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive
The above preserves whatever whitespace you have in your input file, in case that matters.
Posted on 2021-01-08 15:33:44
You can use this awk:
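To illustrate what "preserves whitespace" means here, the following is a minimal sketch comparing the two ways of replacing the last field. The sample line and the function names `keep_ws` and `rebuild` are made up for the demonstration; they are not part of the answer above.

```shell
#!/usr/bin/env bash
# Hypothetical sample: fields separated by a mix of a TAB and runs of spaces.
sample=$'1\t100  A   4'

# Approach 1: remember the last field, strip it from $0 with sub(),
# then print $0 as-is, so the original separators survive.
keep_ws() {
  awk '{ last = $NF; sub(/[^[:space:]]+$/, ""); printf "%s[%s]%s", $0, last, ORS }'
}

# Approach 2: assign to $NF, which makes awk rebuild $0 with OFS
# (a single space by default), collapsing the original separators.
rebuild() {
  awk '{ $NF = "[" $NF "]"; print }'
}

printf '%s\n' "$sample" | keep_ws
printf '%s\n' "$sample" | rebuild
```

The first command should keep the TAB and the runs of spaces exactly as they were, while the second prints the fields joined by single spaces. Which behavior you want depends on whether anything downstream cares about the original separators.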
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
awk -v z="${z[*]}" -v t="${t[*]}" '
BEGIN {
    split(z, zarr)
    split(t, tarr)
    for (i=1; i in zarr; ++i)
        map[zarr[i]] = tarr[i]
}
{
    split($NF, arr, /,/)
    s = ""
    for (i=1; i in arr; ++i)
        s = s (i == 1 ? "" : ",") map[arr[i]]
    $NF = s;
}
1
' file
By the way, 4 is mapped to inherited, not to paternal as in your expected output.
Posted on 2021-01-08 14:17:10
Use this short Perl one-liner:
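If you would rather keep the nested-loop structure from the question, the main thing to change is `$10+=sprintf(...)`: in awk, `+=` is numeric addition, so the strings are converted to numbers instead of being concatenated. Below is a minimal sketch of the question's approach with that fixed. The helper name `translate_loop` and the shortened sample line (with `x` and `y` standing in for columns 6 and 7) are invented for illustration.

```shell
#!/usr/bin/env bash
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=(unknown germline somatic inherited paternal maternal de-novo biparental
   uniparental not-tested tested-inconclusive not-reported other)

# translate_loop is a hypothetical helper; it reads records on stdin.
translate_loop() {
  local number def
  number=$(IFS=,; echo "${z[*]}")
  def=$(IFS=,; echo "${t[*]}")
  awk -v a="$number" -v b="$def" '
    BEGIN {
      OFS = "\t"
      x = split(a, e, /,/)   # split the two lists once, not per record
      split(b, f, /,/)
    }
    /#/ { next }
    {
      m = split($10, c, /,/)
      out = ""
      for (i = 1; i <= m; i++) {
        for (j = 1; j <= x; j++)
          if (c[i] == e[j]) { c[i] = f[j]; break }
        out = out (i > 1 ? "," : "") c[i]   # concatenation is juxtaposition, not +=
      }
      $10 = out
      print
    }'
}

printf '1 10032184 10032184 A G x y Pathogenic . 1,4\n' | translate_loop
```

Because `$10` is assigned and `OFS` is a TAB, the record is rebuilt with TAB separators on output. A code with no match in the list is left as the original number here, which is a design choice of this sketch rather than something the question specifies.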
perl -F'\t' -lane '
    BEGIN {
        @keys = qw( 0 1 2 4 8 16 32 64 128 256 512 1024 1073741824 );
        @vals = qw( unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other );
        %val = map { $keys[$_] => $vals[$_] } 0..$#keys;
    }
    print join "\t", @F[0..8], ( join ",", map { $val{$_} } split /,/, $F[9] );
' in_file > out_file
The Perl one-liner uses these command line flags:
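-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in the -F option.
-F'\t' : Split into @F on TAB, rather than on whitespace.
%val = map { $keys[$_] => $vals[$_] } 0..$#keys; : Creates %val, a hash lookup table with keys = numeric codes and values = mutation/variant types.
Note that in Perl, arrays are 0-indexed.
See also: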
perldoc perlrun: how to execute the Perl interpreter: command line switches
https://stackoverflow.com/questions/65630462