
Question: Replacing numbers with their respective strings in awk

Stack Overflow user

Asked 2021-01-08 14:03:30

Answers: 4 | Views: 339 | Followers: 0 | Votes: 1

I'm new to bash/awk programming, and my file looks like this:

1   10032154    10032154    A   C   Leber_congenital_amaurosis_9    criteria_provided,_single_submitter Benign  .   1
1   10032184    10032184    A   G   Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts    Pathogenic/Likely_pathogenic    .   1,4
1   10032209    10032209    G   A   not_provided    criteria_provided,_single_submitter Likely_benign   .   8,64,512

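Using awk, I want to replace the numbers in the last column ($10) with their descriptions. I have stored the numbers and their definitions in two separate arrays. My idea was to replace the numbers by iterating over both arrays. Here, 0 is "unknown", 1 is "germline", 4 is "somatic", and so on.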

z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")

number=$(IFS=,; echo "${z[*]}")
def=$(IFS=,; echo "${t[*]}")
    
awk -v a="$number" -v b="${def}" 'BEGIN { OFS="\t" } /#/ {next} 
{
    x=split(a, e, /,/)
    y=split(b, f, /,/)
    
    delete c
    m=split($10, c, /,/)
    for (i=1; i<=m; i++) {
        for (j=1; j<=x; j++) {
            if (c[i]==e[j]) {
                c[i]=f[j]
            }
        }
        $10+=sprintf("%s, ",c[i])
    }
    print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10
}' input.vcf > output.vcf

The output should look like this:

1   10032154    10032154    A   C   Leber_congenital_amaurosis_9    criteria_provided,_single_submitter Benign  .   germline
1   10032184    10032184    A   G   Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts    Pathogenic/Likely_pathogenic    .   germline,paternal
1   10032209    10032209    G   A   not_provided    criteria_provided,_single_submitter Likely_benign   .   paternal,biparental,tested-inconclusive

I would be glad if you could help me!

All the best

bash

awk

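4 Answers

Stack Overflow user

Accepted answer

Answered 2021-01-08 15:38:07

Assuming you don't actually need the lists of numbers and names to be defined as 2 shell arrays for some other reason: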

$ cat tst.awk
BEGIN {
    split("0 1 2 4 8 16 32 64 128 256 512 1024 1073741824",nrsArr)
    split("unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other",namesArr)
    for (i in nrsArr) {
        nr2name[nrsArr[i]] = namesArr[i]
    }
}
!/#/ {
    n = split($NF,nrs,/,/)
    sub(/[^[:space:]]+$/,"")
    printf "%s", $0
    for (i=1; i<=n; i++) {
        printf "%s%s", nr2name[nrs[i]], (i<n ? "," : ORS)
    }
}

$ awk -f tst.awk input.vcf
1   10032154    10032154    A   C   Leber_congenital_amaurosis_9    criteria_provided,_single_submitter Benign  .   germline
1   10032184    10032184    A   G   Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts    Pathogenic/Likely_pathogenic    .   germline,inherited
1   10032209    10032209    G   A   not_provided    criteria_provided,_single_submitter Likely_benign   .   paternal,biparental,tested-inconclusive

The above retains whatever white space you had in your input file, in case that matters.
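To illustrate why the spacing survives, here is a minimal sketch comparing the two ways of changing the last field. The sample line and the replacement text are made up for the demonstration, and the bracket expression `[^ \t]` is a portable stand-in for the `[^[:space:]]` used in the script above.

```shell
# Hypothetical sample line: fields separated by runs of spaces.
line='1   10032154    A   C   1,4'

# Assigning to a field makes awk rebuild $0, joining fields with OFS
# (a single space by default), so the original spacing is lost.
rebuilt=$(printf '%s\n' "$line" | awk '{ $NF = "germline,inherited"; print }')

# Removing the last field with sub() edits $0 as a string, so every
# separator in front of it stays exactly as it was in the input.
kept=$(printf '%s\n' "$line" | awk '{ sub(/[^ \t]+$/, ""); print $0 "germline,inherited" }')

printf '%s\n%s\n' "$rebuilt" "$kept"
```

The first line printed has single spaces between fields; the second keeps the original runs of spaces.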

Votes: 5

Stack Overflow user

Answered 2021-01-08 15:33:44

You may use this awk:

z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")

awk -v z="${z[*]}" -v t="${t[*]}" '
BEGIN {
   split(z, zarr)
   split(t, tarr)
   for (i=1; i in zarr; ++i)
      map[zarr[i]] = tarr[i]
}
{
   split($NF, arr, /,/)
   s = ""
   for (i=1; i in arr; ++i)
      s = s (i == 1 ? "" : ",") map[arr[i]]
   $NF = s;
}
1
' file

By the way, 4 is mapped to inherited, not to paternal as shown in your expected output.
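A related edge case: a code that is missing from the list is looked up as an empty string. The following is a minimal sketch, assuming tab-delimited input and assuming you would rather keep the unmapped number than drop it. The function name `translate_codes` and the two sample lines are hypothetical.

```shell
# translate_codes is a hypothetical helper name, not part of the answer above.
translate_codes() {
    awk -v z="0 1 2 4 8 16 32 64 128 256 512 1024 1073741824" \
        -v t="unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other" '
    BEGIN {
        FS = OFS = "\t"               # assumption: the input is tab-delimited
        n = split(z, zarr, " ")
        split(t, tarr, " ")
        for (i = 1; i <= n; i++)
            map[zarr[i]] = tarr[i]    # number -> name lookup table
    }
    {
        m = split($NF, arr, ",")
        s = ""
        for (i = 1; i <= m; i++) {
            # Keep the original number when it has no entry in the map.
            name = (arr[i] in map) ? map[arr[i]] : arr[i]
            s = s (i == 1 ? "" : ",") name
        }
        $NF = s
        print
    }'
}

# Made-up input: 3 is not a defined code, so it is passed through unchanged.
printf 'x\t1,4\ny\t3,64\n' | translate_codes
```

Testing with `in` rather than comparing the lookup to "" avoids creating empty entries in the map as a side effect.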

Votes: 4

Stack Overflow user

Answered 2021-01-08 14:17:10

Use this short Perl one-liner:

perl -F'\t' -lane '
BEGIN {
    @keys = qw( 0 1 2 4 8 16 32 64 128 256 512 1024 1073741824 );
    @vals = qw( unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other );
    %val = map { $keys[$_] => $vals[$_] } 0..$#keys;
}
print join "\t", @F[0..8], ( join ",", map { $val{$_} } split /,/, $F[9] );
' in_file > out_file

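The Perl one-liner uses these command line flags:

-e: tells Perl to look for code in-line, instead of in a file.

-n: loops over the input one line at a time, assigning it to $_ by default.

-l: strips the input line separator ("\n" on *NIX by default) before executing the code in-line, and appends it when printing.

-a: splits $_ into array @F on whitespace or on the regex specified in the -F option.

-F'\t': splits into @F on TAB, rather than on whitespace.

%val = map { $keys[$_] => $vals[$_] } 0..$#keys;: creates %val, a hash lookup table with keys = numeric codes and values = mutation/variant types.

Note that in Perl, arrays are 0-indexed.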
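For comparison with the awk answers: arrays filled by awk's split() start at index 1, not 0, which is why their loops begin at i=1. A minimal sketch with a made-up three-word list:

```shell
# split() returns the number of elements and fills names[1]..names[n];
# there is no element at index 0.
first=$(awk 'BEGIN {
    n = split("unknown germline somatic", names, " ")
    print n, names[1], ((0 in names) ? "has-0" : "no-0")
}')
echo "$first"
```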

See also:

perldoc perlrun: how to execute the Perl interpreter: command line switches

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/65630462
