awk - find an exact string in a column

Stack Overflow user
Asked 2022-07-20 14:31:02
Answers: 4, Views: 65, Followers: 0, Votes: 2

I want to find a specific string, and combinations of strings, in one column. Can you help me?

Input:

benign,likely_pathogenic
benign,likely_pathogenic
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
uncertain_significance,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
uncertain_significance,conflicting_interpretations_of_pathogenicity,likely_benign
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
pathogenic

Output:

benign,likely_pathogenic
benign,likely_pathogenic
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
benign,likely_pathogenic
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
pathogenic

I want to pull out every row that contains pathogenic or likely_pathogenic. But that string is also part of conflicting_interpretations_of_pathogenicity. I tried

awk -F'\t' -v OFS="\t" '{if($14=="pathogenic") print FILENAME,$0; else if($14=="likely_pathogenic") print FILENAME,$0}' 

But that only matches when the column is exactly that string.

If I try:

awk -F'\t' -v OFS="\t" '{if($14~"pathogenic") print FILENAME,$0}'

I get every row with pathogenic, likely_pathogenic and conflicting_interpretations_of_pathogenicity. A row may combine conflicting_interpretations_of_pathogenicity with pathogenic or likely_pathogenic.
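
The difference between the two attempts can be reproduced on a tiny input. This is only an illustrative sketch: the three sample values are taken from the data above, and the column is `$1` here (an assumption for brevity) rather than `$14`.

```shell
# Three sample values for the column (one column only, for illustration).
sample='pathogenic
benign,likely_pathogenic
benign,conflicting_interpretations_of_pathogenicity'

# Exact comparison: only a field that is exactly "pathogenic" passes.
exact=$(printf '%s\n' "$sample" | awk '$1 == "pathogenic"')

# Substring match: every value containing "pathogenic" passes,
# including conflicting_interpretations_of_pathogenicity.
loose=$(printf '%s\n' "$sample" | awk '$1 ~ "pathogenic"')

printf 'exact:\n%s\nloose:\n%s\n' "$exact" "$loose"
```

The exact test is too strict for comma-separated lists, and the substring test is too loose, which is why the answers anchor the match to whole list items.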

4 Answers

Stack Overflow user

Accepted answer

Posted 2022-07-20 14:52:06

Something like this, perhaps:

awk '{
    split($0,a,/,/)                          # split NEEDED field on commas
    for(i in a)                              # check each part
        if(a[i]~/^(likely_)?pathogenic$/) {  # if matches this regex
            print                            # output
            break                            # no need for more matches
        }
}' file

Some output:

benign,likely_pathogenic
benign,likely_pathogenic
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
...

Obviously, you need to add FS and so on, as in your sample code that deals with NF==14.
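
A possible adaptation of the split approach to the tab-separated layout from the question might look like the sketch below. The 13 filler columns, the `make_row` helper and the temporary file are assumptions made up for the demonstration; only `-F'\t'`, `OFS`, `$14` and `FILENAME` come from the question.

```shell
# Hypothetical tab-separated row: 13 filler columns, then the column of interest.
make_row() {
    printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\tc12\tc13\t%s\n' "$1"
}

tmp=$(mktemp)
{
    make_row 'benign,likely_pathogenic'
    make_row 'benign,conflicting_interpretations_of_pathogenicity'
    make_row 'pathogenic'
} > "$tmp"

matches=$(awk -F'\t' -v OFS='\t' '{
    n = split($14, parts, ",")        # split only column 14 on commas
    for (i = 1; i <= n; i++)          # check each list item in order
        if (parts[i] ~ /^(likely_)?pathogenic$/) {
            print FILENAME, $0        # same output shape as in the question
            break                     # one matching item is enough
        }
}' "$tmp")

# FILENAME is printed first, so the original column 14 is now column 15.
printf '%s\n' "$matches" | cut -f15
rm -f "$tmp"
```

Splitting `$14` instead of `$0` keeps commas in other columns from influencing the result.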

Edit:

I think this also works for the posted sample data:

$ awk '/(^|,)(likely_)?pathogenic(,|$)/' file

Or, for your hypothetical data:

$ awk '$14~/(^|,)(likely_)?pathogenic(,|$)/' file
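
To see what the anchors `(^|,)` and `(,|$)` buy, the regex can be tried against a few individual values. This is a small sketch; the `check` helper and the `pattern` variable are names invented here, and the values come from the sample data.

```shell
pattern='(^|,)(likely_)?pathogenic(,|$)'

# Print "match" or "no match" for one column value.
check() {
    printf '%s\n' "$1" |
        awk -v re="$pattern" '{ print (($0 ~ re) ? "match" : "no match") }'
}

for value in \
    'pathogenic' \
    'benign,likely_pathogenic' \
    'likely_pathogenic,benign' \
    'benign,conflicting_interpretations_of_pathogenicity'
do
    printf '%-52s %s\n' "$value" "$(check "$value")"
done
```

Each list item must start at the beginning of the value or right after a comma, and end at a comma or at the end of the value, so conflicting_interpretations_of_pathogenicity cannot match.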
Votes: 3

Stack Overflow user

Posted 2022-07-21 09:09:13

I would use the GNU AWK word boundary for this task, as follows. Let the content of file.txt be

benign,likely_pathogenic
benign,likely_pathogenic
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
uncertain_significance,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
uncertain_significance,conflicting_interpretations_of_pathogenicity,likely_benign
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
pathogenic

then

awk '/pathogenic\y/{print}' file.txt

gives the output

benign,likely_pathogenic
benign,likely_pathogenic
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
benign,likely_pathogenic
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
pathogenic

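Explanation: the word boundary (\y) is a zero-length assertion. It can be placed before a pattern, after it, or on both sides: the first form matches at the start of a word, the second at the end of a word, and the third matches a whole word. So pathogenic\y means a word ending in pathogenic. GNU AWK defines a word as a sequence of one or more letters, digits, or underscores. Note: the output differs from the desired one by the 4th risk_factor line, but it fits the description, because that line contains ,pathogenic,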

(tested in gawk 4.2.1)
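
Since `\y` is a GNU AWK extension, an awk without it needs the boundary spelled out. The sketch below is an assumed portable substitute, not part of the answer above: it requires that pathogenic be followed by a non-word character or by the end of the line.

```shell
sample='benign,likely_pathogenic
benign,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided
pathogenic'

# "pathogenic" must be followed by a character that is not a letter,
# digit or underscore, or by the end of the line.
kept=$(printf '%s\n' "$sample" | awk '/pathogenic([^A-Za-z0-9_]|$)/')

printf '%s\n' "$kept"
```

Like `pathogenic\y`, this accepts any item that merely ends in pathogenic, so it is looser than a regex that also anchors the start of the item.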

Votes: 2

Stack Overflow user

Posted 2022-07-21 13:19:55

The best I could quickly come up with (it is not finished yet) without using a word-boundary regex:

echo "${input….}" | mawk '$!(NF=NF) ~ /pathogenic/' FS='[^,]*pathogenic[[:alpha:]]+' OFS=

 1  benign,likely_pathogenic
 2  benign,likely_pathogenic
 3  risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
 4  risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
 5  risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,
 6  pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
 7  benign,likely_pathogenic
 8  benign,likely_pathogenic
 9  ,_other,benign,pathogenic,likely_benign,
10  ,_other,benign,pathogenic,likely_benign,
11  risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,
12  pathogenic,likely_pathogenic
13  pathogenic

It probably removed too much around lines 9-10.
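
One way to keep the printed rows intact is to do the deletion on a copy of the line and print the original. This is a sketch of a variation, not the answer author's command; the `probe` variable and the three-line sample are assumptions for illustration.

```shell
sample='benign,likely_pathogenic
benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign'

kept=$(printf '%s\n' "$sample" | awk '{
    probe = $0                                   # work on a copy, leave $0 alone
    gsub(/[^,]*pathogenic[A-Za-z]+/, "", probe)  # drop items such as ...pathogenicity
    if (probe ~ /pathogenic/) print              # print the unmodified line
}')

printf '%s\n' "$kept"
```

The deletion only serves as a filter here, so rows like lines 9-10 keep their conflicting_interpretations_of_pathogenicity items in the output.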

Votes: 1

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/73053519
