I want to find a specific string, and combinations of strings, in a column. Can you help me?
Input:
benign,likely_pathogenic
benign,likely_pathogenic
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
uncertain_significance,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
uncertain_significance,conflicting_interpretations_of_pathogenicity,likely_benign
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
pathogenic
Desired output:
benign,likely_pathogenic
benign,likely_pathogenic
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
benign,likely_pathogenic
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
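pathogenic
I want to pull out every row whose column contains pathogenic or likely_pathogenic. The problem is that pathogenic is also a substring of conflicting_interpretations_of_pathogenicity. I tried:
awk -F'\t' -v OFS="\t" '{if($14=="pathogenic") print FILENAME,$0; else if($14=="likely_pathogenic") print FILENAME,$0}'
but that only matches when the column is exactly that string.
If I try:
awk -F'\t' -v OFS="\t" '{if($14~"pathogenic") print FILENAME,$0}'
I get every row containing pathogenic, likely_pathogenic, or conflicting_interpretations_of_pathogenicity. A row can hold a combination of conflicting_interpretations_of_pathogenicity with pathogenic or likely_pathogenic.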
Posted on 2022-07-20 14:52:06
Something like this, maybe:
awk '{
    split($0,a,/,/)                          # split NEEDED field on commas
    for(i in a)                              # check each part
        if(a[i]~/^(likely_)?pathogenic$/) {  # if matches this regex
            print                            # output
            break                            # no need for more matches
        }
}' file
Some output:
benign,likely_pathogenic
benign,likely_pathogenic
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
...
Obviously you need to add FS etc., just as in your sample code that deals with NF==14.
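A minimal sketch of that adaptation, assuming tab-separated input with the label list in a single column. The helper name `filter_col` and the two-column sample are made up for illustration; the question's real data keeps the list in column 14, so there you would call `filter_col 14`.

```shell
# filter_col N: print tab-separated rows whose Nth column contains
# "pathogenic" or "likely_pathogenic" as a whole comma-separated item.
# (Hypothetical helper, not part of the original answer.)
filter_col() {
    awk -F'\t' -v col="$1" '{
        n = split($col, a, /,/)                  # split only the chosen column
        for (i = 1; i <= n; i++)                 # check each label in turn
            if (a[i] ~ /^(likely_)?pathogenic$/) {
                print                            # whole-label match: keep the row
                break                            # one match is enough
            }
    }'
}

# Hypothetical sample: an id column followed by the label list.
printf 'v1\tbenign,likely_pathogenic\nv2\tbenign,conflicting_interpretations_of_pathogenicity\nv3\tpathogenic\n' |
    filter_col 2
```

On this sample the v1 and v3 rows are printed and the v2 row is dropped, because each label is compared as a whole rather than as a substring.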
Edit:
I think this would also work for the posted sample data:
$ awk '/(^|,)(likely_)?pathogenic(,|$)/' file
or for your hypothetical data:
$ awk '$14~/(^|,)(likely_)?pathogenic(,|$)/' file
Posted on 2022-07-21 09:09:13
I would use GNU AWK's word boundary for this task, as follows. Let the content of file.txt be
benign,likely_pathogenic
benign,likely_pathogenic
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
uncertain_significance,likely_benign,conflicting_interpretations_of_pathogenicity
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
uncertain_significance,conflicting_interpretations_of_pathogenicity,likely_benign
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
benign,conflicting_interpretations_of_pathogenicity
pathogenic
then
awk '/pathogenic\y/{print}' file.txt
gives the output
benign,likely_pathogenic
benign,likely_pathogenic
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
benign,likely_pathogenic
benign,likely_pathogenic
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic,likely_benign,conflicting_interpretations_of_pathogenicity
risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,conflicting_interpretations_of_pathogenicity
pathogenic,likely_pathogenic
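pathogenic
Explanation: the word boundary (\y) is a zero-length assertion. It can be placed before a pattern, after it, or on both sides, which gives the start of a word, the end of a word, and a whole word respectively. So pathogenic\y means a word ending with pathogenic. GNU AWK defines a word as a sequence of one or more letters, digits, or underscores. Note: the output differs from the desired output at the 4th risk_factor line, but it agrees with the description, since that line contains ,pathogenic,.
(tested in gawk 4.2.1)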
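Since \y is a GNU extension, here is a sketch of roughly the same end-of-word test written with a plain bracket expression, for awks that lack \y. The function name `ends_with_pathogenic` is made up for illustration, and the bracket expression assumes ASCII labels like those in the sample data.

```shell
# Keep lines where "pathogenic" is followed by a non-word character
# (anything other than a letter, digit, or underscore) or by end of line.
# (Hypothetical helper; approximates /pathogenic\y/ without GNU extensions.)
ends_with_pathogenic() {
    awk '/pathogenic([^A-Za-z0-9_]|$)/'
}

printf 'benign,likely_pathogenic\nbenign,conflicting_interpretations_of_pathogenicity\npathogenic,not_provided\n' |
    ends_with_pathogenic
```

The first and third sample lines are printed; the second is not, because there pathogenic is followed by the letter i. Like the \y version, this only checks the end of the word, so any label that merely ends in pathogenic would also match.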
Posted on 2022-07-21 13:19:55
The best I could come up with quickly (it is not finished yet) without using word-boundary regexes:
echo "${input….}" | mawk '$!(NF=NF) ~ /pathogenic/' FS='[^,]*pathogenic[[:alpha:]]+' OFS=
1 benign,likely_pathogenic
2 benign,likely_pathogenic
3 risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
4 risk_factor,uncertain_significance,likely_pathogenic,uncertain_significance,_other,benign
5 risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,
6 pathogenic,not_provided,benign,likely_pathogenic,likely_benign,risk_factor
7 benign,likely_pathogenic
8 benign,likely_pathogenic
9 ,_other,benign,pathogenic,likely_benign,
10 ,_other,benign,pathogenic,likely_benign,
11 risk_factor,benign,likely_benign,drug_response,not_provided,uncertain_significance,pathogenic,uncertain_significance,_other,
12 pathogenic,likely_pathogenic
13 pathogenic
It probably deleted too much around lines 9-10.
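One possible way around the deletion, sketched with a different technique from the FS trick above: save the record first, strip the conflicting label from a working copy with gsub, test what is left, and print the saved original. The function name `keep_original` is made up for illustration.

```shell
# Print the untouched line whenever it still mentions "pathogenic" after the
# conflicting_interpretations_of_pathogenicity label has been removed.
# (Hypothetical helper, not the original poster's code.)
keep_original() {
    awk '{
        line = $0                                                 # keep the untouched record
        gsub(/conflicting_interpretations_of_pathogenicity/, "")  # drop the label causing false hits
    }
    /pathogenic/ { print line }'                                  # test the stripped record, print the original
}

printf 'conflicting_interpretations_of_pathogenicity,_other,benign,pathogenic\nbenign,conflicting_interpretations_of_pathogenicity\nbenign,likely_pathogenic\n' |
    keep_original
```

The first and third sample lines come out unchanged, and the second is dropped because nothing containing pathogenic remains once the conflicting label is removed.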
https://stackoverflow.com/questions/73053519