首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >标识模式之间的字符串,如果找到字符串,则在模式之间打印整个区域。正确地使用awk

标识模式之间的字符串,如果找到字符串,则在模式之间打印整个区域。正确地使用awk
EN

Unix & Linux用户
提问于 2022-10-20 13:59:48
回答 2查看 408关注 0票数 3

有类似的问题,但没有一个完全解决我的问题。

简单地说,我需要打印包含任何字符串的每一个块。每个块开始行包含:

详情见下文:

我想搜索一个庞大的文件(数十万行),如果在模式区域(块)中识别了某个字符串,则在模式之间打印每个区域(块)。

我可以在模式之间打印整个区域,其中这些块的开始和结束标识符是"/awk '/<entry version=/{flag=1} flag; /<entry version=/{flag=0}'。

但是,如果在这些模式之间找到了特定的字符串,那么如何使它只打印整个块呢?

实际数据的最短部分对于块区域(实际上每个块有数千行长)是这样的,我想感谢Terdon为我使用的一个更好的示例进行排序:

代码语言:javascript
复制
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000003">   
        <name>TSPAN6</name>                                                                                                                             
        <synonym>T245</synonym>
        <synonym>TM4SF6</synonym>
        <synonym>TSPAN-6</synonym>
        <identifier id="ENSG00000000003" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
                <xref id="O43657" db="Uniprot/SWISSPROT"/> 
                <xref id="7105" db="NCBI GeneID"/>
        </identifier>  
        <proteinClasses>   
                <proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>     

在上面的真实格式中,我将检查特定字符串的名称和同义词,所以如果我要查找"TSPAN6“,那么这个块就会被打印出来。每个块都有上千行,所以下面只是一个由迷你组成的版本,来解释如何根据块中的字符串匹配来打印块。

下面是一个例子,如果我的字符串是"MEMSAT“和"TNMD”

示例输入:

代码语言:javascript
复制
 <entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000003">
        <name>TSPAN6</name>
        <synonym>T245</synonym>
        <synonym>TM4SF6</synonym>
        <synonym>TSPAN-6</synonym>
        <identifier id="ENSG00000000003" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
            <xref id="O43657" db="Uniprot/SWISSPROT"/>
            <xref id="7105" db="NCBI GeneID"/>
        </identifier>
        <proteinClasses>
            <proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
            <proteinClass source="Ezkurdia et al 2014" id="Eb" parent_id="" name="Protein evidence (Ezkurdia et al 2014)"/>
        </proteinClasses>
        <proteinEvidence evidence="Evidence at protein level">
            <evidence source="HPA" evidence="Evidence at transcript level"/>
            <evidence source="MS" evidence="Not available"/>
            <evidence source="UniProt" evidence="Evidence at protein level"/>
        </proteinEvidence>
  </entry>
    <entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
        <name>TNMD</name>
        <synonym>BRICD4</synonym>
        <synonym>ChM1L</synonym>
        <synonym>myodulin</synonym>
        <synonym>TEM</synonym>
        <synonym>tendin</synonym>
        <identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
            <xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
            <xref id="64102" db="NCBI GeneID"/>
        </identifier>
        <proteinClasses>
            <proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
            <proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
            <proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
        </proteinClasses>
        <proteinEvidence evidence="Evidence at protein level">
            <evidence source="HPA" evidence="Evidence at transcript level"/>
            <evidence source="MS" evidence="Not available"/>
            <evidence source="UniProt" evidence="Evidence at protein level"/>
        </proteinEvidence>
  </entry> 

示例输出:

代码语言:javascript
复制
    <entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
        <name>TNMD</name>
        <synonym>BRICD4</synonym>
        <synonym>ChM1L</synonym>
        <synonym>myodulin</synonym>
        <synonym>TEM</synonym>
        <synonym>tendin</synonym>
        <identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
            <xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
            <xref id="64102" db="NCBI GeneID"/>
        </identifier>
        <proteinClasses>
            <proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
            <proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
            <proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
        </proteinClasses>
        <proteinEvidence evidence="Evidence at protein level">
            <evidence source="HPA" evidence="Evidence at transcript level"/>
            <evidence source="MS" evidence="Not available"/>
            <evidence source="UniProt" evidence="Evidence at protein level"/>
        </proteinEvidence>
  </entry>  
EN

回答 2

Unix & Linux用户

回答已采纳

发布于 2022-10-20 16:19:32

您可以在GNU中使用</entry[^>]*>作为记录分隔符。例如,使用此文件作为输入:

代码语言:javascript
复制
<?xml version="1.0" encoding="UTF-8"?>
<proteinAtlas xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://v21.proteinatlas.org/download/proteinatlas.xsd" schemaVersion="2.6">
    <entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000003">
        <name>TSPAN6</name>
        <synonym>T245</synonym>
        <synonym>TM4SF6</synonym>
        <synonym>TSPAN-6</synonym>
        <identifier id="ENSG00000000003" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
            <xref id="O43657" db="Uniprot/SWISSPROT"/>
            <xref id="7105" db="NCBI GeneID"/>
        </identifier>
        <proteinClasses>
            <proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
            <proteinClass source="Ezkurdia et al 2014" id="Eb" parent_id="" name="Protein evidence (Ezkurdia et al 2014)"/>
        </proteinClasses>
        <proteinEvidence evidence="Evidence at protein level">
            <evidence source="HPA" evidence="Evidence at transcript level"/>
            <evidence source="MS" evidence="Not available"/>
            <evidence source="UniProt" evidence="Evidence at protein level"/>
        </proteinEvidence>
  </entry>
    <entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
        <name>TNMD</name>
        <synonym>BRICD4</synonym>
        <synonym>ChM1L</synonym>
        <synonym>myodulin</synonym>
        <synonym>TEM</synonym>
        <synonym>tendin</synonym>
        <identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
            <xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
            <xref id="64102" db="NCBI GeneID"/>
        </identifier>
        <proteinClasses>
            <proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
            <proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
            <proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
        </proteinClasses>
        <proteinEvidence evidence="Evidence at protein level">
            <evidence source="HPA" evidence="Evidence at transcript level"/>
            <evidence source="MS" evidence="Not available"/>
            <evidence source="UniProt" evidence="Evidence at protein level"/>
        </proteinEvidence>
  </entry>
</proteinAtlas>

您可以通过以下方法获取TNMD的数据:

代码语言:javascript
复制
$ gawk 'BEGIN{ RS="</entry[^>]*>" } /TNMD/' a

    <entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
        <name>TNMD</name>
        <synonym>BRICD4</synonym>
        <synonym>ChM1L</synonym>
        <synonym>myodulin</synonym>
        <synonym>TEM</synonym>
        <synonym>tendin</synonym>
        <identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
            <xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
            <xref id="64102" db="NCBI GeneID"/>
        </identifier>
        <proteinClasses>
            <proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
            <proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
            <proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
        </proteinClasses>
        <proteinEvidence evidence="Evidence at protein level">
            <evidence source="HPA" evidence="Evidence at transcript level"/>
            <evidence source="MS" evidence="Not available"/>
            <evidence source="UniProt" evidence="Evidence at protein level"/>
        </proteinEvidence>

这仅仅意味着“如果这一行与TNMD匹配,就打印它”。当然,如果该行类似于87% identity to TNMD,并且由于我们没有使用正确的解析器,它必然会在各种边缘情况下中断,那么它也会打印出来。

有了正确的解析器,您可以指定字符串的确切位置。

票数 3
EN

Unix & Linux用户

发布于 2022-10-20 19:08:50

假设输入是格式良好的XML文档(如特顿的回答中的示例,但不是问题中所示),您可以使用xmlstarlet输出每个entry节点的副本,其中包含特定的nameproteinClass的S source属性。

代码语言:javascript
复制
xmlstarlet select --template \
   --copy-of '//entry[name = "TNMD" and proteinClasses/proteinClass/@source = "MEMSAT3"]' \
   -nl file

这将选择具有一个具有特定entry属性值的proteinClasses/proteinClass子节点的特定name的所有source节点。每个匹配的entry节点的副本将通过添加一个尾换行符输出。

票数 6
EN
页面原文内容由Unix & Linux提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://unix.stackexchange.com/questions/721770

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档