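There are similar questions, but none of them quite solves my problem.
Put simply, I need to print every block that contains any of a set of strings. Each block's start line contains:
Details below:
I want to search a huge file (hundreds of thousands of lines) and print each region (block) between patterns, but only if a certain string is found inside that region.
I can already print every region between the patterns, with the block start and end identifiers as in this command:
awk '/<entry version=/{flag=1} flag; /<entry version=/{flag=0}'
But how can I make it print a whole block only when a particular string is found between those patterns?
The shortest excerpt of the real data for one block (each block is actually thousands of lines long) looks like this. Thanks to terdon for sorting out a better example for me to use: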
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000003">
<name>TSPAN6</name>
<synonym>T245</synonym>
<synonym>TM4SF6</synonym>
<synonym>TSPAN-6</synonym>
<identifier id="ENSG00000000003" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="O43657" db="Uniprot/SWISSPROT"/>
<xref id="7105" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
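In the real format above, I would check the name and synonym elements for particular strings, so if I were looking for "TSPAN6", that block would be printed. Each block has thousands of lines, so what follows is just a made-up miniature version to explain how blocks should be printed based on a string match inside the block.
Here is an example where my strings are "MEMSAT" and "TNMD":
Example input: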
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000003">
<name>TSPAN6</name>
<synonym>T245</synonym>
<synonym>TM4SF6</synonym>
<synonym>TSPAN-6</synonym>
<identifier id="ENSG00000000003" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="O43657" db="Uniprot/SWISSPROT"/>
<xref id="7105" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="Ezkurdia et al 2014" id="Eb" parent_id="" name="Protein evidence (Ezkurdia et al 2014)"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
</entry>
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
<name>TNMD</name>
<synonym>BRICD4</synonym>
<synonym>ChM1L</synonym>
<synonym>myodulin</synonym>
<synonym>TEM</synonym>
<synonym>tendin</synonym>
<identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
<xref id="64102" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
<proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
</entry>
Example output:
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
<name>TNMD</name>
<synonym>BRICD4</synonym>
<synonym>ChM1L</synonym>
<synonym>myodulin</synonym>
<synonym>TEM</synonym>
<synonym>tendin</synonym>
<identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
<xref id="64102" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
<proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
</entry>
Posted on 2022-10-20 16:19:32
You can use </entry[^>]*> as the record separator in GNU awk (gawk). For example, using this file as input:
<?xml version="1.0" encoding="UTF-8"?>
<proteinAtlas xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://v21.proteinatlas.org/download/proteinatlas.xsd" schemaVersion="2.6">
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000003">
<name>TSPAN6</name>
<synonym>T245</synonym>
<synonym>TM4SF6</synonym>
<synonym>TSPAN-6</synonym>
<identifier id="ENSG00000000003" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="O43657" db="Uniprot/SWISSPROT"/>
<xref id="7105" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="Ezkurdia et al 2014" id="Eb" parent_id="" name="Protein evidence (Ezkurdia et al 2014)"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
</entry>
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
<name>TNMD</name>
<synonym>BRICD4</synonym>
<synonym>ChM1L</synonym>
<synonym>myodulin</synonym>
<synonym>TEM</synonym>
<synonym>tendin</synonym>
<identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
<xref id="64102" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
<proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
</proteinEvidence>
</entry>
</proteinAtlas>
You can get the data for TNMD with:
$ gawk 'BEGIN{ RS="</entry[^>]*>" } /TNMD/' a
<entry version="21.1" url="http://v21.proteinatlas.org/ENSG00000000005">
<name>TNMD</name>
<synonym>BRICD4</synonym>
<synonym>ChM1L</synonym>
<synonym>myodulin</synonym>
<synonym>TEM</synonym>
<synonym>tendin</synonym>
<identifier id="ENSG00000000005" db="Ensembl" version="103.38" assembly="GRCh38.p13" gencodeVersion="37">
<xref id="Q9H2S6" db="Uniprot/SWISSPROT"/>
<xref id="64102" db="NCBI GeneID"/>
</identifier>
<proteinClasses>
<proteinClass source="MDM" id="Ma" parent_id="" name="Predicted membrane proteins"/>
<proteinClass source="MDM" id="Md" parent_id="" name="Membrane proteins predicted by MDM"/>
<proteinClass source="MEMSAT3" id="Me" parent_id="" name="MEMSAT3 predicted membrane proteins"/>
</proteinClasses>
<proteinEvidence evidence="Evidence at protein level">
<evidence source="HPA" evidence="Evidence at transcript level"/>
<evidence source="MS" evidence="Not available"/>
<evidence source="UniProt" evidence="Evidence at protein level"/>
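</proteinEvidence>
That just means "if this record matches TNMD, print it", where a record is now everything up to the next closing entry tag rather than a single line. Of course, it will also print a record that merely contains something like 87% identity to TNMD, and since we aren't using a proper parser, it is bound to break on various edge cases.
With a proper parser, you can specify exactly where the string must appear.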
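As a middle ground between the one-liner above and a real parser, the following is a minimal sketch in portable awk. It buffers each block and only compares the text of name and synonym elements, as exact matches, against the wanted strings. It assumes every tag sits on its own line, as in the sample data. The function name filter_entries and its comma-separated argument are illustrative choices, not part of either answer.

```shell
# Sketch only: print each <entry> block whose <name> or <synonym> text is
# exactly one of the wanted strings. Assumes one tag per line.
# filter_entries is a hypothetical helper name.
# Usage: filter_entries "TNMD,TSPAN6" a
filter_entries() {
    # $1 = comma-separated list of exact names, $2 = input file
    awk -v wanted="$1" '
        BEGIN {
            # Turn the comma-separated list into a lookup table.
            n = split(wanted, w, ",")
            for (i = 1; i <= n; i++) want[w[i]] = 1
        }
        /<entry[ >]/ { buf = ""; hit = 0; inside = 1 }   # a new block starts
        inside {
            buf = buf $0 "\n"                            # remember the line
            # Only inspect the text of <name> and <synonym> elements.
            if ($0 ~ /<(name|synonym)>[^<]*<\/(name|synonym)>/) {
                val = $0
                sub(/^.*<(name|synonym)>/, "", val)
                sub(/<\/(name|synonym)>.*$/, "", val)
                if (val in want) hit = 1
            }
        }
        /<\/entry>/ {
            # The block is complete: print it only if something matched.
            if (inside && hit) printf "%s", buf
            inside = 0
        }
    ' "$2"
}
```

Because the comparison is exact and limited to those two elements, a line such as 87% identity to TNMD elsewhere in a block does not trigger a match. Unlike the regex record separator approach, the closing entry tag is kept in the output. It is still not an XML parser, so attributes, multi-line elements, or escaped characters can defeat it.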
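Posted on 2022-10-20 19:08:50
Assuming the input is a well-formed XML document (like the example in terdon's answer, but unlike what is shown in the question), you can use xmlstarlet to output a copy of each entry node that has a specific name and a proteinClass with a specific source attribute.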
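xmlstarlet select --template \
    --copy-of '//entry[name = "TNMD" and proteinClasses/proteinClass/@source = "MEMSAT3"]' \
    -nl file
This selects every entry node that has the given name and also has a proteinClasses/proteinClass child node with the given source attribute value. A copy of each matching entry node is output, followed by a trailing newline.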
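The question mentions checking both the name and the synonyms. A hedged variation on the command above, assuming xmlstarlet is installed and the file is well-formed, could look like this. The function name entries_by_name is illustrative, and --nl is the documented long form of xmlstarlet's -n option.

```shell
# Sketch only: copy every <entry> whose <name> or any <synonym> equals "$1".
# entries_by_name is a hypothetical helper name.
# The value is pasted into the XPath expression, so it must not contain
# double quotes.
# Usage: entries_by_name TNMD file
entries_by_name() {
    # $1 = exact name or synonym, $2 = well-formed XML file
    xmlstarlet select --template \
        --copy-of "//entry[name = \"$1\" or synonym = \"$1\"]" \
        --nl "$2"
}
```

In XPath, comparing a node set such as synonym with a string is true if any node in the set has that string value, so an entry matches when at least one of its synonyms equals the argument.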
https://unix.stackexchange.com/questions/721770