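Question: Retrieve multiple nested elements for a single object simultaneously with lxml in Python

Stack Overflow user

Asked 2020-10-30 10:11:10

Answers: 1, Views: 42, Followers: 0, Votes: 2

I am working with a large XML file and retrieving many different attributes from it. Now I am trying to retrieve the comment category attribute and join it with the text between the tags. However, I need to handle 3 different cases. Sample XML: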

<comment-list>
 <comment category="Derived from sampling site"> Peripheral blood </comment>
 <comment category="Transformant">
   <cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
 </comment>
 <comment category="Sequence variation"> Hemizygous for FMR1 &gt;200 CGG repeats (PubMed=25776194) 
 </comment>
 <comment category="Monoclonal antibody target">
   <xref-list>
     <xref database="UniProtKB" category="Sequence databases" accession="Q5T5X7">
       <property-list>
         <property name="gene/protein designation" value="Human BEND3"/>
       </property-list>
       <url><![CDATA[https://www.uniprot.org/uniprot/Q5T5X7]]></url>
     </xref>
   </xref-list>
 </comment>
 </comment-list>

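1. <comment> has no child tags. I need to retrieve the comment category attribute and join it with the text between the tags.
2. <comment> has a single <cv-term> tag nested below it. I need to retrieve the comment category, the cv-term terminology, the cv-term accession, and the text between the cv-term tags.
3. <comment> has several tags nested below it: <xref-list>, <xref>, <property-list>, <property>, <url>. In this case I need to retrieve the comment category, the xref database attribute, the xref accession attribute, and the property value attribute.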

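I am using lxml to parse this XML and am struggling to work out how to handle case 2. Cases 1 and 3 work, but when a single object contains all three cases, the output gets scrambled.

I would like to get the following output: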

Derived from sampling site: Peripheral blood
Transformant: NCBI-Taxonomy, 10376, Epstein-Barr virus (EBV)
Sequence variation: Hemizygous for FMR1 &gt;200 CGG repeats (PubMed=25776194)
Monoclonal antibody target: UniProtKB, Q5T5X7, Human BEND3

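Here is my very messy code, which emits the elements in the wrong order. It works fine for cases 1 and 3, but once case 2 comes into play the output order is wrong: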

comment_cat = att.xpath('.//comment-list/comment/@category')
comment_text = att.xpath('.//comment-list/comment/text()') 
cv_term = att.xpath('.//comment-list/comment/cv-term/text()')
xref = [a + ', ' + b for a, b in zip(att.xpath('.//comment-list/comment/xref-list/xref/@database'),
                                     att.xpath('.//comment-list/comment/xref-list/xref/@accession'))]
property_list = att.xpath('.//comment-list/comment/xref-list/xref/property-list/property/@value')
xref_property_list = [a + ', ' + b for a,b in zip(xref, property_list)]
empty_str_in_text = ['\n      ', '\n    ', '\n      ', '\n    ']
comment_texts_all = cv_term+comment_text+xref_property_list

for e in empty_str_in_text:
    if e in comment_texts_all:
        comment_texts_all.remove(e)    
key_values['Comments'] = ';; '.join([i + ': ' + j for i, j in zip(comment_cat, comment_texts_all)])

Output:

Derived from sampling site: Epstein-Barr virus (EBV);; 
Transformant:  Peripheral blood ;; 
Sequence variation:  Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194) ;; 
Monoclonal antibody target: UniProtKB, Q5T5X7, Human BEND3
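
The scrambled order follows from how the flat lists are built: each XPath query collects values across all comments at once, and concatenating the lists by type (cv-term texts first, then plain texts) discards document order before zip pairs them with the categories. A minimal sketch that reproduces this on a trimmed version of the sample; the names `root`, `plain_texts` and `by_type` and the whitespace filter are illustrative assumptions, not the asker's exact code:

```python
from lxml import etree

xml = """<comment-list>
 <comment category="Derived from sampling site"> Peripheral blood </comment>
 <comment category="Transformant">
   <cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
 </comment>
 <comment category="Sequence variation"> Hemizygous for FMR1 </comment>
</comment-list>"""

root = etree.fromstring(xml)

categories = root.xpath('./comment/@category')
# text() returns every text node directly under any <comment>, including
# the whitespace-only nodes around <cv-term>; drop those here.
plain_texts = [t.strip() for t in root.xpath('./comment/text()') if t.strip()]
cv_texts = root.xpath('./comment/cv-term/text()')

# Concatenating by type puts the cv-term text first, regardless of where
# its <comment> sits in the document.
by_type = cv_texts + plain_texts
misaligned = list(zip(categories, by_type))
for category, text in misaligned:
    print(f"{category}: {text}")
```

Here the first category ends up paired with the cv-term text, which matches the wrong output shown above. Reading each `<comment>` element's own children one comment at a time avoids the problem, because every category stays next to its own data.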

python

xml-parsing

lxml

1 Answer

Stack Overflow user

Accepted answer

Posted 2020-10-30 11:36:21

Here is a slightly different approach:

xml = '''<comment-list>
    <comment category="Derived from sampling site"> Peripheral blood </comment>
    <comment category="Transformant">
        <cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
    </comment>
    <comment category="Sequence variation"> Hemizygous for FMR1 &gt;200 CGG repeats (PubMed=25776194)</comment>
    <comment category="Monoclonal antibody target">
        <xref-list>
            <xref database="UniProtKB" category="Sequence databases" accession="Q5T5X7">
                <property-list>
                    <property name="gene/protein designation" value="Human BEND3"/>
                </property-list>
                <url><![CDATA[https://www.uniprot.org/uniprot/Q5T5X7]]></url>
            </xref>
        </xref-list>
    </comment>
    <comment category="Knockout cell">
        <method>KO mouse</method>
        <xref-list>
            <xref database="MGI" category="Organism-specific " accession="MGI:97740">
                <property-list>
                    <property name="gene/protein designation" value="Polb"/>
                </property-list>
                <url><![CDATA[http://www.informatics.jax.org//MGI:97740]]></url>
            </xref>
        </xref-list>
    </comment>
</comment-list>'''

from lxml import etree as ET

tree = ET.fromstring(xml)

result = ''

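for comment in tree.iter('comment'):
    # Every output line starts with the comment's category attribute.
    result += f"{comment.get('category')}: "
    cv_term = comment.find('cv-term')
    xref_list = comment.find('xref-list')
    method = comment.find('method')
    if len(comment) == 0:
        # Case 1: no child elements, so use the comment's own text.
        result += comment.text
    elif cv_term is not None:
        # Case 2: a nested <cv-term>; join its attributes and its text.
        result += ', '.join([cv_term.get('terminology'), cv_term.get('accession'), cv_term.text])
    elif xref_list is not None and method is None:
        # Case 3: an <xref-list> without a <method>; use the first xref's values.
        result += ', '.join([
            xref_list.xpath('./xref/@database')[0],
            xref_list.xpath('./xref/@accession')[0],
            xref_list.xpath('./xref/property-list/property/@value')[0],
        ])
    elif method is not None:
        # Extra case in the extended sample: a <method> element; use its text.
        result += method.text
    result += '\n'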

print(result)

Output:

Derived from sampling site:  Peripheral blood 
Transformant: NCBI-Taxonomy, 10376, Epstein-Barr virus (EBV)
Sequence variation:  Hemizygous for FMR1 >200 CGG repeats (PubMed=25776194)
Monoclonal antibody target: UniProtKB, Q5T5X7, Human BEND3
Knockout cell: KO mouse
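
If the surrounding whitespace in the output is unwanted, or a comment may hold more than one `<xref>`, the same per-comment idea can be wrapped in a helper. The following is a sketch under assumptions rather than part of the accepted answer: the function name `format_comment`, the decision to strip text, and the `; ` separator between multiple xrefs are all illustrative choices.

```python
from lxml import etree


def format_comment(comment):
    """Return 'category: details' for one <comment> element (hypothetical helper)."""
    category = comment.get('category')
    cv_term = comment.find('cv-term')
    method = comment.find('method')
    xrefs = comment.findall('xref-list/xref')

    if cv_term is not None:
        # Nested <cv-term>: terminology, accession, then the element text.
        parts = [cv_term.get('terminology'), cv_term.get('accession'),
                 (cv_term.text or '').strip()]
        details = ', '.join(p for p in parts if p)
    elif method is not None:
        # As in the answer above, a <method> takes priority over any xref.
        details = (method.text or '').strip()
    elif xrefs:
        # One entry per <xref>: database, accession, then property values.
        entries = []
        for xref in xrefs:
            values = [p.get('value') for p in xref.findall('property-list/property')]
            fields = [xref.get('database'), xref.get('accession'), *values]
            entries.append(', '.join(f for f in fields if f))
        details = '; '.join(entries)
    else:
        # No recognised children: fall back to the comment's own text.
        details = (comment.text or '').strip()
    return f"{category}: {details}"


xml = """<comment-list>
  <comment category="Derived from sampling site"> Peripheral blood </comment>
  <comment category="Transformant">
    <cv-term terminology="NCBI-Taxonomy" accession="10376">Epstein-Barr virus (EBV)</cv-term>
  </comment>
  <comment category="Monoclonal antibody target">
    <xref-list>
      <xref database="UniProtKB" category="Sequence databases" accession="Q5T5X7">
        <property-list>
          <property name="gene/protein designation" value="Human BEND3"/>
        </property-list>
      </xref>
    </xref-list>
  </comment>
</comment-list>"""

lines = [format_comment(c) for c in etree.fromstring(xml).iter('comment')]
print('\n'.join(lines))
```

Building a list of lines and joining once at the end also avoids the trailing newline that repeated string concatenation leaves behind.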

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/64606363
