
Q: Parsing XML with Python or PETL

Stack Overflow user

Asked 2021-11-02 17:49:02

1 answer, 63 views, 0 followers, 0 votes

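I have been experimenting with PETL to see whether I can extract several XML files and combine them into one.

I have no control over the structure of the XML files. Below are a couple of the variations I am seeing that give me trouble.

XML file 1 example: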

<?xml version="1.0" encoding="utf-8"?>
    <Export>
        <Info>
            <Name>John Doe</Name>
            <Date>01/01/2021</Date>
        </Info>
        <App>
            <Description></Description>
            <Type>Two</Type>
            <Details>
                <DetailOne>1</DetailOne>
                <DetailTwo>2</DetailTwo>
            </Details>
            <Details>
                <DetailOne>10</DetailOne>
                <DetailTwo>11</DetailTwo>
            </Details>
        </App>
    </Export>

XML file 2 example:

<?xml version="1.0" encoding="utf-8"?>
    <Export>
        <Info>
            <Name></Name>
            <Date>01/02/2021</Date>
        </Info>
        <App>
            <Description>Sample description here.</Description>
            <Type>One</Type>
            <Details>
                <DetailOne>1</DetailOne>
                <DetailTwo>2</DetailTwo>
                <DetailOne>3</DetailOne>
                <DetailTwo>4</DetailTwo>
            </Details>
            <Details>
                <DetailOne>10</DetailOne>
                <DetailTwo>11</DetailTwo>
            </Details>
        </App>
    </Export>

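My Python code simply scans the subfolder xmlfiles and then tries to parse each file from there with PETL. Based on the structure of the documents, I load three tables so far:

1. holds the Info Name and Date
2. holds the App Description and Type
3. collects the Details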

import petl as etl
import os
from lxml import etree

for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
    if filename.endswith('.xml'):
        # Get the info children
        table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })

        # Get the App children
        table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })

        # Get the App Details children
        table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App/Details', {
            'DetailOne': 'DetailOne',
            'DetailTwo': 'DetailTwo'
        })

        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)

        print('Results:\n', result)

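I crossjoin the three tables because I want the Info and App data on every row alongside each detail. This works until I get an XML file that has multiple DetailOne and DetailTwo elements inside a single Details element.

The results I get are:

Results: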

 +------------+----------+-------------+------+-----------+-----------+----------+
| Date       | Name     | Description | Type | DetailOne | DetailTwo | FileName |
+============+==========+=============+======+===========+===========+==========+
| 01/01/2021 | John Doe | None        | Two  | 1         | 2         | one.xml  |
+------------+----------+-------------+------+-----------+-----------+----------+
| 01/01/2021 | John Doe | None        | Two  | 10        | 11        | one.xml  |
+------------+----------+-------------+------+-----------+-----------+----------+

Results:

 +------------+------+--------------------------+------+------------+------------+----------+
| Date       | Name | Description              | Type | DetailOne  | DetailTwo  | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One  | ('1', '3') | ('2', '4') | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 10         | 11         | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+

The output for the second file, which shows DetailOne as ('1', '3') and DetailTwo as ('2', '4'), is not what I want.

What I want is:

+------------+------+--------------------------+------+------------+------------+----------+
| Date       | Name | Description              | Type | DetailOne  | DetailTwo  | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One  | 1          | 2          | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 3          | 4          | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One  | 10         | 11         | two.xml  |
+------------+------+--------------------------+------+------------+------------+----------+

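I believe XPath may be the way to go, but this is what my research turned up:

https://petl.readthedocs.io/en/stable/io.html#xml-files - does not go into depth on lxml and petl

Some light reading here: https://www.w3schools.com/xml/xpath_syntax.asp

More here: https://lxml.de/tutorial.html

Any help with this is much appreciated!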

python

xml

etl

petl

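1 Answer

Stack Overflow user

Answered 2021-11-03 22:38:21

First off, thank you for taking the time to write a good question. I am happy to spend the time answering it.

I have never used PETL, but I did scan its documentation on XML processing. I think your main problem is that the <Details> tag sometimes contains one pair of tags and sometimes several pairs. If only there were a way to extract a flat list of the tag values without the enclosing tags getting in the way...

Fortunately there is. I used https://www.webtoolkitonline.com/xml-xpath-tester.html, and the XPath expression //Details/DetailOne, applied to your example XML, returns the list 1,3,10.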
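The same check can be reproduced locally without an online tester. This is only a minimal sketch, assuming Python's standard-library xml.etree.ElementTree rather than PETL, with the second sample file trimmed to its App element. Note that ElementTree wants a relative path when searching from an element, so the expression gets a leading dot.

```python
import xml.etree.ElementTree as ET

# Second sample file from the question, trimmed to the App element.
SAMPLE = """\
<Export>
    <App>
        <Details>
            <DetailOne>1</DetailOne>
            <DetailTwo>2</DetailTwo>
            <DetailOne>3</DetailOne>
            <DetailTwo>4</DetailTwo>
        </Details>
        <Details>
            <DetailOne>10</DetailOne>
            <DetailTwo>11</DetailTwo>
        </Details>
    </App>
</Export>
"""

root = ET.fromstring(SAMPLE)

# './/' searches at any depth below the element it is called on.
detail_ones = [e.text for e in root.findall('.//Details/DetailOne')]
detail_twos = [e.text for e in root.findall('.//Details/DetailTwo')]

print(detail_ones)  # ['1', '3', '10']
print(detail_twos)  # ['2', '4', '11']
```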

So I think something like this should work:

import petl as etl
import os
from lxml import etree

for filename in os.listdir(os.getcwd() + '.\\xmlfiles\\'):
    if filename.endswith('.xml'):
        # Get the info children
        table1 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })

        # Get the App children
        table2 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })

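        # Get the App Details children
        # Paths are relative: ElementTree-style lookups on an element
        # reject paths that begin with '/'.
        table3 = etl.fromxml((os.getcwd() + '.\\xmlfiles\\' + filename), 'App', {
            'DetailOne': './/DetailOne',
            'DetailTwo': './/DetailTwo'
        })

        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)

        print('Results:\n', result)

The leading .// is XPath syntax for "at any depth below the current element". I don't know how PETL handles XPath, so I have not been able to verify this. I agree, by the way, that the documentation is rather thin on detail.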
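If the values still come back grouped, which the ('1', '3') output in the question suggests happens whenever a value path matches more than one element under a single row element, one alternative is to do the pairing outside PETL. The following is a rough sketch rather than a tested PETL recipe: it uses only the standard-library xml.etree.ElementTree, the helper name detail_rows is made up for illustration, and it assumes that the i-th DetailOne inside a Details element belongs with the i-th DetailTwo.

```python
import xml.etree.ElementTree as ET

# Second sample file from the question (XML declaration omitted).
SAMPLE = """\
<Export>
    <Info>
        <Name></Name>
        <Date>01/02/2021</Date>
    </Info>
    <App>
        <Description>Sample description here.</Description>
        <Type>One</Type>
        <Details>
            <DetailOne>1</DetailOne>
            <DetailTwo>2</DetailTwo>
            <DetailOne>3</DetailOne>
            <DetailTwo>4</DetailTwo>
        </Details>
        <Details>
            <DetailOne>10</DetailOne>
            <DetailTwo>11</DetailTwo>
        </Details>
    </App>
</Export>
"""

HEADER = ('Date', 'Name', 'Description', 'Type',
          'DetailOne', 'DetailTwo', 'FileName')


def detail_rows(xml_text, filename):
    """Return one flat row per DetailOne/DetailTwo pair (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    info = root.find('Info')
    app = root.find('App')
    # findtext gives '' for an empty element; store None instead,
    # matching the None cells in the tables shown in the question.
    shared = (
        info.findtext('Date') or None,
        info.findtext('Name') or None,
        app.findtext('Description') or None,
        app.findtext('Type') or None,
    )
    rows = []
    for details in app.findall('Details'):
        ones = [e.text for e in details.findall('DetailOne')]
        twos = [e.text for e in details.findall('DetailTwo')]
        # Assumption: values pair up by position within each Details element.
        for one, two in zip(ones, twos):
            rows.append(shared + (one, two, filename))
    return rows


rows = detail_rows(SAMPLE, 'two.xml')
for row in [HEADER] + rows:
    print(row)
```

For the second sample file this yields three data rows, with DetailOne/DetailTwo values 1/2, 3/4 and 10/11. The list [HEADER] + rows is a header row followed by data rows, so it should be usable wherever PETL expects a table, though I have not tried that.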

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/69814848
