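I have been experimenting with PETL to see whether I can extract multiple XML files and combine them into one.
I have no control over the structure of the XML files. Below are a couple of the variations I am seeing that are giving me trouble.
Example XML file 1: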
<?xml version="1.0" encoding="utf-8"?>
<Export>
  <Info>
    <Name>John Doe</Name>
    <Date>01/01/2021</Date>
  </Info>
  <App>
    <Description></Description>
    <Type>Two</Type>
    <Details>
      <DetailOne>1</DetailOne>
      <DetailTwo>2</DetailTwo>
    </Details>
    <Details>
      <DetailOne>10</DetailOne>
      <DetailTwo>11</DetailTwo>
    </Details>
  </App>
</Export>
Example XML file 2:
<?xml version="1.0" encoding="utf-8"?>
<Export>
  <Info>
    <Name></Name>
    <Date>01/02/2021</Date>
  </Info>
  <App>
    <Description>Sample description here.</Description>
    <Type>One</Type>
    <Details>
      <DetailOne>1</DetailOne>
      <DetailTwo>2</DetailTwo>
      <DetailOne>3</DetailOne>
      <DetailTwo>4</DetailTwo>
    </Details>
    <Details>
      <DetailOne>10</DetailOne>
      <DetailTwo>11</DetailTwo>
    </Details>
  </App>
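</Export>
My Python code just scans the subfolder xmlfiles and then tries to parse each file with PETL. Based on the structure of the documents, I am loading three tables so far:
1. Holds the Info name and date
2. Holds the description and type
3. Collects the details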
import petl as etl
import os
from lxml import etree

# Folder that holds the XML files, relative to the working directory
xml_dir = os.path.join(os.getcwd(), 'xmlfiles')

for filename in os.listdir(xml_dir):
    if filename.endswith('.xml'):
        path = os.path.join(xml_dir, filename)
        # Get the info children
        table1 = etl.fromxml(path, 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })
        # Get the App children
        table2 = etl.fromxml(path, 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })
        # Get the App Details children
        table3 = etl.fromxml(path, 'App/Details', {
            'DetailOne': 'DetailOne',
            'DetailTwo': 'DetailTwo'
        })
        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)
        print('Results:\n', result)

I cross join the three tables because I want the Info and App data on every row alongside each detail. This worked until I got an XML file with multiple DetailOne and DetailTwo elements inside a single Details element.
The results I am getting are:
Results:
+------------+----------+-------------+------+-----------+-----------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+==========+=============+======+===========+===========+==========+
| 01/01/2021 | John Doe | None | Two | 1 | 2 | one.xml |
+------------+----------+-------------+------+-----------+-----------+----------+
| 01/01/2021 | John Doe | None | Two | 10 | 11 | one.xml |
+------------+----------+-------------+------+-----------+-----------+----------+
Results:
+------------+------+--------------------------+------+------------+------------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One | ('1', '3') | ('2', '4') | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 10 | 11 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
The second file, which shows DetailOne as ('1', '3') and DetailTwo as ('2', '4'), is not what I want.
What I want is:
+------------+------+--------------------------+------+------------+------------+----------+
| Date | Name | Description | Type | DetailOne | DetailTwo | FileName |
+============+======+==========================+======+============+============+==========+
| 01/02/2021 | None | Sample description here. | One | 1 | 2 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 3 | 4 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
| 01/02/2021 | None | Sample description here. | One | 10 | 11 | two.xml |
+------------+------+--------------------------+------+------------+------------+----------+
I believe XPath may be the way to go, but after researching:
https://petl.readthedocs.io/en/stable/io.html#xml-files - does not go into depth on lxml and petl
Some light reading here: https://www.w3schools.com/xml/xpath_syntax.asp
Some more reading here: https://lxml.de/tutorial.html
Any help with this is much appreciated!
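Answered 2021-11-03 22:38:21
First, thank you for taking the time to write a good question. I am happy to spend the time answering it.
I have never used PETL, but I did scan its documentation on XML processing. I think your main problem is that the <Details> tag sometimes contains one pair of tags and sometimes several pairs. If only there were a way to extract a flat list of the tag values without the enclosing tags getting in the way...
Fortunately there is. I used https://www.webtoolkitonline.com/xml-xpath-tester.html, and the XPath expression //Details/DetailOne returns the list 1,3,10 when applied to your example XML.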
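The same query can be reproduced locally. The following is a minimal sketch that uses only the standard library's xml.etree.ElementTree rather than the online tester or PETL (that substitution is an assumption made for illustration), with the second sample file inlined as a string. The path is written with a leading `.` because xml.etree.ElementTree elements reject paths that begin with `/`.

```python
import xml.etree.ElementTree as ET

# Second sample file from the question, inlined so the sketch is self-contained.
XML_TWO = """<Export>
  <Info>
    <Name></Name>
    <Date>01/02/2021</Date>
  </Info>
  <App>
    <Description>Sample description here.</Description>
    <Type>One</Type>
    <Details>
      <DetailOne>1</DetailOne>
      <DetailTwo>2</DetailTwo>
      <DetailOne>3</DetailOne>
      <DetailTwo>4</DetailTwo>
    </Details>
    <Details>
      <DetailOne>10</DetailOne>
      <DetailTwo>11</DetailTwo>
    </Details>
  </App>
</Export>"""

root = ET.fromstring(XML_TWO)

# './/' means "at any depth below the current element".
detail_one = [e.text for e in root.findall('.//Details/DetailOne')]
detail_two = [e.text for e in root.findall('.//Details/DetailTwo')]

print(detail_one)  # ['1', '3', '10']
print(detail_two)  # ['2', '4', '11']
```

Both lists come back in document order, so values at the same position belong to the same DetailOne/DetailTwo pair in this sample.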
So I guess something like this should work:
import petl as etl
import os
from lxml import etree

# Folder that holds the XML files, relative to the working directory
xml_dir = os.path.join(os.getcwd(), 'xmlfiles')

for filename in os.listdir(xml_dir):
    if filename.endswith('.xml'):
        path = os.path.join(xml_dir, filename)
        # Get the info children
        table1 = etl.fromxml(path, 'Info', {
            'Name': 'Name',
            'Date': 'Date'
        })
        # Get the App children
        table2 = etl.fromxml(path, 'App', {
            'Description': 'Description',
            'Type': 'Type'
        })
        # Get the App Details children
        table3 = etl.fromxml(path, '/App', {
            'DetailOne': '//DetailOne',
            'DetailTwo': '//DetailTwo'
        })
        # concat
        c = etl.crossjoin(table1, table2, table3)
        # I want the filename added on
        result = etl.addfield(c, 'FileName', filename)
        print('Results:\n', result)

The leading // may be redundant. It is the XPath syntax for "at any level in the document". I do not know how PETL processes XPath, so I am trying to play it safe. I agree, by the way: the documentation is rather light on details.
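A flat list of values still has to be turned into one row per DetailOne/DetailTwo pair. One possible alternative is to walk each <Details> element and pair its children by position. The following is a minimal sketch, not a tested PETL recipe: it uses the standard library's xml.etree.ElementTree in place of etl.fromxml, the helper name rows_from_export is made up for illustration, and it assumes DetailOne and DetailTwo always occur in equal numbers and matching order inside a <Details> block.

```python
import xml.etree.ElementTree as ET

HEADER = ['Date', 'Name', 'Description', 'Type',
          'DetailOne', 'DetailTwo', 'FileName']

def rows_from_export(root, filename):
    """Hypothetical helper: build one row per DetailOne/DetailTwo pair."""
    info = root.find('Info')
    app = root.find('App')
    # findtext returns '' for an empty element; map that to None
    # to match the output shown in the question.
    shared = [
        info.findtext('Date') or None,
        info.findtext('Name') or None,
        app.findtext('Description') or None,
        app.findtext('Type') or None,
    ]
    rows = []
    for details in app.findall('Details'):
        ones = [e.text for e in details.findall('DetailOne')]
        twos = [e.text for e in details.findall('DetailTwo')]
        # Assumption: the i-th DetailOne belongs with the i-th DetailTwo.
        for one, two in zip(ones, twos):
            rows.append(shared + [one, two, filename])
    return rows

# Second sample file from the question, inlined for the demonstration.
XML_TWO = """<Export>
  <Info><Name></Name><Date>01/02/2021</Date></Info>
  <App>
    <Description>Sample description here.</Description>
    <Type>One</Type>
    <Details>
      <DetailOne>1</DetailOne><DetailTwo>2</DetailTwo>
      <DetailOne>3</DetailOne><DetailTwo>4</DetailTwo>
    </Details>
    <Details>
      <DetailOne>10</DetailOne><DetailTwo>11</DetailTwo>
    </Details>
  </App>
</Export>"""

rows = rows_from_export(ET.fromstring(XML_TWO), 'two.xml')
table = [HEADER] + rows
for row in table:
    print(row)
```

For the sample this yields three data rows, with detail pairs (1, 2), (3, 4) and (10, 11). To read from disk, ET.parse(path).getroot() would take the place of ET.fromstring. A list of rows whose first row is the header is a shape PETL accepts as a table, so the result could be passed on to PETL functions if the rest of the pipeline relies on them.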
https://stackoverflow.com/questions/69814848