这是here的后续问题.因为这个论坛上有大量的其他话题,所以它丢失了。也许我提的问题太复杂了。从那时起,我改进并简化了这个方法。总之:我希望从多个XML文件中的子字段中提取数据,并将这些数据附加到匹配位置上的新df中。
这是一个示例XML-1:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>238981</vrednost>
</energent>
<energent>
<sifra>energy_to</sifra>
<naziv>Do</naziv>
<vrednost>16359</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>his是一个示例XML-2:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qfl>1808</Qfl>
<fOVE>13.7</fOVE>
<NetoVolumen>613</NetoVolumen>
<Hv>104.2</Hv>
<energenti>
<energent>
<sifra>energy_e</sifra>
<naziv>EE [kWh]</naziv>
<vrednost>424242</vrednost>
</energent>
<energent>
<sifra>energy_en</sifra>
<naziv>Do</naziv>
<vrednost>29</vrednost>
</energent>
<rei>
<zavetrovanost>2</zavetrovanost>
<cone>
<cona>
<cona_id>1</cona_id>
<cc_si_cona>1110000</cc_si_cona>
<visina_cone>2.7</visina_cone>
<dolzina_cone>14</dolzina_cone>
</cona>
<cona>
<cona_id>2</cona_id>
<cc_si_cona>120000</cc_si_cona>
</cona>
</rei>
</reiXmlPrenos>我的代码:
import xml.etree.ElementTree as ETree
import pandas as pd
xmldata = r"C:\...\S1.xml"
prstree = ETree.parse(xmldata)
root = prstree.getroot()
# print(root)
store_items = []
all_items = []
for storeno in root.iter('energent'):
cona_sifra = storeno.find('sifra').text
cona_vrednost = storeno.find('vrednost').text
store_items = [cona_sifra, cona_vrednost]
all_items.append(store_items)
xmlToDf = pd.DataFrame(all_items, columns=[
'sifra', 'vrednost'])
print(xmlToDf.to_string(index=False))这导致:
sifra vrednost
energy_e 238981
energy_to 16359这对于一个例子来说是很好的。但是我有1000个XML文件,希望1)每个XML都有1行的结果,2)区分不同的'sifra‘代码。
可以有例如energy_e, energy_en, energy_to
所以理想情况下,最终的df应该是这样的
xml energy_e energy_en energy_to
xml-1 238981 0 16539
xml-2 424242 29 0 能办到吗?
发布于 2022-11-04 20:26:24
只需使用,因为您需要的部分是文档的平面部分:
energy_df = pd.read_xml("Input.xml", xpath=".//energent") # IF lxml INSTALLED
energy_df = pd.read_xml("Input.xml", xpath=".//energent", parser="etree") # IF lxml NOT INSTALLED要绑定多个XML文件,只需从XML文件路径列表构建一个数据帧列表,为源文件添加一个列,然后运行将所有数据绑定到单个数据框架中:
xml_files = [...]
energy_dfs = [
pd.read_xml(f, xpath=".//energent", parser="etree").assign(source=f) for f in xml_files
]
energy_long_df = pd.concat(energy_dfs, ignore_index=True)然后,从您想要的输出中,您可以使用sifra列使用传递值。
energy_wide_df = energy_long_df.pivot_table(
values="vrednost", index="source", columns="sifra", aggfunc="sum"
)发布于 2022-11-04 12:59:34
如果我正确理解这种情况,这是可以做到的-但由于复杂性,我将在这里使用lxml,而不是ElementTree。
我将尝试对代码进行一点注释,但您必须认真阅读这方面的内容。
顺便说一下,您发布的两个xml文件的格式不太好( <energenti>和<cone>的关闭标记丢失了),但是假设这是固定的-尝试如下:
from lxml import etree
xmls =[XML-1,XML-2]
#note: For simplicity, I'm using the well formed version of the xml strings in your question; you'll have to use actual file names and paths
energies = ["xml", "energy_e", "energy_en", "energy_to", "whatever"]
#I just made up some names - you'll have to use actual names, of course; the first one is for the file identifier - see below
rows = []
for xml in xmls:
row = []
id = "xml-"+str(xmls.index(xml)+1)
#this creates the file identifier
row.append(id)
root = etree.XML(xml.encode())
#in real life, you'll have to use the parse() method
for energy in energies[1:]:
#the '[1:]' is used to skip the first "energy"; it's only used as the file identifier
target = root.xpath(f'//energent[./sifra[.="{energy}"]]/vrednost/text()')
#note the use of f-strings
row.extend( target if len(target)>0 else "0" )
rows.append(row)
print(pd.DataFrame(rows,columns=energies))输出:
xml energy_e energy_en energy_to whatever
0 xml-1 238981 0 16359 0
1 xml-2 424242 29 0 0https://stackoverflow.com/questions/74314696
复制相似问题