Q: From data to hierarchical XML

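Stack Overflow user

Asked 2019-08-27 20:53:35

1 answer · 343 views · 0 followers · 1 vote

I read a CSV into a DataFrame and then use the lxml library to convert it to XML.

This is my first time working with XML, and it seems to be only partly successful. Any help would be appreciated.

The CSV file used to create the DataFrame: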

Parent,Element,Text,Attribute
,TXLife,"
    ",{'Version': '2.25.00'}
TXLife,UserAuthRequest,"
        ",{}
UserAuthRequest,UserLoginName,*****,{}
UserAuthRequest,UserPswd,"
            ",{}
UserPswd,CryptType,None,{}
UserPswd,Pswd,****,{}
TXLife,TXLifeRequest,"
        ",{'PrimaryObjectID': 'Policy_1'}
TXLifeRequest,TransRefGUID,706D67C1-CC4D-11CF-91FB444554540000,{}
TXLifeRequest,TransType,Holding Change,{'tc': '502'}
TXLifeRequest,TransExeDate,2006-11-19,{}
TXLifeRequest,TransExeTime,13:15:33-07:00,{}
TXLifeRequest,ChangeSubType,"
            ",{}
ChangeSubType,ChangeTC,Change Participant,{'tc': '9'}
TXLifeRequest,OLifE,"
            ",{}
OLifE,Holding,"
                ",{'id': 'Policy_1'}
Holding,HoldingTypeCode,Policy,{'tc': '2'}
Holding,Policy,"
                    ",{}
Policy,PolNumber,1234567,{}
Policy,LineOfBusiness,Annuity,{'tc': '2'}
Policy,Annuity,,{}
OLifE,Party,"
                ",{'id': 'Beneficiary_1'}
Party,PartyTypeCode,Organization,{'tc': '2'}
Party,FullName,The Smith Trust,{}
Party,Organization,"
                    ",{}
Organization,OrgForm,Trust,{'tc': '16'}
Organization,DBA,The Smith Trust,{}
OLifE,Relation,"
                ","{'id': 'Relation_1', 'OriginatingObjectID': 'Policy_1', 'RelatedObjectID': 'Beneficiary_1'}"
Relation,OriginatingObjectType,Holding,{'tc': '4'}
Relation,RelatedObjectType,Party,{'tc': '6'}
Relation,RelationRoleCode,Primary Beneficiary,{'tc': '34'}
Relation,BeneficiaryDesignation,Named,{'tc': '1'}

import lxml.etree as etree
import pandas as pd
import json

# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv') .fillna('NA')
# # Remove rows with comments
# dfc = dfc[~dfc['Element'].str.contains("<cyfunction")].fillna('')
dfc['Attribute'] = dfc['Attribute'].apply(lambda x: x.replace("'", '"'))

# Add the root element for xml
root = etree.Element(dfc['Element'][0])
tree = root.getroottree()

for prnt, elem, txt, attr in dfc[['Parent', 'Element', 'Text', 'Attribute']][1:].values:
    # Convert attributes to json (dictionary)
    attrib = json.loads(attr)
    # list(root) = root.getchildren()
    children = [item for item in str(list(root)).split(' ')]
    rootstring = str(root).split(' ')[1]

#     If the parent is root then add the element as child (appaers to work?)
    if prnt == str(root).split(' ')[1]:
        parent = etree.SubElement(root, elem)

    # If the parent is not root but is one of its children then add the elements to the parent
    elif not prnt == rootstring and prnt in children:
        child = etree.SubElement(parent, elem, attrib).text = txt

#     # If the parent is not in root's descendents then add the childern to the parents
    elif not prnt in [str(item).split(' ') for item in root.iterdescendants()]:
        child = etree.SubElement(parent, elem, attrib).text = txt

print(etree.tostring(tree, pretty_print=True).decode())

Actual result:

[XML listing not preserved; the markup was stripped in extraction, leaving only the element text.]

Expected result:

[XML listing not preserved; the markup was stripped in extraction, leaving only the element text.]

How can I get the hierarchical result shown above?

Tags: python, xml, pandas, lxml

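1 Answer

Stack Overflow user

Accepted answer

Posted 2019-08-31 03:11:27

You've made a good start! I think the easiest approach is to go through your code point by point, explain where it needs adjusting, and suggest some improvements:

Reading and cleaning the data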

# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv').fillna('NA')
# # Remove rows with comments
# dfc = dfc[~dfc['Element'].str.contains("<cyfunction")].fillna('')
dfc['Attribute'] = dfc['Attribute'].apply(lambda x: x.replace("'", '"'))

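.apply works fine, but there is also a .str.replace() method you can use, which is a bit neater and clearer (.str lets you treat the column's values as strings and operate on them accordingly).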
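A small sketch comparing the two approaches on a couple of sample values in the same format as the Attribute column. The sample Series is made up for illustration, and regex=False is passed only to make the literal match explicit regardless of pandas version.

```python
import json

import pandas as pd

# Two sample values in the same format as the CSV's Attribute column.
attrs = pd.Series(["{'tc': '502'}", "{}"])

# Element-wise Python function, as in the question.
via_apply = attrs.apply(lambda x: x.replace("'", '"'))

# Vectorised string accessor; regex=False treats the pattern as a literal.
via_str = attrs.str.replace("'", '"', regex=False)

# Either result can then be parsed into dictionaries with json.loads.
parsed = via_str.apply(json.loads)

print(via_apply.equals(via_str))
print(parsed[0])
```

Both versions produce the same strings; the .str form simply states the intent more directly.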

Adding the root

# Add the root element for xml
root = etree.Element(dfc['Element'][0])
tree = root.getroottree()

All good!

Looping

for prnt, elem, txt, attr in dfc[['Parent', 'Element', 'Text', 'Attribute']][1:].values:

Since you are retrieving all of the columns, you don't need to index into dfc to select them, so you can drop that part:

for prnt, elem, txt, attr in dfc[1:].values:

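This is fine, but there are built-in ways to iterate over the rows of a DataFrame, and we can use itertuples(). This returns a NamedTuple for each row, which includes the index (basically the row number) as the first item of the tuple, so we need to adjust for that: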

for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
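To see what itertuples() yields, here is a minimal sketch on a three-row frame with the same column names as the CSV. The rows are a shortened sample, not the full file.

```python
import pandas as pd

# A shortened stand-in for the DataFrame read from the CSV.
dfc = pd.DataFrame(
    {
        "Parent": ["NA", "TXLife", "UserAuthRequest"],
        "Element": ["TXLife", "UserAuthRequest", "UserLoginName"],
        "Text": ["NA", "NA", "*****"],
        "Attribute": ["{}", "{}", "{}"],
    }
)

# Skip the root row, then iterate; each item is a namedtuple.
rows = list(dfc[1:].itertuples())
first = rows[0]

# The index comes first, followed by the columns in order.
idx, prnt, elem, txt, attr = first

# Fields are also available by name.
print(first.Index, first.Parent, first.Element)
```

If the index is not needed, itertuples(index=False) drops it, and the loop can unpack four values instead of five.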

Setting variables

    # Convert attributes to json (dictionary)
    attrib = json.loads(attr)
    # list(root) = root.getchildren()
    children = [item for item in str(list(root)).split(' ')]
    rootstring = str(root).split(' ')[1]

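Replacing the single quotes with double quotes is a neat trick so that we can use json to turn the attributes into a dictionary. Every Element has a .tag attribute that we can use to get its name, which is exactly what we want here: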

children = [item.tag for item in root]
rootstring = root.tag

list(root) or root.getchildren() will both give us a list of root's child elements, but we can also iterate over them with for ... in, using root directly like this.
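A short sketch of reading names through .tag and iterating an element directly. It uses xml.etree.ElementTree from the standard library as a stand-in so it runs without extra packages; Element, SubElement, .tag and iteration over children behave the same way in lxml.etree.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree

root = ET.Element("TXLife")
ET.SubElement(root, "UserAuthRequest")
ET.SubElement(root, "TXLifeRequest")

# The element name, without parsing the string form of the object.
rootstring = root.tag

# Iterating an element yields its direct children.
children = [item.tag for item in root]

print(rootstring, children)
```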

Adding elements to the tree

#     If the parent is root then add the element as child (appaers to work?)
    if prnt == str(root).split(' ')[1]:
        parent = etree.SubElement(root, elem)

    # If the parent is not root but is one of its children then add the elements to the parent
    elif not prnt == rootstring and prnt in children:
        child = etree.SubElement(parent, elem, attrib).text = txt

#     # If the parent is not in root's descendents then add the childern to the parents
    elif not prnt in [str(item).split(' ') for item in root.iterdescendants()]:
        child = etree.SubElement(parent, elem, attrib).text = txt

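str(root).split(' ')[1] is exactly what we set rootstring to above, so we can use that instead.
Since we have already checked whether prnt == rootstring in the first if statement, by the time we reach the first elif we know it cannot be equal, so we don't need to check it again.
When we create the child, we are doing two assignments at once. It successfully sets the text (!), but it means child is set to the text rather than the new SubElement. It is better to do this in two steps.
When we look for the parent, we are currently building a list of lists (split() returns a list), so it can't work. What we want is the item tags.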
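The last two points can be checked in isolation. This is a minimal sketch using the standard library's xml.etree.ElementTree as a stand-in for lxml.etree; the chained assignment and the list membership test are plain Python behaviour and do not depend on the XML library.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree

root = ET.Element("TXLife")

# Chained assignment: the string on the right is bound to BOTH targets,
# so `child` ends up as the text, not the new element.
child = ET.SubElement(root, "TransType").text = "Holding Change"
chained_type = type(child).__name__

# Two steps keep a reference to the element itself.
elem = ET.SubElement(root, "TransExeDate")
elem.text = "2006-11-19"

# split() returns a list, so this is a list of lists and a bare tag
# name is never a member of it.
pieces = [str(item).split(" ") for item in root.iter()]
found_in_pieces = "TransType" in pieces

# A list of tags is what the membership test actually needs.
tags = [item.tag for item in root.iter()]
found_in_tags = "TransType" in tags

print(chained_type, found_in_pieces, found_in_tags)
```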

All of these changes give us:

    # If the parent is root then add the element as a child (appears to work?)
    if prnt == rootstring:
        parent = etree.SubElement(root, elem)

    # If the parent is not root but is one of its children then add the element to the parent
    elif prnt in children:
        child = etree.SubElement(parent, elem, attrib)
        child.text = txt

    # If the parent is not among root's descendants then add the child to the parent
    elif not prnt in [item.tag for item in root.iterdescendants()]:
        child = etree.SubElement(parent, elem, attrib)
        child.text = txt

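But there are some problems here.

The first part (the if statement) is fine.

In the second part (the first elif), we check whether the new element's parent is one of root's children. If it is, we add the new element as a child of parent. parent is definitely a child of root, but we haven't actually checked that it is the right one. It is just the last thing we added to root. Luckily, because our CSV has all the elements in order, this turns out right, but it would be better to be explicit about it.

In the third part (the second elif), it is good to check whether prnt already exists lower down the tree. But currently, if prnt doesn't exist, we add the element to parent, which isn't its actual parent! And if prnt does exist, the element isn't added at all (so an else clause is needed here).

Solution

Fortunately there is an easy way: we can use .find() to locate the prnt element wherever it is in the tree, and then add the new element there. This makes the whole thing shorter too!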

for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
    # Convert attributes to json (dictionary)
    attrib = json.loads(attr)
    # Find parent element
    if prnt == root.tag:
        parent = root
    else:
        parent = root.find(".//" + prnt)
    child = etree.SubElement(parent, elem, attrib)
    child.text = txt

The .// in root.find(".//" + prnt) means it will search for a matching element tag anywhere in the tree (read more here: https://lxml.de/tutorial.html#elementpath).
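A brief sketch of how the .// prefix changes what find() matches, again with xml.etree.ElementTree standing in for lxml.etree (both support this path syntax). It also shows why the loop treats the root as a special case: .// searches descendants, so it does not match the root element itself.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree

root = ET.Element("TXLife")
request = ET.SubElement(root, "TXLifeRequest")
olife = ET.SubElement(request, "OLifE")
holding = ET.SubElement(olife, "Holding")

# ".//" matches a descendant at any depth.
found = root.find(".//Holding")

# A bare tag name only looks at direct children.
direct = root.find("Holding")

# No element with this tag exists, so find() returns None.
missing = root.find(".//Party")

# The root is not its own descendant, hence the `prnt == root.tag` branch.
self_match = root.find(".//TXLife")

print(found is holding, direct, missing, self_match)
```

Because find() returns None when nothing matches, a parent name that is missing from the tree would only surface later, when SubElement is called with None as the parent.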

Final script

import lxml.etree as etree
import pandas as pd
import json

# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv').fillna("NA")
dfc['Attribute'] = dfc['Attribute'].str.replace("'", '"').apply(lambda s: json.loads(s))

# Add the root element for xml
root = etree.Element(dfc['Element'][0], dfc['Attribute'][0])

for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
    # Fix text
    text = txt.strip()
    if not text:
        text = None
    # Find parent element
    if prnt == root.tag:
        parent = root
    else:
        parent = root.find(".//" + prnt)
    # Create element
    child = etree.SubElement(parent, elem, attr)
    child.text = text

xml_string = etree.tostring(root, pretty_print=True).decode().replace(">NA<", "><")
print(xml_string)

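I've made a few more changes:

I moved the json.loads conversion of the attribute dictionaries up to where the quotes are changed, chaining it on the end with apply. We need this so that the dictionary is ready when we create the root element.
There were some issues getting pretty printing to work, which is what the "Fix text" part is for (see this Stack Overflow question for the problem I ran into).
Using .fillna("") (filling with an empty string) would be neatest, but if we do that we get <Annuity/> instead of <Annuity></Annuity> (which is valid XML: if you have an element with no text or children, you can use a single self-closing tag). But to have it output the way we want, the element needs some "content" so that the opening tag is created. So I left it as .fillna("NA") and then manually replace that text in the output string at the end.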
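On the first point, the quote swap works for this data but would break on a value containing an apostrophe. One possible alternative, which is not part of the original answer, is ast.literal_eval from the standard library: the Attribute column holds Python dict literals, so it can parse them directly. The "tricky" value below is hypothetical and does not appear in the CSV.

```python
import ast
import json

raw = "{'id': 'Relation_1', 'OriginatingObjectID': 'Policy_1'}"

# The approach used in the script: swap quotes, then parse as JSON.
via_json = json.loads(raw.replace("'", '"'))

# Alternative: parse the Python literal as it stands.
via_ast = ast.literal_eval(raw)

# Hypothetical value with an apostrophe inside a double-quoted string.
tricky = """{'DBA': "Smith's Trust"}"""

# The quote swap turns the apostrophe into a stray double quote.
try:
    json.loads(tricky.replace("'", '"'))
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

tricky_parsed = ast.literal_eval(tricky)

print(via_json == via_ast, json_ok, tricky_parsed)
```

literal_eval only accepts literals (strings, numbers, dicts, lists and so on), so it does not execute arbitrary code from the CSV.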

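It should also be noted that this script makes (at least) four assumptions about the input data:

Parent elements are created before their children (i.e. they appear higher up in the CSV file).
Element names are unique (or at least, any repeated names don't have children, so we never run .find() where there could be more than one match; .find() always returns the first match).
You don't want to keep any text values of 'NA' in the final XML (they will be removed along with the spurious 'NA' text we strip from the Annuity element).
Text consisting only of whitespace does not need to be preserved.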
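If the first two assumptions are a concern, one way to loosen them is sketched below. This is not the answer's method: build_tree is a hypothetical helper that remembers the most recently created element for each tag and attaches children to that one, on the assumption that a child row always refers to the nearest preceding element with its parent's name. It also raises an error when a parent has not been seen yet. The sketch uses only the standard library (csv and xml.etree.ElementTree in place of pandas and lxml), and the sample rows are a shortened, modified extract with a second Party added.

```python
import ast
import csv
import io
import xml.etree.ElementTree as ET  # stand-in for lxml.etree

# Shortened sample; the second Party row is added to show repeated names.
SAMPLE = """Parent,Element,Text,Attribute
,TXLife,,{'Version': '2.25.00'}
TXLife,TXLifeRequest,,{'PrimaryObjectID': 'Policy_1'}
TXLifeRequest,OLifE,,{}
OLifE,Party,,{'id': 'Beneficiary_1'}
Party,FullName,The Smith Trust,{}
OLifE,Party,,{'id': 'Beneficiary_2'}
Party,FullName,Another Trust,{}
"""


def build_tree(csv_text):
    """Hypothetical helper: build a tree from Parent/Element/Text/Attribute rows."""
    latest = {}  # tag -> most recently created element with that tag
    root = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        attrib = ast.literal_eval(row["Attribute"])
        text = row["Text"].strip() or None  # drop whitespace-only text
        if root is None:
            elem = root = ET.Element(row["Element"], attrib)
        else:
            parent = latest.get(row["Parent"])
            if parent is None:
                raise ValueError(
                    f"parent {row['Parent']!r} not seen before {row['Element']!r}"
                )
            elem = ET.SubElement(parent, row["Element"], attrib)
        elem.text = text
        latest[row["Element"]] = elem
    return root


root = build_tree(SAMPLE)
names = [party.find("FullName").text for party in root.iter("Party")]
print(names)
```

With root.find(".//Party"), both FullName rows would be attached to the first Party; tracking the latest element per tag keeps each one under the Party that precedes it.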

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/57682123
