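I have the following MWE, in which I want to collect a list of all top-level rdf:Description elements in entries. The actual dump I am trying to parse has more than 2 million elements, which is why I specifically want to use iterparse.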
from io import BytesIO
from lxml import etree
from copy import deepcopy
xmlstring = """<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:wdrs="http://www.w3.org/2007/05/powder-s#" xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="https://ld.zdb-services.de/resource/5-X">
<dc:subject rdf:datatype="https://d-nb.info/standards/elementset/dnb#ddc-subject-category">940</dc:subject>
<wdrs:describedby>
<rdf:Description rdf:about="https://ld.zdb-services.de/data/5-X">
<dcterms:license rdf:resource="http://creativecommons.org/publicdomain/zero/1.0/"/>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2020-03-18T12:21:08.000</dcterms:modified>
</rdf:Description>
</wdrs:describedby>
</rdf:Description>
<rdf:Description rdf:about="https://ld.zdb-services.de/resource/7-3">
<dc:subject rdf:datatype="https://d-nb.info/standards/elementset/dnb#ddc-subject-category">590</dc:subject>
<wdrs:describedby>
<rdf:Description rdf:about="https://ld.zdb-services.de/data/7-3">
<dcterms:license rdf:resource="http://creativecommons.org/publicdomain/zero/1.0/"/>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2021-11-03T21:41:02.000</dcterms:modified>
</rdf:Description>
</wdrs:describedby>
<dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#string">1963-2008</dcterms:issued>
<owl:sameAs>
<rdf:Description rdf:about="https://d-nb.info/01000002X">
<owl:sameAs>
<rdf:Description rdf:about="https://ld.zdb-services.de/resource/7-3">
<owl:sameAs rdf:resource="http://hub.culturegraph.org/resource/DNB-01000002X"/>
</rdf:Description>
</owl:sameAs>
</rdf:Description>
</owl:sameAs>
</rdf:Description>
</rdf:RDF>
"""
entries = []
for event, elem in etree.iterparse(BytesIO(xmlstring.encode("UTF-8")), tag='{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description', events=("start", "end")):
    if elem.tag == "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description" and event == "end":
        entries.append(elem)
    elem.clear()
len(entries)

This code currently returns 6, because it finds all six opening <rdf:Description tags, nested ones included. I would like to extract each element the way I would with parse and tree.findall('./rdf:Description', NAMESPACES). Thanks for your help!
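For comparison, the non-streaming baseline mentioned above can be sketched as follows. NAMESPACES is not defined in the question, so the mapping below is an assumption, and the small sample document with its urn:example identifiers is made up for illustration. This approach loads the whole tree into memory, which is exactly what iterparse is meant to avoid for a large dump.

```python
from io import BytesIO
from lxml import etree

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed prefix mapping; the question references NAMESPACES without defining it.
NAMESPACES = {"rdf": RDF_NS}

# Hypothetical miniature document: one nested and two top-level descriptions.
sample = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="urn:example:a">
    <rdf:Description rdf:about="urn:example:a-nested"/>
  </rdf:Description>
  <rdf:Description rdf:about="urn:example:b"/>
</rdf:RDF>"""

tree = etree.parse(BytesIO(sample))
# "./rdf:Description" matches only direct children of the root element.
top_level = tree.findall("./rdf:Description", NAMESPACES)
abouts = [e.get(f"{{{RDF_NS}}}about") for e in top_level]
print(abouts)
```

Only the two direct children of the root are returned; the nested description stays inside its parent.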
Edit:
Since then, I have found that filtering on the parent tag helps with the selection:
entries = []
for event, elem in etree.iterparse(BytesIO(xmlstring.encode("UTF-8")), events=("start", "end")):
    if elem.tag == "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description" and event == "end":
        if elem.getparent().tag == "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF":
            entries.append(deepcopy(elem))
    elem.clear()

However, the deep copy of the element is still missing the nested Description tags. So this works:
for e in entries:
    print(e.attrib['{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about'])

But this does not:
for e in entries:
    print(e.find('.//{http://purl.org/dc/terms/}modified').text)

because the inner <rdf:Description> tags have somehow been lost.
Posted on 2022-07-05 14:34:42
Why not use a generator expression or a list comprehension?

gen = (
    elem.attrib['{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about']
    for event, elem in etree.iterparse(BytesIO(xmlstring.encode("UTF-8")), tag='{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description', events=("start", "end"))
    if elem.tag == "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}Description" and event == "end"
)

This yields six entries. Is that what you are trying to achieve?

In [11]: next(gen)
Out[11]: 'https://ld.zdb-services.de/data/5-X'

https://stackoverflow.com/questions/72869343
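If only the top-level descriptions are wanted, with their nested content intact, one possible approach is to listen for "end" events only, skip any rdf:Description whose parent is not the document root, and clear an element only after it has been copied. Calling elem.clear() on nested elements before the enclosing element's end event removes their children, which would explain copies that lack the inner tags. The following is a minimal sketch under those assumptions, not a tested solution for the real dump; the helper name iter_top_level_descriptions and the sample document are hypothetical.

```python
from copy import deepcopy
from io import BytesIO
from lxml import etree

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DESCRIPTION = f"{{{RDF_NS}}}Description"


def iter_top_level_descriptions(source):
    """Yield detached copies of rdf:Description elements that are direct children of the root."""
    for _event, elem in etree.iterparse(source, events=("end",), tag=DESCRIPTION):
        parent = elem.getparent()
        # A top-level element is one whose parent is the root (the root has no parent).
        if parent is None or parent.getparent() is not None:
            continue  # nested description: leave it alone so its ancestor stays complete
        yield deepcopy(elem)
        # Free memory only after the copy was made.
        elem.clear()
        while elem.getprevious() is not None:
            del parent[0]  # drop already-processed siblings from the root


# Hypothetical miniature document for illustration.
sample = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="urn:example:a">
    <rdf:Description rdf:about="urn:example:a-nested">
      <dcterms:modified>2020-03-18</dcterms:modified>
    </rdf:Description>
  </rdf:Description>
  <rdf:Description rdf:about="urn:example:b"/>
</rdf:RDF>"""

entries = list(iter_top_level_descriptions(BytesIO(sample)))
abouts = [e.get(f"{{{RDF_NS}}}about") for e in entries]
modified = entries[0].find(".//{http://purl.org/dc/terms/}modified").text
print(abouts, modified)
```

Because the helper is a generator, each entry can also be processed and discarded one at a time instead of collecting all of them in a list, which may matter with more than 2 million elements.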