文章/答案/技术大牛

发布

社区首页 >问答首页 >无法使用Python解析XML中的标记

问无法使用Python解析XML中的标记
EN

Stack Overflow用户

提问于 2016-06-05 13:06:50

回答 3查看 1.3K关注 0票数 1

我不知道如何从作为Characters文件一部分的XML中获取标记DOCX。DOCX文件包含多个文件，包括app.xml。我想从这个<Characters>获取标记或属性XML。

from lxml import etree

def docx_get_characters_number(path):
    document = zipfile.ZipFile(path)
    xml_content = document.read('docProps/app.xml')
    document.close()
    root = etree.fromstring(xml_content,etree.XMLParser())
    return root.xpath('.//Characters')

这个函数返回[]，但我不知道为什么。

为了测试解析器是否工作，我打印了返回如下内容的root.xpath('.//*')：

[<Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Template at 0x3a8d260>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}TotalTime at 0x3a8d288>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Pages at 0x3a8d2b0>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Words at 0x3a8d2d8>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Characters at 0x3a8d300>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Application at 0x3a8d328>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}DocSecurity at 0x3a8d350>, <Element {http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}Lines at 0x3a8d378>, ... etc.

你知道问题出在哪里吗？

编辑：

我已经找到了一种方法，但它并不优雅，我想我应该继续寻找另一种方法，但是：

def docx_get_characters_number(path):
    document = zipfile.ZipFile(path)
    xml_content = document.read('docProps/app.xml')
    document.close()
    these_regex="<Characters>(.+?)</Characters>"
    pattern=re.compile(these_regex)
    return re.findall(pattern,xml_content)[0]

xml

xpath

lxml

python

回答 3

Stack Overflow用户

回答已采纳

发布于 2016-06-05 13:30:50

您可以像这样使用etree模块：

#to load the xml
from lxml import etree
doc=etree.parse("the path to your xml file")


#to find the Characters tab
doc.find("Characters").text

当然，如果Characters不是根节点，则需要放置完整的路径。有点像Property/Character

票数 -1

Stack Overflow用户

发布于 2016-06-05 13:56:45

这是一个与默认命名空间相关的常见问题。XML在根元素处声明了默认名称空间：

xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"

这意味着，所有没有前缀的元素，包括目标元素<Characters>，都会在该命名空间中得到考虑。在命名空间中引用元素的正确方法是将前缀映射到命名空间URI，并在XPath中相应地使用该前缀：

.....
ns = {'d': 'http://schemas.openxmlformats.org/officeDocument/2006/extended-properties'}
return root.xpath('d:Characters', namespaces=ns)

票数 1

Stack Overflow用户

发布于 2016-06-05 13:33:50

本例中的问题是文件的xml声明。属性独立可能只包含值是或否。所以我机器上的lxml在解析阶段就退出了..。

当我使用以下文件时：

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties>
    <Characters>13088</Characters>
</Properties>

下面的脚本确实找到了包含元素的字符计数：

#! /usr/bin/env python
# -*- coding: UTF-8 -*-
"""Explain here what this module has to offer."""
from __future__ import print_function

from lxml import etree as et


def docx_get_characters_number(path_unzipped):
    """Changed to focus on parsing the XML file."""
    with open(path_unzipped, 'rt') as f:
        xml_string = f.read()
    return et.fromstring(xml_string, et.XMLParser()).xpath('.//Characters')

if __name__ == '__main__':
    path_unzipped = 'docProps/app.xml'
    print(docx_get_characters_number(path_unzipped))

因此它给出了：

[<Element Characters at 0x108bcb518>]

相反，在xml声明中，独立属性的值从"yes“替换为"true”，给出如下结果：

Traceback (most recent call last):
File "/some_place/so_parse_characters_xpath.py", line 17, in <module>
  print(docx_get_characters_number(path_unzipped))
File "/some_place/so_parse_characters_xpath.py", line 13, in docx_get_characters_number
  return et.fromstring(xml_string, et.XMLParser()).xpath('.//Characters')
File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77697)
File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116494)
File "src/lxml/parser.pxi", line 1707, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115144)
File "src/lxml/parser.pxi", line 1079, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:109543)
File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103404)
File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105058)
File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:103967)

lxml.etree.XMLSyntaxError:独立用户只接受“是”或“否”，第1行，第50栏

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/37642258

复制

相似问题

问无法使用Python解析XML中的标记
EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问无法使用Python解析XML中的标记EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问无法使用Python解析XML中的标记
EN