Q: Parsing messy XML with Python

Stack Overflow user

Asked 2021-04-02 04:13:41

2 answers · 53 views · 0 followers · 0 votes

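I'm new to coding, and it would be great if someone could help me figure out how to parse an XML file. I'm trying to write a Python script that reads all the notes created in Gnome-Notes and displays them on the command line. I've got the note-loading part working, but I can't figure out how to parse the XML so that it shows only the text part. Sample data looks like this: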

<?xml version="1.0" encoding="UTF-8"?>
<note version="1" xmlns:link="http://projects.gnome.org/bijiben/link" xmlns:size="http://projects.gnome.org/bijiben/size" xmlns="http://projects.gnome.org/bijiben">
  <title>Testnote</title>
  <text xml:space="preserve"><html xmlns="http://www.w3.org/1999/xhtml"><head><link rel="stylesheet" href="Default.css" type="text/css" /><script language="javascript" src="bijiben.js"></script></head><body id="editable" style="color: white;">Some text for the note.</body></html></text>
  <last-change-date>2021-04-01T20:03:08Z</last-change-date>
  <last-metadata-change-date>2021-04-01T20:02:53Z</last-metadata-change-date>
  <create-date>2021-03-29T10:37:14Z</create-date>
  <cursor-position>0</cursor-position>
  <selection-bound-position>0</selection-bound-position>
  <width>0</width>
  <height>0</height>
  <x>0</x>
  <y>0</y>
  <color>rgb(0,0,0)</color>
  <tags/>
  <open-on-startup>False</open-on-startup>
</note>

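After parsing, I should get only the "Some text for the note." part. I have been trying to use ElementTree for this. While I had no problems with the "clean" XML files provided in examples, I don't know how to handle this one.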

python

xml

2 Answers

Stack Overflow user

Answered 2021-04-02 04:35:42

Using ElementTree should work:

from xml.etree import ElementTree as ET

data = '''\
<?xml version="1.0" encoding="UTF-8"?>
<note version="1" xmlns:link="http://projects.gnome.org/bijiben/link" xmlns:size="http://projects.gnome.org/bijiben/size" xmlns="http://projects.gnome.org/bijiben">
    <title>Testnote</title>
    <text xml:space="preserve">
        <html xmlns="http://www.w3.org/1999/xhtml">
            <head>
                <link rel="stylesheet" href="Default.css" type="text/css"/>
                <script language="javascript" src="bijiben.js"/>
            </head>
            <body id="editable" style="color: white;">Some text for the note.</body>
        </html>
    </text>
    <last-change-date>2021-04-01T20:03:08Z</last-change-date>
    <last-metadata-change-date>2021-04-01T20:02:53Z</last-metadata-change-date>
    <create-date>2021-03-29T10:37:14Z</create-date>
    <cursor-position>0</cursor-position>
    <selection-bound-position>0</selection-bound-position>
    <width>0</width>
    <height>0</height>
    <x>0</x>
    <y>0</y>
    <color>rgb(0,0,0)</color>
    <tags/>
    <open-on-startup>False</open-on-startup>
</note>
'''

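tree = ET.fromstring(data)
# Map a prefix of our choosing to the XHTML namespace that <body> lives in.
# The prefix only has meaning inside the search path below.
nmsp = {
    'xhtml': 'http://www.w3.org/1999/xhtml',
}

print(tree.find('.//xhtml:body', namespaces=nmsp).text)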
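Two details of this kind of file tend to trip up ElementTree lookups: the default `xmlns` on `<html>` means an unqualified search for `body` finds nothing, and `.text` only returns the text before the first child element. The sketch below is one possible way to handle both, using the `{namespace}tag` form in the search path and `itertext()`. The helper name `note_body_text` and the `<b>` element in the trimmed sample are assumptions added for illustration; they are not part of the original note.

```python
from xml.etree import ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Trimmed-down note; the <b> element is added here for illustration only.
SAMPLE = (
    '<note version="1" xmlns="http://projects.gnome.org/bijiben">'
    '<title>Testnote</title>'
    '<text xml:space="preserve">'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body id="editable">Some <b>text</b> for the note.</body>'
    '</html></text></note>'
)


def note_body_text(xml_source):
    """Return the full text of the XHTML <body>, or None if it is absent."""
    root = ET.fromstring(xml_source)
    # Braces around the namespace URI qualify the tag name directly.
    body = root.find(f".//{{{XHTML_NS}}}body")
    if body is None:
        return None
    # itertext() yields the text of the body and of every nested element.
    return "".join(body.itertext())


root = ET.fromstring(SAMPLE)
print(root.find(".//body"))      # None: the unqualified name matches nothing
print(note_body_text(SAMPLE))    # Some text for the note.
```

If the body never contains nested markup, `.text` alone is enough; `itertext()` only matters once the note holds formatted text.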

Votes: 1

Stack Overflow user

Answered 2021-04-02 04:27:26

You can use a regular expression to extract the string between the body tags:

<body.*>(.*)</body>

The first .* matches any character, zero or more times, to account for any attributes in the body tag.

(.*) captures whatever is between the tags.

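import re

# Read the whole note into memory; the sample file declares UTF-8.
with open('file.xml', 'r', encoding='utf-8') as file:
    data = file.read()

x = re.search(r"<body.*>(.*)</body>", data)

# re.search returns None when nothing matches, so guard before calling group().
if x:
    print(x.group(1))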
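The pattern above relies on greedy `.*` and on the body sitting on a single line, because `.` does not match a newline by default. A slightly more defensive variant, offered only as a sketch, restricts the attribute match to `[^>]*`, captures non-greedily, and sets `re.DOTALL`. The names `BODY_RE` and `extract_body` are hypothetical, and a regex remains a rough tool for XML compared with a real parser.

```python
import re

# [^>]* keeps the attribute match inside the opening tag; (.*?) stops at the
# first </body>; DOTALL lets the captured text span several lines.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body>", re.DOTALL | re.IGNORECASE)


def extract_body(markup):
    """Return the text between <body ...> and </body>, or None if not found."""
    match = BODY_RE.search(markup)
    return match.group(1) if match else None


sample = (
    '<html><body id="editable"\n style="color: white;">'
    'Some text\nfor the note.</body></html>'
)
print(extract_body(sample))
print(extract_body("<note/>"))
```

Returning `None` instead of raising leaves it to the caller to decide how to treat a note without a body.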

Votes: -1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/66911003
