Q: How to read a simple ISO-8859-1 encoded XML file with lxml using child.text

Stack Overflow user

Asked 2019-06-13 13:51:56

Answers: 2 · Views: 731 · Followers: 0 · Votes: 0

I have XML files with a simple structure:

<?xml version="1.0" encoding="iso-8859-1"?>
<DICTIONARY>
    <Tag1>
        Übung1 Übersetzung1
        Übung2 Übersetzung2
        Übung3 Übersetzung3
        Übung4 Übersetzung4
        Übung5 Übersetzung5
    </Tag1>
    <Tag2>
        Übung6 Übersetzung6
        Übung7 Übersetzung7
        Übung8 Übersetzung8
        Übung9 Übersetzung9
        Übung10 Übersetzung10
    </Tag2>
</DICTIONARY>

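I want to read these files with lxml because it is simple. I use child.text to read the text parts, but the encoding does not seem to carry over to the output strings. See the code and output below.

I have already used codecs to read the file as iso-8859-1, but it did not change anything.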

from lxml import etree
import codecs

def read_xml(): 
    taglist=[]
    new_dicts=[]
    with codecs.open("A:/test/test.txt", 'r', 
                     encoding='iso-8859-1') as xmlfile:
        try:
            tree=etree.parse(xmlfile)
            loaded=True
            print ("XML-encoding: ",tree.docinfo.encoding)
        except:
            loaded=False
            print ("""No dictionary loaded or xml structure is missing! Please try again!""")


    if loaded:

        root = tree.getroot()

        for child in root:
            new_dict={}
            tagname=child.tag
            taglist.append(tagname)

            print ("Loading dictionary for tag: ",
                   tagname)
            allstrings= child.text                
            allstrings=allstrings.split("\n")

            for line in allstrings:
                if line!=" " and line!="":
                    line=line.split("\t")
                    if line[0]!="" and line[1]!="":
                        enc_line0=line[0]
                        enc_line1=line[1]
                        new_dict.update({enc_line0:enc_line1})
            new_dicts.append(new_dict)

    return taglist, new_dicts
print (read_xml())

Output:

XML-encoding:  iso-8859-1
Loading dictionary for tag:  Tag1
Loading dictionary for tag:  Tag2
(['Tag1', 'Tag2'], [{'Ã\x9cbung1': 'Ã\x9cbersetzung1', 'Ã\x9cbung2': 'Ã\x9cbersetzung2', 'Ã\x9cbung3': 'Ã\x9cbersetzung3', 'Ã\x9cbung4': 'Ã\x9cbersetzung4', 'Ã\x9cbung5': 'Ã\x9cbersetzung5'}, {'Ã\x9cbung6': 'Ã\x9cbersetzung6', 'Ã\x9cbung7': 'Ã\x9cbersetzung7', 'Ã\x9cbung8': 'Ã\x9cbersetzung8', 'Ã\x9cbung9': 'Ã\x9cbersetzung9', 'Ã\x9cbung10': 'Ã\x9cbersetzung10'}])

However, I expected the output to look the same as it does with a plain print call, for example print("Übung"). What am I doing wrong?

python-3.x

lxml

iso-8859-1

2 Answers

Stack Overflow user

Accepted answer

Posted 2019-06-17 07:22:28

lxml can work with binary files. Try changing

with codecs.open("A:/test/test.txt", 'r', 
                 encoding='iso-8859-1') as xmlfile:

to

with codecs.open("A:/test/test.txt", 'rb', 
                 encoding='iso-8859-1') as xmlfile:
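
The idea of handing lxml raw bytes can be sketched as follows. This is only an illustration under stated assumptions, not the answerer's exact code: it uses the built-in open() in binary mode instead of codecs.open, because as far as I know codecs.open still decodes the stream whenever an encoding is given, whatever the mode string says. The sample document, the temporary file, and the helper name read_tag_text are all hypothetical.

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample document; it is written to disk as real ISO-8859-1 bytes.
SAMPLE = (
    '<?xml version="1.0" encoding="iso-8859-1"?>\n'
    "<DICTIONARY><Tag1>Übung1\tÜbersetzung1</Tag1></DICTIONARY>"
)


def read_tag_text(path):
    """Parse in binary mode so lxml decodes the bytes per the XML declaration."""
    with open(path, "rb") as xmlfile:
        tree = etree.parse(xmlfile)
    return {child.tag: child.text for child in tree.getroot()}


# Write the sample to a temporary file (assumption: any writable temp dir).
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(SAMPLE.encode("iso-8859-1"))
    path = tmp.name

try:
    texts = read_tag_text(path)
finally:
    os.remove(path)

print(texts)
```

This only helps when the bytes on disk really match the encoding named in the XML declaration; if they do not, lxml will decode them incorrectly in the same way.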

Votes: 0

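Stack Overflow user

Posted 2019-06-16 11:26:23

OK, I did not find a proper solution, but after converting everything to UTF-8 I have no problems with the later steps, such as comparing words containing umlauts in dictionaries and other strings.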
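
One possible reading of why the UTF-8 conversion helped: the garbled key 'Ã\x9cbung1' in the question's output is exactly what the UTF-8 bytes of "Ü" (0xC3 0x9C) look like when decoded as ISO-8859-1. That suggests, though the thread does not confirm it, that the file on disk was already UTF-8 while its declaration said iso-8859-1. A minimal sketch of checking and undoing that mix-up on a string; the helper name repair_mojibake is hypothetical:

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged if it does not apply."""
    try:
        # Recover the original bytes, then decode them as the UTF-8 they really were.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8: leave it alone.
        return text


garbled = "Ã\x9cbung1"  # key taken from the question's output
print(repair_mojibake(garbled))
```

Repairing strings after the fact is a workaround. Making the declaration and the actual file encoding agree, as this answer did by moving everything to UTF-8, avoids the problem at the source.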

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/56582053
