Unable to parse ISO-8859-15 encoded XML with bs4

Stack Overflow user

Asked 2019-07-09 05:41:00

2 answers, 228 views, 0 followers, 0 votes

I have the following XML document, saved with Notepad++ in ISO-8859-15 encoding:

<?xml version="1.0" encoding="ISO-8859-15"?>
<someTag>
</someTag>

I tried to parse this file with bs4, but somehow (even though I specify the encoding everywhere I can think of) I get an empty result:

from bs4 import BeautifulSoup

filepath = 'iso-8859-15_example.xml'
with open(filepath, encoding="iso-8859-15") as f:
    soup = BeautifulSoup(f, 'xml', from_encoding="iso-8859-15")
print(soup)
# --> "<?xml version="1.0" encoding="utf-8"?>", otherwise empty

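Removing the encoding hints from the Python code does not help. Strangely, though, what does work is deleting the first line of the XML file, i.e. the <?xml ... ?> statement (called the "prolog", I think).

What am I doing wrong here? I assumed the prolog would help bs4 "do the right thing" and pick the correct encoding. Are there any options other than removing the prolog or fiddling with the encoding of the XML file?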

xml

beautifulsoup

character-encoding

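2 Answers

Stack Overflow user

Accepted answer

Posted on 2019-07-09 15:14:41

Combining Andrej's answer with the answer given in the duplicate question, I found that opening the file in raw (binary) mode in the open call solved my problem: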

from bs4 import BeautifulSoup
from bs4.diagnose import diagnose
with open('iso-8859-15_example.xml', 'rb') as f:
    diagnose(f)

This produces the following output:

Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)]
I noticed that html5lib is not installed. Installing it may help.
Found lxml version 4.3.4.0
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<sometag>
</sometag>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<html>
 <body>
  <sometag>
  </sometag>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<someTag>
</someTag>
--------------------------------------------------------------------------------

This shows that lxml works fine in XML mode.
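Applied to the code from the question, the fix might look like the minimal sketch below. It assumes bs4 and lxml are installed. To run on its own it first writes a small sample file; the sample content (including the euro sign) and the temporary directory are assumptions made for illustration and are not part of the original post.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample document. The euro sign ("\u20ac") is only there to
# make the encoding matter: it is the single byte 0xA4 in ISO-8859-15.
xml_text = (
    '<?xml version="1.0" encoding="ISO-8859-15"?>\n'
    '<someTag>\u20ac</someTag>\n'
)

with tempfile.TemporaryDirectory() as tmpdir:
    filepath = os.path.join(tmpdir, "iso-8859-15_example.xml")
    with open(filepath, "wb") as f:
        f.write(xml_text.encode("iso-8859-15"))

    # Open in binary mode and pass no encoding hints: the parser receives
    # the raw bytes and can use the encoding declared in the prolog.
    with open(filepath, "rb") as f:
        soup = BeautifulSoup(f.read(), "xml")

tag = soup.find("someTag")
print(tag)
```

The design point is that the file is never decoded by Python before it reaches the parser, so the prolog and the actual bytes still agree with each other.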

Votes: 1

Stack Overflow user

Posted on 2019-07-09 13:34:20

In this situation I would suggest running BeautifulSoup's diagnose() function:

from bs4 import BeautifulSoup

from bs4.diagnose import diagnose

with open('iso-8859-15_example.xml', encoding="iso-8859-15") as f:
    diagnose(f.read())

On my machine this prints:

Diagnostic running on Beautiful Soup 4.7.1
Python version 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]
Found lxml version 4.3.3.0
Found html5lib version 1.0.1

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<sometag>
</sometag>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<!--?xml version="1.0" encoding="ISO-8859-15"?-->
<html>
 <head>
 </head>
 <body>
  <sometag>
  </sometag>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" encoding="ISO-8859-15"?>
<html>
 <body>
  <sometag>
  </sometag>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>

--------------------------------------------------------------------------------

In this case I would go with html.parser, because it does the right thing.

So when you do this:

soup = BeautifulSoup(f.read(), 'html.parser')
print(soup)

it prints:

<?xml version="1.0" encoding="ISO-8859-15"?>
<sometag>
</sometag>
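One caveat is visible in the diagnostic output above: html.parser treats the markup as HTML and lowercases tag names (someTag becomes sometag). The following is a minimal sketch of what that means for lookups, assuming bs4 is installed and using an inline string instead of the file. Recent bs4 versions may also emit a warning when XML is parsed with an HTML parser.

```python
from bs4 import BeautifulSoup

# Same document as in the question, given inline as text for illustration.
markup = (
    '<?xml version="1.0" encoding="ISO-8859-15"?>\n'
    '<someTag>\n'
    '</someTag>\n'
)

soup = BeautifulSoup(markup, "html.parser")

# Tag names are matched case-sensitively, and html.parser has lowercased
# them, so only the lowercase spelling finds the element.
lower = soup.find("sometag")
camel = soup.find("someTag")
print(lower is not None, camel is None)
```

If the original capitalization of tag names matters, the XML parser shown in the accepted answer preserves it.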

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/56942892
