Q: get_text() raises UnicodeEncodeError

Stack Overflow user

Asked 2012-04-22 05:30:19

Answers: 1, Views: 1.4K, Followers: 0, Votes: 0

I have the following HTML:

<div class="dialog">
<div class="title title-with-sort-row">
    <h2>Description</h2>
    <div class="dialog-search-sort-bar">
    </div>
</div>
<div class="content"><div style="margin-right: 20px; margin-left: 30px;">
    <span class="description2">
        With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community. 
        She is made available under a Creative Commons License that gives endless opportunities for further development. 
        This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
        The result is a figure that has very good bending and morphing behavior.
        <br />
    </span>
</div>
</div>

I need to find this div among several divs with class="dialog", and then extract the text inside the span with class="description2".

When I use this code:

description = soup.find(text = re.compile('Description'))
if description is not None:
    someEl = description.parent
    parent1 = someEl.parent
    parent2 = parent1.parent
    description = parent2.find('span', {'class' : 'description2'})
    print 'Description: ' + str(description)

I get:

<span class="description2">
    With Â“Antonia Polygon Â– StandardÂ”, you have a figure that is unique in the Poser community. 
    She is made available under a Creative Commons License that gives endless opportunities for further development. 
    This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
    The result is a figure that has very good bending and morphing behavior.
    <br/>
</span>

If I try to get just the text, without the HTML and the non-ASCII characters, using

description = description.get_text()

I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\x93'

How can I convert this HTML block to plain ASCII?

Tags: ascii, beautifulsoup, python, unicode

1 Answer

Stack Overflow user

Answered 2012-05-07 12:31:04

#!/usr/bin/env python
# -*- coding: utf-8 -*-

foo = u'With Â“Antonia Polygon Â– StandardÂ”, you have a figure that is unique in the Poser community.She is made available under a Creative Commons License that gives endless opportunities for further development. This figure was developed by a group of talented members of the Poser community in a thirty-month effort. The result is a figure that has very good bending and morphing behavior.'

print foo.encode('ascii', 'ignore')

There are three things to note.

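First is the 'ignore' argument to the encode method. It tells the method to drop any character that falls outside the chosen encoding (ASCII in this case, to be on the safe side).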

Second, we explicitly make foo a unicode string by prefixing the literal with u.

Third is the explicit file-encoding declaration: # -*- coding: utf-8 -*-.
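To make the first point concrete, here is a minimal sketch of how the error handlers passed to encode differ. It is written in Python 3 syntax rather than the Python 2 used above, the helper name to_ascii is made up for illustration, and the sample string is a shortened stand-in for foo that assumes the stray characters are U+00C2 followed by C1 control characters such as U+0093.

```python
# Python 3 sketch. The sample is a shortened stand-in for `foo`, assuming the
# stray characters are U+00C2 followed by C1 controls such as U+0093.
sample = "With \u00c2\u0093Antonia Polygon \u00c2\u0096 Standard\u00c2\u0094"


def to_ascii(value, errors):
    """Encode to ASCII with the given error handler, then return a str."""
    return value.encode("ascii", errors).decode("ascii")


# 'strict' is the default handler and raises on the first non-ASCII character.
try:
    to_ascii(sample, "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

dropped = to_ascii(sample, "ignore")    # non-ASCII characters are removed
replaced = to_ascii(sample, "replace")  # each one becomes '?'

print(strict_failed)
print(dropped)
print(replaced)
```

Note that 'ignore' discards data silently: the quotes and the dash are simply gone, leaving a double space where the dash used to be.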

Also, be sure to read the good points raised in the comments on this answer. If the output is going to be used in HTML/XML, you can use xmlcharrefreplace in place of the ignore shown above.
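As a rough illustration of the xmlcharrefreplace handler, the sketch below (Python 3 syntax) encodes a hypothetical sample containing the intended curly quotes and en dash, not the garbled characters from the question. The round trip through html.unescape is only there to show that no information is lost.

```python
import html

# Hypothetical sample: the quoted product name with its intended punctuation.
text = "\u201cAntonia Polygon \u2013 Standard\u201d"

# Characters outside ASCII become numeric character references (&#NNNN;).
html_safe = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(html_safe)

# An HTML parser turns the references back into the original characters.
round_trip = html.unescape(html_safe)
print(round_trip == text)
```

Unlike 'ignore', the result is pure ASCII but still renders with the original punctuation once a browser or XML parser reads it.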

Votes: 2

Original content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/10265493
