I have the following HTML:
<div class="dialog">
<div class="title title-with-sort-row">
<h2>Description</h2>
<div class="dialog-search-sort-bar">
</div>
</div>
<div class="content"><div style="margin-right: 20px; margin-left: 30px;">
<span class="description2">
With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community.
She is made available under a Creative Commons License that gives endless opportunities for further development.
This figure was developed by a group of talented members of the Poser community in a thirty-month effort.
The result is a figure that has very good bending and morphing behavior.
<br />
</span>
</div>
</div>
I need to find this div among several divs with class="dialog", then extract the text inside the span with class="description2".
When I use this code:
description = soup.find(text = re.compile('Description'))
if description != None:
    someEl = description.parent
    parent1 = someEl.parent
    parent2 = parent1.parent
    description = parent2.find('span', {'class' : 'description2'})
    print 'Description: ' + str(description)
I get:
<span class="description2">
With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community.
She is made available under a Creative Commons License that gives endless opportunities for further development.
This figure was developed by a group of talented members of the Poser community in a thirty-month effort.
The result is a figure that has very good bending and morphing behavior.
<br/>
</span>
If I try to get just the text, without the HTML and the non-ASCII characters, using
description = description.get_text()
I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\x93'
How can I convert this block of HTML to plain ASCII?
Answer posted 2012-05-07 12:31:04
#!/usr/bin/env python
# -*- coding: utf-8 -*-
foo = u'With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community.She is made available under a Creative Commons License that gives endless opportunities for further development. This figure was developed by a group of talented members of the Poser community in a thirty-month effort. The result is a figure that has very good bending and morphing behavior.'
print foo.encode('ascii', 'ignore')
There are three things to note.
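First is the 'ignore' argument to the encode method. It tells the method to drop any character that falls outside the chosen encoding's range (ASCII in this case, to be safe).
Second, we explicitly make foo a unicode string by prefixing the literal with u.
Third is the explicit file encoding directive: # -*- coding: utf-8 -*-. Without it, Python 2 rejects the non-ASCII characters in the source file.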
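To see what the first point does in practice, here is a minimal sketch. It is not part of the original answer: the names sample, strict_failed and cleaned are illustrative, the text is a shortened version of the description above, and print is called as a function so the snippet runs on Python 3. It encodes the sample with the default strict handler, which raises the same kind of UnicodeEncodeError as in the question, and then with 'ignore'.

```python
# Shortened sample with curly quotes and an en dash, written as escapes so
# the snippet does not depend on the source file's encoding.
sample = u'\u201cAntonia Polygon \u2013 Standard\u201d'

try:
    # The default error handler is 'strict': any non-ASCII character raises.
    sample.encode('ascii')
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print('strict encoding failed: %s' % exc)

# 'ignore' drops the offending characters instead of raising.
cleaned = sample.encode('ascii', 'ignore').decode('ascii')
print(cleaned)
```

Note that dropping the en dash leaves two adjacent spaces behind, so you may want to collapse whitespace afterwards.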
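Also, if you didn't read Daenyth's very good points in the comments on this answer, you are a silly goose. If the output is going to be used in HTML/XML, you can pass 'xmlcharrefreplace' instead of the 'ignore' mentioned above, so the characters are escaped rather than dropped.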
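As a rough illustration of the difference between the two handlers (again a sketch rather than the answerer's code, with text, ignored and escaped as made-up names and a shortened sample string):

```python
# Shortened sample: curly quotes (U+201C, U+201D) and an en dash (U+2013).
text = u'With \u201cAntonia Polygon \u2013 Standard\u201d'

# 'ignore' silently drops every character ASCII cannot represent.
ignored = text.encode('ascii', 'ignore').decode('ascii')

# 'xmlcharrefreplace' swaps each such character for a numeric character
# reference, which an HTML or XML consumer can turn back into the original.
escaped = text.encode('ascii', 'xmlcharrefreplace').decode('ascii')

print(ignored)
print(escaped)
```

With 'xmlcharrefreplace' the result is still pure ASCII, but no information is lost when it is rendered as HTML or XML.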
https://stackoverflow.com/questions/10265493