How do I extract the text from the <span class="arabic_sanad arabic"> and <span class="arabic_text_details arabic"> elements below?
<div class="arabic_hadith_full arabic"><span class="arabic_sanad arabic">حَدَّثَنَا أَبُو الْيَمَانِ، قَالَ أَخْبَرَنَا شُعَيْبٌ، قَالَ حَدَّثَنَا أَبُو الزِّنَادِ، عَنِ الأَعْرَجِ، عَنْ أَبِي هُرَيْرَةَ ـ رضى الله عنه ـ أَنَّ رَسُولَ اللَّهِ صلى الله عليه وسلم قَالَ </span>
<span class="arabic_text_details arabic">" فَوَالَّذِي نَفْسِي بِيَدِهِ لاَ يُؤْمِنُ أَحَدُكُمْ حَتَّى أَكُونَ أَحَبَّ إِلَيْهِ مِنْ وَالِدِهِ وَوَلَدِهِ "</span><span class="arabic_sanad arabic">.</span></div>

I tried the approach below, but it fails with the following error:
print name2
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-11: ordinal not in range(128)

Code:
from lxml import etree
from bs4 import BeautifulSoup

url = "http://www.sunnah.com/bukhari/8"
parser = etree.HTMLParser()
html = etree.parse(url, parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
soup = BeautifulSoup(result)
results = soup.findAll("div", {"class" : "actualHadithContainer"})
for result in results:
    ar = result.find("div", {"class" : "arabic_hadith_full arabic"})
    name2 = ar.get_text()
    print name2

Posted on 2013-12-11 11:31:14
Try converting the string to unicode before printing it:
ar = result.find("div", {"class" : "arabic_hadith_full arabic"}, text=True)  # only matches elements that contain text
name2 = unicode(ar.get_text(), encoding='utf-8')
print name2

Posted on 2013-12-11 11:42:30
As pointed out above, you have to convert the string to unicode.
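As a rough illustration of the encoding side of this: in bs4, get_text() already returns a Unicode string, so the UnicodeEncodeError most likely comes from print implicitly encoding that string with the ASCII codec. The sketch below is written for Python 3 (the thread itself uses Python 2) and shows the same failure and an explicit UTF-8 encode. The helper name write_utf8 and the sample word are assumptions made for illustration only.

```python
import io


def write_utf8(text, stream):
    """Encode text as UTF-8 and write the resulting bytes to a binary stream."""
    stream.write(text.encode("utf-8"))


# The Arabic word "qala", spelled with escapes so the source file stays ASCII.
arabic = "\u0642\u0627\u0644"

# Encoding with the ASCII codec fails, just like the error in the question.
try:
    arabic.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly as UTF-8 succeeds and produces two bytes per letter here.
buffer = io.BytesIO()
write_utf8(arabic, buffer)
```

In Python 2 the rough equivalent would be print name2.encode('utf-8'), assuming the terminal or output file expects UTF-8.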
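'ResultSet' object has no attribute 'get_text'
To prevent this error, check whether ar actually has a get_text method. Here is what happens: with the old code, the first node contains text, so you hit the encoding error there. Once you fix that, the for loop continues and reaches a node that has no text, and at that point there is no get_text method to call. Something like this should work: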
for result in results:
    ar = result.find("div", {"class" : "arabic_hadith_full arabic"})
    # find() returns None when nothing matches, so skip those containers
    if not hasattr(ar, "get_text"):
        continue
    name2 = ar.get_text()
    print u"{}".format(name2)

https://stackoverflow.com/questions/20517776
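The snippets above read the whole arabic_hadith_full div, whereas the question asks for the two span classes separately. A minimal sketch of that step follows, written for Python 3 and using the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. The names SpanTextExtractor, extract_spans and SAMPLE are hypothetical, the sample markup is a shortened excerpt of the HTML in the question, and the code assumes the target spans are not nested inside each other.

```python
from html.parser import HTMLParser


class SpanTextExtractor(HTMLParser):
    """Collect the text of <span> elements that carry one of the wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted_classes)
        self.stack = []     # one entry per open <span>: matched class or None
        self.results = []   # (class_name, text) pairs in document order
        self._buffer = []   # text pieces of the span currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        matched = next((c for c in classes if c in self.wanted), None)
        self.stack.append(matched)
        if matched:
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that sits directly inside a matched span.
        if self.stack and self.stack[-1]:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self.stack:
            return
        matched = self.stack.pop()
        if matched:
            self.results.append((matched, "".join(self._buffer).strip()))


# Shortened excerpt of the markup from the question (illustrative only).
SAMPLE = (
    '<div class="arabic_hadith_full arabic">'
    '<span class="arabic_sanad arabic">حَدَّثَنَا أَبُو الْيَمَانِ </span>'
    '<span class="arabic_text_details arabic">لاَ يُؤْمِنُ أَحَدُكُمْ</span>'
    '<span class="arabic_sanad arabic">.</span>'
    '</div>'
)


def extract_spans(html, wanted=("arabic_sanad", "arabic_text_details")):
    """Return (class_name, text) pairs for every matching span in html."""
    parser = SpanTextExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling extract_spans(SAMPLE) yields one pair per matching span, in document order. With bs4 the same idea would be a find_all on span with the class name, with the same None check as in the answer above.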