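I'm learning how to scrape data from websites, but I'm stuck on this one. For privacy reasons I can't post the link here, but I'll try to explain.
Rating of the first hotel: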
<div class = "right">
<div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
3.5
</div>
Rating of the second hotel:
<div class = "right">
<div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
3.9
</div>
Rating of the third hotel:
<div class = "right">
<div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
4.2
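</div>
There are more than 100 hotels like this, and each rating div has a different level class, so I can't use XPath, or perhaps I just don't know it well enough.
I want to collect all the ratings, i.e. "3.5", "3.9", "4.2", but the problem is that every rating has a different level and a different ID.
Please bear with me, I'm just a beginner and I want to learn. Can anyone tell me how to scrape the hotel ratings? An example would be great.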
Posted on 2014-07-25 18:30:48
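You should use an HTML parser. There are several options, but BeautifulSoup is one of the easiest to use and understand. Here is an example that gets the text of every div element that has the rating-div class: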
from bs4 import BeautifulSoup
data = """
<div>
<div class = "right">
<div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
3.5
</div>
</div>
<div class = "right">
<div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
3.9
</div>
</div>
<div class = "right">
<div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
4.2
</div>
</div>
</div>
"""
soup = BeautifulSoup(data)
print [r.get_text(strip=True) for r in soup.find_all('div', attrs={'class': 'rating-div'})]
Prints:
[u'3.5', u'3.9', u'4.2']
Posted on 2014-07-25 18:14:19
Use the lxml library.
The code below returns a list of all the divs that contain ratings.
import urllib2
from lxml import etree
html = urllib2.urlopen(url)
html_text = etree.HTML(html.read())
rating_list = html_text.xpath('//*[@class="right"]/div')
#rating_lst = html_text.xpath('//*[@class="right"]') # choose accordingly, I dont have full source-code so commented out
for rate in rating_list:
    print rate.xpath('text()')
Code for the given sample data:
import urllib2
from lxml import etree
data = """
<div>
<div class = "right">
<div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
3.5
</div>
</div>
<div class = "right">
<div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
3.9
</div>
</div>
<div class = "right">
<div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
4.2
</div>
</div>
</div>
"""
# html = urllib2.urlopen(url) #use these two lines if getting source from a url
# html_text = etree.HTML(html.read())
html_text = etree.HTML(data)
rating_list = html_text.xpath('//*[@class="right"]/div')
for rate in rating_list:
    print rate.xpath('text()')[0].strip('\n\t ')
https://stackoverflow.com/questions/24961962
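The answers above use Python 2 syntax (the `print` statement, `urllib2`). Below is a minimal Python 3 sketch of the lxml approach, under the assumption that every rating div shares the `rating-div` class, as in the question's sample. Instead of relying on the parent `right` div, the XPath matches `rating-div` as a whole class token, so the varying `rating-for-<id>` and `level-<n>` classes do not matter. The names `SAMPLE_HTML`, `RATING_XPATH` and `extract_ratings` are hypothetical and chosen only for illustration.

```python
from lxml import etree

# Sample markup taken from the question. Each rating div carries a different
# rating-for-<id> and level-<n> class, but all of them share "rating-div".
SAMPLE_HTML = """
<div>
  <div class="right">
    <div data-res-id="305281" class="tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
      3.5
    </div>
  </div>
  <div class="right">
    <div data-res-id="8913" class="tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
      3.9
    </div>
  </div>
  <div class="right">
    <div data-res-id="4959" class="tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
      4.2
    </div>
  </div>
</div>
"""

# Pad the class attribute with spaces and look for " rating-div " so that only
# the exact class token matches (a class like "rating-div-large" would not).
RATING_XPATH = (
    '//div[contains(concat(" ", normalize-space(@class), " "), " rating-div ")]'
)


def extract_ratings(html_source):
    """Return a list of (data-res-id, rating) pairs found in html_source."""
    tree = etree.HTML(html_source)
    ratings = []
    for div in tree.xpath(RATING_XPATH):
        # Join all text inside the div and drop the surrounding whitespace.
        text = "".join(div.itertext()).strip()
        ratings.append((div.get("data-res-id"), float(text)))
    return ratings


if __name__ == "__main__":
    for res_id, rating in extract_ratings(SAMPLE_HTML):
        print(res_id, rating)
```

To run this against a live page, the HTML string would have to be downloaded first (for example with `urllib.request.urlopen(url).read()`, where `url` is the page address that the question does not disclose) and passed to `extract_ratings`. Converting to `float` assumes every rating div contains a plain number. A page that shows placeholders for unrated hotels would need extra handling.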