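I am trying to write a simple Python scraper that saves all the reviews of a specific location on TripAdvisor.
The specific link I am using as an example is the one assigned to url in the code below.
Here is the code I am using, which should print the HTML of that page: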
from bs4 import BeautifulSoup
import requests
url = "https://www.tripadvisor.com/Attraction_Review-g319796-d5988326-Reviews-or50-Museo_de_Altamira-Santillana_del_Mar_Cantabria.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
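print(soup)
If I run this code in the console, it hangs on requests.get(url) for a long time without producing any output. With another URL (for example url = "https://stackoverflow.com/"), I immediately get the HTML, displayed correctly. Why does TripAdvisor not work? How can I get its HTML?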
Posted on 2022-04-20 09:22:18
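Adding a user-agent should solve your problem as a first step, because some sites serve different content depending on it or use it for bot/automation detection. Open DevTools in your browser and copy the user agent from one of your requests: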
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
Example
from bs4 import BeautifulSoup
import requests

url = "https://www.tripadvisor.com/Attraction_Review-g319796-d5988326-Reviews-or50-Museo_de_Altamira-Santillana_del_Mar_Cantabria.html"
# Send a browser-like User-Agent so the site answers the request
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(r.text, 'html.parser')

data = []
# Each review is an element marked data-automation="reviewCard"
for e in soup.select('#tab-data-qa-reviews-0 [data-automation="reviewCard"]'):
    data.append({
        'rating': e.select_one('svg[aria-label]')['aria-label'],
        'profilUrl': e.select_one('a[tabindex="0"]').get('href'),
        'content': e.select_one('div:has(>a[tabindex="0"]) + div + div').text
    })
data
Output
[{'rating': '5.0 of 5 bubbles',
'profilUrl': '/ShowUserReviews-g319796-d5988326-r620396152-Museo_de_Altamira-Santillana_del_Mar_Cantabria.html',
'content': "We were fortunate to get in without pre-booking.What a find. A UNESCO site in the middle of the countryside.The replication cave is so awesome and authentic, hard to believe it's not the real thing.The museum is beautifully curated, great for students, and anyone interested in archeology and the beginnings of human existence.Definitely worth visiting. We nearly missed out Read more"},
{'rating': '5.0 of 5 bubbles',
'profilUrl': '/ShowUserReviews-g319796-d5988326-r618358203-Museo_de_Altamira-Santillana_del_Mar_Cantabria.html',
'content': 'Beautiful site with great replica’s of the original cave, excellent exposition, poor film as an introduction however!The most urgent issue: long waiting because you need a slot to enter. This could be done 1000% better and in every decent museum it is done better! Staff probably civil servants with no great desire to make you enjoy the visit. Building urgently needs a revamp, no exposure at all!Read more'},...]
https://stackoverflow.com/questions/71937012
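The scraped values shown in the output are raw strings: the rating reads '5.0 of 5 bubbles' and the content ends with the text of the "Read more" link. If you want cleaner data, a minimal sketch like the following could post-process each entry. The helper names (parse_rating, strip_read_more, clean_review) are hypothetical, and the code assumes the two string formats seen in the output stay the same.

```python
import re


def parse_rating(label):
    """Return the score from a label such as '5.0 of 5 bubbles', or None."""
    match = re.match(r'\s*(\d+(?:\.\d+)?)\s+of\s+\d+', label)
    return float(match.group(1)) if match else None


def strip_read_more(text):
    """Remove a trailing 'Read more' left over from the expand link."""
    return re.sub(r'\s*Read more\s*$', '', text)


def clean_review(review):
    """Return a copy of one scraped entry with rating and content cleaned."""
    return {
        **review,
        'rating': parse_rating(review['rating']),
        'content': strip_read_more(review['content']),
    }
```

You could then apply it to the whole list with [clean_review(r) for r in data].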
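Regarding the hang described in the question: requests.get waits indefinitely by default when no timeout is given, so a server that never answers makes the script look frozen. As a sketch only, the request could be wrapped as below so that it fails quickly instead. The function name fetch_html and the 10-second timeout are assumptions made for illustration, not part of the original answer.

```python
import requests

# Browser-like User-Agent, as suggested in the answer
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def fetch_html(url, timeout=10):
    """Return the page HTML, or None if the request fails or times out."""
    try:
        # timeout limits how long requests waits to connect and to receive data
        r = requests.get(url, headers=HEADERS, timeout=timeout)
        # Turn 4xx/5xx responses into exceptions instead of parsing an error page
        r.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return r.text
```

If fetch_html(url) returns None, the printed message should tell you whether the site timed out or rejected the request, which is easier to debug than a silent hang.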