Q: Python scraper not returning the full HTML code on some subdomains

Stack Overflow user

Asked on 2022-05-28 17:47:00

Answers 1, Views 47, Followers 0, Votes 2

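I'm cobbling together a Walmart review scraper. It currently scrapes the HTML from most Walmart pages without a problem. When I try to scrape a page of reviews, it only returns a small portion of the page's code, mainly the text of the reviews and a few stray tags. Does anyone know what the problem could be?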

import requests
headers = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
    'Accept-Language': 'en-us',
    'Referer': 'https://www.walmart.com/',
    'sec-ch-ua-platform': 'Windows',
    }
cookie_jar = {
    '_pxvid': '35ed81e0-cb1a-11ec-aad0-504d5a625548',
}
product_num = input('Enter Product Number: ')
url2 = ('https://www.walmart.com/reviews/product/'+str(product_num))
r = requests.get(url2, headers=headers, cookies=cookie_jar, timeout=5)
print(r.text)

web-scraping

python

1 Answer

Stack Overflow user

Posted on 2022-06-02 09:21:03

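As larsks already commented, some of the content is loaded dynamically, for example only once you scroll down far enough. requests only fetches the HTML the server sends and BeautifulSoup only parses it; neither runs the page's JavaScript, so neither loads the whole page. You can work around this with Selenium.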
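One rough way to check whether a response is mostly a script-driven shell is to compare the number of `<script>` tags with the amount of text outside them. This is a minimal sketch using only the standard library; the `summarize_html` helper and the sample string are hypothetical and not part of the question's code.

```python
from html.parser import HTMLParser


class _TextVsScript(HTMLParser):
    """Counts <script> tags and measures text outside <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.visible_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see on the page
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def summarize_html(html):
    """Return how many scripts and how much visible text the HTML contains."""
    parser = _TextVsScript()
    parser.feed(html)
    parser.close()
    return {"scripts": parser.script_count, "visible_chars": parser.visible_chars}


# Hypothetical sample: an empty container that a script fills in later
sample = "<html><body><div id='root'></div><script>render()</script></body></html>"
print(summarize_html(sample))
```

Calling `summarize_html(r.text)` on the response from the question would give a quick hint: many scripts and very little visible text suggest the reviews are rendered in the browser rather than sent in the initial HTML.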

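What Selenium does is open your URL in a script-controlled web browser, which lets you fill in forms and scroll down. Below is a code example of how to use Selenium together with BS4.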

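from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Search on Google for the driver and save it at the path below.
# The raw string (r"...") keeps the backslashes in the Windows path literal,
# and Selenium 4 takes the driver path through a Service object.
driver = webdriver.Firefox(service=Service(executable_path=r"C:\Program Files (x86)\geckodriver.exe"))
# For Chrome, import Service from selenium.webdriver.chrome.service instead and use:
# driver = webdriver.Chrome(service=Service(executable_path=r"C:\Program Files (x86)\chromedriver.exe"))

# Open the URL with the reviews
driver.get("https://www.example.com")
driver.maximize_window()

# Scroll down to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")

# Now you can scrape the page from your Selenium browser:
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a bs4 warning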

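This solution assumes that the reviews are loaded by scrolling down the page. Of course, you don't have to use BeautifulSoup to scrape the site; that is a matter of personal preference. Let me know if this helps.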
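If a single scroll is not enough because more reviews keep loading, one option is to repeat the scroll until the page height stops changing. The sketch below assumes that behaviour; `scroll_until_stable`, `pause`, and `max_rounds` are hypothetical names, and the function only relies on the driver's `execute_script` method.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was appended, so stop
        last_height = new_height
    return max_rounds  # gave up after max_rounds scrolls
```

After `driver.get(...)`, calling `scroll_until_stable(driver)` before reading `driver.page_source` should capture whatever the scrolling loaded. The fixed pause is a simplification; a slow connection may need a longer value.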

Votes 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/72417972
