Q: Scraping the data-page attribute from a web page

Stack Overflow user

Asked 2021-12-19 18:08:19

Answers: 1 | Views: 58 | Followers: 0 | Votes: -1

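I want to scrape all the products from this website, which has no pagination buttons; products load automatically as you scroll. My script only scrapes the first 40 products. I noticed that the products seem to be loaded dynamically based on the data-page attribute of a div tag. I would like my script to keep changing the data-page value and load more products, but I don't know how to do that.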

from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

url = 'https://www.positivepromotions.com/custom-blankets/c/navpp_1001_114/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
result = requests.get(url, headers=headers, timeout=5000)  # note: requests timeouts are in seconds
data = result.content.decode()
soup1 = BeautifulSoup(data,'lxml')
# Get the category name from the page heading (first <h1>)
subcategory = soup1.find('h1').text.strip()

itemlist = []
for soup in soup1.find_all('div', class_='row cat-prod-list'):
    for x in range(1,4):
        # Match the div with id="categoryProducts" that carries a data-page attribute
        for pages in soup.select('div#categoryProducts[data-page]'):
            print(pages['data-page'])
            for productList in pages.find_all('div', class_='col-sm-4 col-md-3 cat-prod-container'):
              title = productList.find('a', class_='product-title').text.strip()
              price = productList.find('span', class_='cat-price').text.strip().split('-',1)[0]
              sku = productList.find('div', class_='grid-prod-sku').text.strip()
              #productlist = soup.find_all('div', class_='prod-img-wrap')
              links = productList.find('a', class_='cat-prod-img',href=True)['href']
              image = productList.find('img')['data-src'].split('?',1)[0]

              items = {
                      'Title': title,
                      'Price': price,
                      'Sku': sku,
                      'Category': subcategory,
                      'Link': links,
                      'Image': image
                  }
              itemlist.append(items)
              ##print('Saving : ',title)
              #time.sleep(1)
            
# print total products found
print(len(itemlist))

#df = pd.DataFrame(itemlist)
##print(df.head(5))
#df.to_csv(subcategory+'.csv')
###

web-scraping

beautifulsoup

python

1 Answer

Stack Overflow user

Answered 2021-12-19 18:23:16

I think the best approach is to use the Selenium library. It is easy to use, so don't worry about the implementation. You need to:

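1. Install it.
2. Create your Selenium driver (preferably the Chrome driver).
3. Navigate to the URL.
4. Get the data via XPath (covered in the Selenium docs).
5. Scroll to the last product so that the page loads the next batch automatically.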
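The driver, navigation, and XPath steps above might look roughly like the following sketch. It assumes Selenium 4 and a local Chrome install. The XPath expressions are guesses built from the class names in the question's code (`cat-prod-container`, `product-title`), and `make_driver`, `extract_titles`, and `main` are illustrative names, not part of Selenium.

```python
# By.XPATH in selenium.webdriver.common.by is the plain string "xpath".
XPATH = "xpath"

# Assumed XPaths, derived from the class names used in the question's code.
PRODUCT_XPATH = "//div[contains(@class, 'cat-prod-container')]"
TITLE_XPATH = ".//a[contains(@class, 'product-title')]"

URL = "https://www.positivepromotions.com/custom-blankets/c/navpp_1001_114/"


def make_driver():
    """Create a Chrome driver (needs `pip install selenium` and Chrome)."""
    # Imported lazily so the helper below can be used without a browser.
    from selenium import webdriver
    return webdriver.Chrome()


def extract_titles(driver):
    """Return the stripped title of every product currently in the DOM."""
    titles = []
    for card in driver.find_elements(XPATH, PRODUCT_XPATH):
        link = card.find_element(XPATH, TITLE_XPATH)
        titles.append(link.text.strip())
    return titles


def main():
    """Open the category page and print the titles that are loaded so far."""
    driver = make_driver()
    try:
        driver.get(URL)
        print(extract_titles(driver))
    finally:
        driver.quit()
```

Calling `main()` opens the page and prints only the products present before any scrolling, so on this site it would be expected to return the same first batch the `requests` script sees.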

For the last step, use this
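The scrolling step is commonly done by telling the browser to scroll to the bottom and repeating until the page height stops growing. The helper below is a minimal sketch of that idea, not necessarily what the answer pointed to. `scroll_to_load_all`, `pause`, and `max_rounds` are illustrative names, and the pause length is an assumption that may need tuning for this site.

```python
import time


def scroll_to_load_all(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom until the page height stops growing.

    `driver` is any object with Selenium's `execute_script` method.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom so the page requests the next batch of products.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new products time to load
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was added, assume everything is loaded
        last_height = new_height
    return rounds
```

After the helper returns, the fully loaded HTML is available as `driver.page_source` and can be passed to `BeautifulSoup` with the same selectors used in the question. The `max_rounds` cap is there so a page that never stops growing cannot loop forever.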

Votes: 0

Original page content provided by Stack Overflow; Chinese translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/70414081
