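I'm new to coding in general, but for my first project I'm trying to build a monitor that watches a Shopify site for product changes.
My approach has been to take publicly shared code from the web and work backwards to understand it, so I ended up with the code below. It is part of a larger class and appears to fetch products.json by looping over pages.
But when I load https://www.hanon-shop.com/collections/all/products.json and then print the item list built by the code below, the first few products are different. Why would that be?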

    def scrape_site(self):
        """
        Scrapes the specified Shopify site and adds items to array
        :return: None
        """
        self.items = []
        s = rq.Session()
        page = 1
        while page > 0:
            try:
                html = s.get(self.url + '?page=' + str(page) + '&limit=250', headers=self.headers, proxies=self.proxy, verify=False, timeout=20)
                output = json.loads(html.text)['products']
                if output == []:
                    page = 0
                else:
                    for product in output:
                        product_item = [{'title': product['title'], 'image': product['images'][0]['src'], 'handle': product['handle'], 'variants':product['variants']}]
                        self.items.append(product_item)
                    logging.info(msg='Successfully scraped site')
                    page += 1
            except Exception as e:
                logging.error(e)
                page = 0
            time.sleep(0.5)
        s.close()

Posted on 2021-01-08 00:34:27
Requests accepts a dictionary of query parameters, and the response object has a json method, so this can be written more cleanly.

    import logging
    import time

    import requests

    def scrape_site(self):
        self.items = []
        page = 1
        with requests.Session() as s:
            while True:
                # Let Requests build the query string from a dict.
                params = {
                    'page': page,
                    'limit': 250
                }
                try:
                    r = s.get(self.url, params=params, headers=self.headers, proxies=self.proxy, verify=False, timeout=20)
                    r.raise_for_status()
                    # The decoded JSON is a dict and is never falsy, so test
                    # the 'products' list itself, as the question's code does.
                    products = r.json().get('products')
                    if not products:
                        break
                    for product in products:
                        # Guard against products that have no images.
                        images = product.get('images') or []
                        product_item = {
                            'title': product['title'],
                            'image': images[0]['src'] if images else None,
                            'handle': product['handle'],
                            'variants': product['variants']
                        }
                        self.items.append(product_item)
                    logging.info(f'Successfully scraped page {page}')
                    page += 1
                    time.sleep(1)
                except Exception as e:
                    logging.error(e)
                    break
        return self.items

https://stackoverflow.com/questions/65614589
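
The stopping rule of the paging loop can be checked without touching the network by separating it from the HTTP call. The following is only a sketch under stated assumptions: `collect_products`, `fetch_page`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names that do not appear in the original code, and the fake pages merely mimic the `{'products': [...]}` shape that the code above reads.

```python
import time

def collect_products(fetch_page, delay=0.0, max_pages=1000):
    """Collect products page by page until a page has no products.

    fetch_page is a hypothetical callable: it takes a page number and
    returns the decoded JSON dict for that page.
    """
    items = []
    for page in range(1, max_pages + 1):
        products = fetch_page(page).get('products') or []
        if not products:
            # An empty 'products' list marks the end of the catalogue.
            break
        for product in products:
            images = product.get('images') or []
            items.append({
                'title': product['title'],
                # Some products may have no images; store None for those.
                'image': images[0]['src'] if images else None,
                'handle': product['handle'],
                'variants': product.get('variants', []),
            })
        if delay:
            time.sleep(delay)
    return items

# Made-up stand-in data for the real endpoint, used only for illustration.
FAKE_PAGES = {
    1: {'products': [{'title': 'A', 'handle': 'a',
                      'images': [{'src': 'a.jpg'}], 'variants': []}]},
    2: {'products': [{'title': 'B', 'handle': 'b',
                      'images': [], 'variants': []}]},
}

def fake_fetch(page):
    # Any page beyond the fake data behaves like an empty result page.
    return FAKE_PAGES.get(page, {'products': []})

items = collect_products(fake_fetch)
print([item['handle'] for item in items])
```

With a real session, `fetch_page` could be a small function that calls `s.get(self.url, params={'page': page, 'limit': 250}, ...)` and returns `r.json()`, which keeps the request details in one place and the paging logic in another.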