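My code only creates empty folders instead of downloading the images.
So I think it needs to be modified so that the images actually get downloaded.
I tried to fix it myself but couldn't figure out how.
Can anyone help me? Thanks!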
import requests
import parsel
import os
import time

for page in range(1, 310):  # Total 309 pages
    print(f'======= Scraping data from page {page} =======')
    url = f'https://www.bikeexif.com/page/{page}'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    response = requests.get(url, headers=headers)
    html_data = response.text
    selector = parsel.Selector(html_data)
    containers = selector.xpath('//div[@class="container"]/div/article[@class="smallhalf"]')
    for v in containers:
        old_title = v.xpath('.//div[2]/h2/a/text()').get()#.replace(':', ' -')
        if old_title is not None:
            title = old_title.replace(':', ' -')
        title_url = v.xpath('.//div[2]/h2/a/@href').get()
        print(title, title_url)
        if not os.path.exists('img\\' + title):
            os.mkdir('img\\' + title)
        response_image = requests.get(url=title_url, headers=headers).text
        selector_image = parsel.Selector(response_image)
        # Full Size Images
        images_url = selector_image.xpath('//div[@class="image-context"]/a[@class="download"]/@href').getall()
        for title_url in images_url:
            image_data = requests.get(url=title_url, headers=headers).content
            file_name = title_url.split('/')[-1]
            time.sleep(1)
            with open(f'img\\{title}\\' + file_name, mode='wb') as f:
                f.write(image_data)
            print('Download complete!!:', file_name)

Posted on 2022-03-04 23:23:57
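The page uses JavaScript to create the "download" links, but requests and parsel can't run JavaScript, and that is what causes the problem.
However, the page seems to use the same URLs to display the images on the page, so you can use //img/@src instead.
That creates another problem, though: the page uses JavaScript for "lazy loading" of images, so only the first img has a src. The other images keep their URL in data-src (normally JavaScript copies data-src to src when you scroll the page), so you also have to read data-src to download the remaining images.
You need something like this to get @src (for the first image) and @data-src (for the other images):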
images_url = selector_image.xpath('//div[@id="content"]//img/@src').getall() + \
             selector_image.xpath('//div[@id="content"]//img/@data-src').getall()

Full working code (with other small changes)
Because I use Linux, the string img\\{title} creates an incorrect path.
So I use os.path.join('img', title, filename) to create a correct path on Windows, Linux, and Mac.
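To make the path point concrete, here is a minimal sketch comparing the hard-coded backslash string with os.path.join. The title and filename values are made up for illustration and do not come from the site.

```python
import os

# Hypothetical values, for illustration only.
title = 'Example Title - Custom Build'
filename = 'example-1.jpg'

# Hard-coded backslashes: a directory separator only on Windows.
# On Linux/macOS a backslash is an ordinary character in a file name,
# so this names a single oddly named file instead of nested folders.
hardcoded = f'img\\{title}\\' + filename

# os.path.join inserts the separator of the current OS (os.sep).
portable = os.path.join('img', title, filename)

print(hardcoded)
print(portable)
```

On Windows both lines print the same path; on Linux and Mac only the second one points inside the img/<title> folder.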
import requests
import parsel
import os
import time

# you can define it once
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

for page in range(1, 310):  # Total 309 pages
    print(f'======= Scraping data from page {page} =======')
    url = f'https://www.bikeexif.com/page/{page}'

    response = requests.get(url, headers=headers)
    selector = parsel.Selector(response.text)

    containers = selector.xpath('//div[@class="container"]/div/article[@class="smallhalf"]')

    for v in containers:
        old_title = v.xpath('.//div[2]/h2/a/text()').get()

        # process the article only if it has a title
        if old_title is not None:
            title = old_title.replace(':', ' -')
            title_url = v.xpath('.//div[2]/h2/a/@href').get()
            print(title, title_url)

            # creates the folder only if it doesn't exist yet
            os.makedirs(os.path.join('img', title), exist_ok=True)

            response_article = requests.get(url=title_url, headers=headers)
            selector_article = parsel.Selector(response_article.text)

            # Full Size Images: @src for the first image, @data-src for the lazy-loaded ones
            images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \
                         selector_article.xpath('//div[@id="content"]//img/@data-src').getall()
            print('len(images_url):', len(images_url))

            for img_url in images_url:
                response_image = requests.get(url=img_url, headers=headers)
                filename = img_url.split('/')[-1]

                with open(os.path.join('img', title, filename), 'wb') as f:
                    f.write(response_image.content)

                print('Download complete!!:', filename)

Source: https://stackoverflow.com/questions/71355569