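All the images downloaded by my image scraper have the same file size, 130 KB, and they are corrupted and cannot be opened in an image viewer.
I really don't know what the problem is.
Could anyone give me some advice on this?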
import requests
import parsel
import os
import time

url = 'https://movie-screencaps.com/movie-directory/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
response = requests.get(url, headers=headers)
selector = parsel.Selector(response.text)
movie_list = selector.xpath('//div[@class="tagindex"]/ul/li')
for li in movie_list:
    movie_name = li.xpath('.//a/text()').get().strip()
    movie_url = li.xpath('.//a/@href').get()
    print(movie_name, movie_url)
    # dir = f'download/{movie_name}'
    dir = f'{movie_name}'
    if not os.path.exists(dir):
        os.makedirs(dir)
    page_response = requests.get(movie_url, headers=headers)
    page_selector = parsel.Selector(page_response.text)
    page_text = page_selector.xpath('//div[@class="wp-pagenavi"]/text()').get()
    last_page = int(page_text.split(' ')[-1])
    for page in range(1, last_page + 1):
        page_url = f'{movie_url}/page/{page}'
        print(f'===== Downloading from page {page} =====')
        image_response = requests.get(url=page_url, headers=headers)
        image_selector = parsel.Selector(image_response.text)
        images_url_list = image_selector.xpath('//div[@align="center"]/a/@href').getall()
        for image_url in images_url_list:
            image_data = requests.get(url=page_url, headers=headers).content
            # print(image_data)
            file_name = image_url.split('/')[-1]
            with open(f'{dir}/{file_name}', mode='wb') as f:
                f.write(image_data)
            print(file_name)
        time.sleep(2)

Posted on 2022-04-15 08:57:55
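The problem is a typo: for each image_url you are requesting page_url instead of image_url. Every file you save is therefore the HTML of the listing page rather than an image, which would explain why the files are all about the same size and will not open in an image viewer.

...
for image_url in images_url_list:
    image_data = requests.get(url=page_url, headers=headers).content
    file_name = image_url.split('/')[-1]
...

should be:

...
for image_url in images_url_list:
    # The typo was here: request image_url, not page_url
    image_data = requests.get(url=image_url, headers=headers).content
    file_name = image_url.split('/')[-1]
...

Posted on 2022-04-15 08:57:36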
I tested your code; you just made one small mistake.

Change:

image_data = requests.get(url=page_url, headers=headers).content

to:

image_data = requests.get(url=image_url, headers=headers).content

Tested, and it runs fine :)
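As a possible safeguard against this kind of mix-up, you could inspect the first few bytes of each response before writing it to disk. The sketch below is only an illustration: the helper name looks_like_image and the IMAGE_SIGNATURES table are my own assumptions, not part of requests or parsel, and it only recognises JPEG, PNG, and GIF files.

```python
# Hypothetical helper, not part of requests or parsel.
# It compares the start of the payload with well-known file signatures.
IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"GIF87a",             # GIF (1987 revision)
    b"GIF89a",             # GIF (1989 revision)
)


def looks_like_image(data: bytes) -> bool:
    """Return True if data starts with a known JPEG, PNG, or GIF signature."""
    # bytes.startswith accepts a tuple of candidate prefixes.
    return data.startswith(IMAGE_SIGNATURES)


# Made-up sample payloads: an HTML page versus the start of a JPEG file.
html_bytes = b"<!DOCTYPE html><html><body>listing page</body></html>"
jpeg_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16

print(looks_like_image(html_bytes))  # False
print(looks_like_image(jpeg_bytes))  # True
```

In the download loop you could call looks_like_image(image_data) before opening the output file, and print the offending URL instead of saving when it returns False. With the original typo, that check would have failed on the very first file.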
https://stackoverflow.com/questions/71881707