This code fetches a website and downloads every .jpg image on the page. It only supports sites whose <img> elements have a src attribute containing a .jpg link.
import random
import urllib.request
import requests
from bs4 import BeautifulSoup

def Download_Image_from_Web(url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    raw_text = r'links.txt'
    with open(raw_text, 'w') as fw:
        for link in soup.findAll('img'):
            image_links = link.get('src')
            if '.jpg' in image_links:
                for i in image_links.split("\\n"):
                    fw.write(i + '\n')
    num_lines = sum(1 for line in open('links.txt'))
    if num_lines == 0:
        print("There is 0 photo in this web page.")
    elif num_lines == 1:
        print("There is", num_lines, "photo in this web page:")
    else:
        print("There are", num_lines, "photos in this web page:")
    k = 0
    while k <= (num_lines - 1):
        name = random.randrange(1, 1000)
        fullName = str(name) + ".jpg"
        with open('links.txt', 'r') as f:
            lines = f.readlines()[k]
            urllib.request.urlretrieve(lines, fullName)
            print(lines + fullName + '\n')
        k += 1
Download_Image_from_Web("https://pixabay.com")

Posted on 2017-04-29 16:25:55
This part is extremely inefficient:
k = 0
while k <= (num_lines - 1):
    name = random.randrange(1, 1000)
    fullName = str(name) + ".jpg"
    with open('links.txt', 'r') as f:
        lines = f.readlines()[k]
        urllib.request.urlretrieve(lines, fullName)
        print(lines + fullName + '\n')
    k += 1
It re-reads the same file num_lines times, once per download!
By the way, do you really need to write the list of URLs to a file at all? Why not just keep them in a list? Even if you do want the URLs in a file, you can keep them in an in-memory list and only write the file, never reading it back.
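The in-memory approach could be sketched like this; collect_jpg_links is a hypothetical helper name (not from the original code), and it takes the page HTML directly so parsing is separated from the network fetch:

```python
from bs4 import BeautifulSoup

def collect_jpg_links(html):
    # Parse the page once and keep the .jpg links in a plain list,
    # so no intermediate links.txt file is needed.
    soup = BeautifulSoup(html, "html.parser")
    return [img.get('src')
            for img in soup.find_all('img')
            if img.get('src') and '.jpg' in img.get('src')]
```

The returned list can then be iterated directly for downloading, or written out to links.txt in a single pass if a file is still wanted.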
Rather than putting all the code into one function that performs several tasks, organize your program into smaller functions, each with a single responsibility.
Python has a well-defined set of coding conventions in PEP 8, many of which are violated here. I suggest reading that document and following it as much as possible.
Posted on 2017-04-29 18:20:28
How about the following?
import random
import requests
from bs4 import BeautifulSoup

# got from http://stackoverflow.com/a/16696317
def download_file(url):
    local_filename = url.split('/')[-1]
    print("Downloading {} ---> {}".format(url, local_filename))
    # NOTE the stream=True parameter
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    return local_filename

def Download_Image_from_Web(url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for link in soup.findAll('img'):
        image_links = link.get('src')
        if not image_links.startswith('http'):
            image_links = url + '/' + image_links
        download_file(image_links)
Download_Image_from_Web("https://pixabay.com")

https://codereview.stackexchange.com/questions/162123
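One caveat with the answer's URL handling: the manual url + '/' + image_links concatenation mangles protocol-relative links such as //cdn.example.com/a.jpg. The standard library's urljoin resolves these cases correctly; absolutize is a hypothetical helper name used only for this sketch:

```python
from urllib.parse import urljoin

def absolutize(page_url, src):
    # urljoin resolves relative paths, root-relative paths, and
    # protocol-relative links against the page URL.
    return urljoin(page_url, src)
```

For example, absolutize("https://pixabay.com/page", "/img/a.jpg") yields "https://pixabay.com/img/a.jpg" rather than a doubled-up path.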