Question: Scraping images from a JavaScript-rendered web page

Stack Overflow user

Asked 2019-01-20 15:37:01

Answers: 1 · Views: 909 · Followers: 0 · Score: 2

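I am trying to scrape the images from a web page. The page is rendered with JavaScript, and the image links in the source are incomplete. This is the part of the source that contains the images: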

<script language="javascript" type="text/javascript">
</script>
<div id="ImagesSection" class="ImagesSection">
<div id='HybridImageViewPrimaryImageDiv'>
<a href='/ItemImages/000450/18190933_1_lg.jpeg'  class="MagicZoom" data-options="  zoomMode:off; cssClass: dark-bg; zoomOn: click"  title='Multi-Faced Doll By Cark Bergner.' id="xxxyyyzzz"     ><img id='fullimage' src='/ItemImages/000450/18190933_1_med.jpeg'  alt='Multi-Faced Doll By Cark Bergner.' /></a>
</div>
<div style="margin-top:15px;width:300px;"> <button class="cfg-btn" onclick="MagicZoom.prev('xxxyyyzzz');return false;">Prev</button> <button class="cfg-btn" onclick="MagicZoom.next('xxxyyyzzz') ;return false;">Next</button>
</div><div style="margin-top:15px;" width="350px" >
 <a data-zoom-id="xxxyyyzzz" href="/ItemImages/000450/18190933_1_lg.jpeg"    data-image="/ItemImages/000450/18190933_1_med.jpeg"       >  <img    src="/ItemImages/000450/18190933_1_sm.jpeg"  height="60px"   />  </a>   
 <a data-zoom-id="xxxyyyzzz" href="/ItemImages/000450/18190933_2_lg.jpeg"    data-image="/ItemImages/000450/18190933_2_med.jpeg"       >  <img    src="/ItemImages/000450/18190933_2_sm.jpeg"  height="60px"   />  </a>   
 <a data-zoom-id="xxxyyyzzz" href="/ItemImages/000450/18190933_3_lg.jpeg"    data-image="/ItemImages/000450/18190933_3_med.jpeg"       >  <img    src="/ItemImages/000450/18190933_3_sm.jpeg"  height="60px"   />  </a>   
 <a data-zoom-id="xxxyyyzzz" href="/ItemImages/000450/18190933_4_lg.jpeg"    data-image="/ItemImages/000450/18190933_4_med.jpeg"       >  <img    src="/ItemImages/000450/18190933_4_sm.jpeg"  height="60px"   />  </a>   
 <a data-zoom-id="xxxyyyzzz" href="/ItemImages/000450/18190933_5_lg.jpeg"    data-image="/ItemImages/000450/18190933_5_med.jpeg"       >  <img    src="/ItemImages/000450/18190933_5_sm.jpeg"  height="60px"   />  </a>   
</div>
</div>

The images I want to extract are the following:

/ItemImages/000450/18190933_1_sm.jpeg
/ItemImages/000450/18190933_2_sm.jpeg
/ItemImages/000450/18190933_3_sm.jpeg
/ItemImages/000450/18190933_4_sm.jpeg
/ItemImages/000450/18190933_5_sm.jpeg

This is my code:

import os
import shutil
import time
import requests
from bs4 import BeautifulSoup as bSoup
from selenium import webdriver

url = "https://auctions.morphyauctions.com/French_Fashion_Doll_with_Unusual_Body_-LOT450029.aspx"

driver = webdriver.Chrome(executable_path="/mypath/")

driver.get(url)

iterations = 0
while iterations <10:
    html = driver.execute_script("return document.documentElement.outerHTML")
    sel_soup = bSoup(html, 'html.parser')
    print (sel_soup.findAll('img'))
    images = []
    for i in sel_soup.findAll('img'):
        src = i['src']
        images.append(src)
    print(images)
    current_path = os.getcwd()
    for img in images:
        try:
            file_name = os.path.basename(img)
            img_r = requests.get(img, stream=True)
            new_path = os.path.join(current_path, 'images', file_name)
            with open(new_path, 'wb') as output_file:
                shutil.copyfilobj(img_r.raw, output_file)
            del img_r
        except:
            pass
    iterations +=1
    time.sleep(5)

Running this code saves no images at all. Any help would be greatly appreciated.

Tags: javascript, python-3.x, selenium, beautifulsoup, scrape

1 Answer

Stack Overflow user

Answered 2019-01-20 23:23:08

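The HTML for the images is not rendered by JavaScript, so you don't need Selenium. Use Beautiful Soup with re.compile to match the a elements whose href starts with /ItemImages/.

Note that the page uses relative URLs, so you need to prepend the domain to each image URL.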

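import os
import re

import requests
from bs4 import BeautifulSoup

base_url = 'https://auctions.morphyauctions.com'
url = base_url + "/French_Fashion_Doll_with_Unusual_Body_-LOT450029.aspx"

html = requests.get(url).text
sel_soup = BeautifulSoup(html, 'html.parser')

# Collect the <a> elements whose href starts with /ItemImages/
images = []
for a in sel_soup.find_all('a', href=re.compile(r'^/ItemImages/')):
    ahref = base_url + a['href']  # the hrefs are relative, so prepend the domain
    images.append(ahref)
print(images)

# Make sure the output directory exists before writing into it
image_dir = os.path.join(os.getcwd(), 'images')
os.makedirs(image_dir, exist_ok=True)
for img in images:
    try:
        file_name = os.path.basename(img)
        img_r = requests.get(img)
        new_path = os.path.join(image_dir, file_name)
        with open(new_path, 'wb') as output_file:
            output_file.write(img_r.content)
    except Exception as ex:
        # Report the failure instead of silently skipping the image
        print(ex)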
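
The question asks for the `_sm.jpeg` thumbnails, while the code above collects the `href` values, which point at the `_lg.jpeg` files. Below is a minimal sketch of filtering and resolving the thumbnail paths instead, assuming the `src` values have already been collected from the `img` elements. The names `thumbnail_urls` and `THUMB_RE` are hypothetical and not part of the original answer, and `urljoin` from the standard library is swapped in for plain string concatenation.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://auctions.morphyauctions.com"

# Hypothetical pattern: relative paths under /ItemImages/ ending in _sm.jpeg
THUMB_RE = re.compile(r"^/ItemImages/.*_sm\.jpeg$")


def thumbnail_urls(srcs, base_url=BASE_URL):
    """Keep only the small thumbnails and resolve them against base_url."""
    return [urljoin(base_url, s) for s in srcs if s and THUMB_RE.match(s)]


# Sample src values taken from the HTML in the question
srcs = [
    "/ItemImages/000450/18190933_1_med.jpeg",
    "/ItemImages/000450/18190933_1_sm.jpeg",
    "/ItemImages/000450/18190933_2_sm.jpeg",
    None,  # an <img> without a src attribute
]
print(thumbnail_urls(srcs))
```

With Beautiful Soup, the input list could be built as `[img.get('src') for img in sel_soup.find_all('img')]`; using `.get` rather than `img['src']` avoids a `KeyError` on images that have no `src` attribute.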

Score: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/54274458
