文章/答案/技术大牛

发布

社区首页 >问答首页 >我如何从谷歌图像搜索中抓取所有图像？

问我如何从谷歌图像搜索中抓取所有图像？
EN

Stack Overflow用户

提问于 2022-01-09 04:35:22

回答 3查看 785关注 0票数 -1

我试图学习神经网络的可视化，并希望以鸡为例。我想我可以从google图像上抓取所有鸡的图片，因为当我在google上搜索鸡的图片时，我得到了一堆不断向下滚动的结果。然而，在我刮完所有的图片后，我的图片只有20张。我认为问题是图片可能会被页面索引，但正如我说过的，在我的浏览器中，没有页面，只有一个页面一直向下滚动，所以我不知道如何在前20页之后刮掉剩下的图片。

from bs4 import *
import requests
import os 

os.mkdir('chickens')

r = requests.get('https://www.google.com/search?q=chickens&client=firefox-b-1-d&sxsrf=AOaemvLwoKYN8RyvBYe-XTRPazSsDAiQuQ:1641698866084&source=lnms&tbm=isch&sa=X&ved=2ahUKEwiLp_bt3KP1AhWHdt8KHZR9C-UQ_AUoAXoECAIQAw&biw=1536&bih=711&dpr=1.25')



soup = BeautifulSoup(r.text, 'html.parser')
images = soup.findAll('img')
images = images[1:]

print(len(images))

web-scraping

beautifulsoup

回答 3

Stack Overflow用户

发布于 2022-01-09 05:04:54

不是一个完美的解决方案，但我认为它会奏效.

首先，googles服务器必须识别您是一个移动客户端，因此您在屏幕末尾有一个next按钮，请使用此链接进行搜索l=heirloom-hp.3..0i512i433j0i512i433i457j0i402l2j0i512l6.4571.6193.0.6957.8.7.0.1.1.0.134.611.6j1.7.0....0...1ac.1.34.heirloom-hp..0.8.613.OJ31YrPZ-B0‘。

然后，由于您有一个next按钮，所以您可以刮掉“next”按钮的href。

https://i.stack.imgur.com/nOJCG.png

在获得href之后，您可以执行另一个requests.get(新url)

再重复

想象一下我在说什么

如果您请求next按钮href，您将得到下一页。

票数 0

Stack Overflow用户

发布于 2022-08-12 10:52:27

这看起来像是一个半自动的抓取案例，所以您可以手动滚动页面到末尾，然后使用python来抓取所有的图像。

在向下滚动页面时，可以有一个“显示更多”按钮，您可以单击它并继续。在我的搜索中总共找到了764张图片，可以很容易地用python进行抓取。

findAll('img')将获得包括非结果图像在内的所有图像。您可以尝试其他库来进行抓取。

票数 0

Stack Overflow用户

发布于 2022-09-30 11:19:42

我们可以从内联JSON中抓取Google图像数据，因为您需要的数据是动态呈现的。

可通过正则表达式提取。为此，我们可以在页面源(Ctrl+U)中搜索第一个图像标题，以找到我们需要的匹配项，如果<script>>元素中有匹配项，则很可能是内联JSON。从那里我们可以提取数据。

首先，我们使用正则表达式查找代码中包含图像所需信息的部分：

# https://regex101.com/r/48UZhY/4
matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

在接下来的步骤中，我们将返回数据的一部分，并且只选择图像所在的JSON的一部分(缩略图，原始的)：

matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
  
# https://regex101.com/r/VPz7f2/1
matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)

然后找到缩略图：

# https://regex101.com/r/Jt5BJW/1
matched_google_images_thumbnails = ", ".join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(", ")

thumbnails = [bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails]

最后找到原始分辨率的图像：

# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))

# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)

full_res_images = [
        bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in matched_google_full_resolution_images
]

要获得所有图像，必须使用浏览器自动化，如selenium或playwright。此外，还可以使用定义要获取的页码的"ijn" URL parameter (大于或等于0)。

检查在线IDE中的代码。

import requests, re, json, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
  }

params = {
    "q": "chickens",              # search query
    "tbm": "isch",                # image results
    "hl": "en",                   # language of the search
    "gl": "us",                   # country where search comes fro
  }

html = requests.get("https://google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

google_images = []
all_script_tags = soup.select("script")

# https://regex101.com/r/eteSIT/1
matched_images_data = "".join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))

matched_images_data_fix = json.dumps(matched_images_data)
matched_images_data_json = json.loads(matched_images_data_fix)
  
# https://regex101.com/r/VPz7f2/1
matched_google_image_data = re.findall(r'\"b-GRID_STATE0\"(.*)sideChannel:\s?{}}', matched_images_data_json)

# https://regex101.com/r/Jt5BJW/1
matched_google_images_thumbnails = ", ".join(
    re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                   str(matched_google_image_data))).split(", ")

thumbnails = [bytes(bytes(thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for thumbnail in matched_google_images_thumbnails]

# removing previously matched thumbnails for easier full resolution image matches.
removed_matched_google_images_thumbnails = re.sub(
        r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', "", str(matched_google_image_data))

# https://regex101.com/r/fXjfb1/4
# https://stackoverflow.com/a/19821774/15164646
matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]", removed_matched_google_images_thumbnails)

full_res_images = [
        bytes(bytes(img, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape") for img in matched_google_full_resolution_images
]
    
for index, (metadata, thumbnail, original) in enumerate(zip(soup.select('.isv-r.PNCib.MSM1fd.BUooTd'), thumbnails, full_res_images), start=1):
    google_images.append({
        "title": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["title"],
        "link": metadata.select_one(".VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb")["href"],
        "source": metadata.select_one(".fxgdke").text,
        "thumbnail": thumbnail,
        "original": original
    })

print(json.dumps(google_images, indent=2, ensure_ascii=False))

示例输出

[
  {
    "title": "Chicken - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Chicken",
    "source": "en.wikipedia.org",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTM_XkDqM-gjEHUeniZF4HYdjmA4G_lKckEylFzHxxa_SiN0LV4-6M_QPuCVMleDm52doI&usqp=CAU",
    "original": "https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Male_and_female_chicken_sitting_together.jpg/640px-Male_and_female_chicken_sitting_together.jpg"
  },
  {
    "title": "Chickens | The Humane Society of the United States",
    "link": "https://www.humanesociety.org/animals/chickens",
    "source": "humanesociety.org",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSYa5_tlXtxNpxDQAU02DWkwK2hVlB3lkY_ljILmh9ReKoVK_pT9TS2PV0-RUuOY5Kkkzs&usqp=CAU",
    "original": "https://www.humanesociety.org/sites/default/files/styles/1240x698/public/2018/06/chickens-in-grass_0.jpg?h=56ab1ba7&itok=uou5W86U"
  },
  {
    "title": "chicken | bird | Britannica",
    "link": "https://www.britannica.com/animal/chicken",
    "source": "britannica.com",
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQCl4LDGrSpsA6eFOY3M1ITTH7KlIIkvctOHuB_CbztbDRsdE4KKJNwArQJVJ7WvwCVr14&usqp=CAU",
    "original": "https://cdn.britannica.com/07/183407-050-C35648B5/Chicken.jpg"
  },
  # ...
]

或者您可以使用来自SerpApi的SerpApi。这是一个免费计划的付费API。不同之处在于，它将绕过Google的块(包括CAPTCHA)，不需要创建解析器并对其进行维护。

简单的代码示例：

from serpapi import GoogleSearch
import os, json

image_results = []
   
# search query parameters
params = {
    "engine": "google",               # search engine. Google, Bing, Yahoo, Naver, Baidu...
    "q": "chicken",                   # search query
    "tbm": "isch",                    # image results
    "num": "100",                     # number of images per page
    "ijn": 0,                         # page number: 0 -> first page, 1 -> second...
    "api_key": os.getenv("API_KEY")   # your serpapi api key
                                      # other query parameters: hl (lang), gl (country), etc  
}
    
search = GoogleSearch(params)         # where data extraction happens
    
images_is_present = True
while images_is_present:
    results = search.get_dict()       # JSON -> Python dictionary
    
# checks for "Google hasn't returned any results for this query."
    if "error" not in results:
        for image in results["images_results"]:
            if image["original"] not in image_results:
                    image_results.append(image["original"])
                
# update to the next page
        params["ijn"] += 1
    else:
        images_is_present = False
        print(results["error"])

print(json.dumps(image_results, indent=2))

输出：

[
  "https://www.spendwithpennies.com/wp-content/uploads/2020/07/1200-Grilled-Chicken-Breast-22.jpeg",
  "https://assets.bonappetit.com/photos/6282c9277e593c16bfea9c61/2:3/w_2430,h_3645,c_limit/0622-Sweet-and-Sticky-Grilled-Chicken.jpg",
  "https://kristineskitchenblog.com/wp-content/uploads/2021/04/grilled-chicken-1200-square-0400-2.jpg",
  "https://thecozycook.com/wp-content/uploads/2021/09/Creamy-Garlic-Chicken-f.jpg",
  "https://www.jocooks.com/wp-content/uploads/2020/01/instant-pot-chicken-breasts-1-10.jpg",
  "https://www.healthbenefitstimes.com/9/uploads/2018/04/Know-about-Chicken-and-health-benefits-702x459.png",
  "https://www.tasteofhome.com/wp-content/uploads/2022/03/Air-Fryer-Rotisserie-Chicken_EXPS_FT22_237368_F_0128_1.jpg?fit=700,1024",
  "https://www.militarytimes.com/resizer/-1j4zK-eaI1KPote1gyV1fw9XVg=/1024x0/filters:format(png):quality(70)/cloudfront-us-east-1.images.arcpublishing.com/archetype/BFPDC4MPLVGONPK2D5XXN7QOXI.png",
  # ...
]

如果您需要更多的代码解释，这里有一篇用Python抓取和下载Google图片博客文章。

免责声明，我为SerpApi工作。

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/70638466

复制

相似问题

问我如何从谷歌图像搜索中抓取所有图像？
EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问我如何从谷歌图像搜索中抓取所有图像？EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问我如何从谷歌图像搜索中抓取所有图像？
EN