Q: Unable to scrape an element from Google Search even though it is visible in the page source

Stack Overflow user

Asked 2016-11-25 10:44:59

Answers: 2, Views: 70, Followers: 0, Votes: 0

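I am trying to scrape the definition of a word from Google Search:

https://www.google.co.in/search?q=define%20subtle#cns=1

All the meanings and examples are visible when I view the page source, but I still cannot scrape them. For example,

<div class="vk_gy">"his language expresses rich and subtle meanings"</div>

is visible in the source, but soup.find("div", class_='vk_gy') returns None.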

python

web-scraping

beautifulsoup

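2 Answers

Stack Overflow user

Answered 2016-12-02 02:17:15

Make sure you are loading the complete HTML string into Beautiful Soup. How are you fetching the HTML? Google does not like having its pages scraped, so the response your script receives may differ from what the browser shows. If you can load the full HTML into Python, you will find that your command should work. Here is my output: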

>>> print(soup.find("div", class_='vk_gy').prettify())
<div class="xpdxpnd vk_gy" data-mh="-1">
 <span>
  adjective:
  <b>
   subtle
  </b>
 </span>
 <span>
  ; comparative adjective:
  <b>
   subtler
  </b>
 </span>
 <span>
  ; superlative adjective:
  <b>
   subtlest
  </b>
 </span>
</div>
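
One way to narrow the problem down is to run the same lookup against a small, fixed HTML string first. If the lookup works there but not on the downloaded page, the selector is fine and the downloaded HTML is the likely culprit. The following is a minimal sketch under that assumption; `SAMPLE_HTML` and `find_by_class` are illustrative names that do not appear in the original answer, and the sample markup is modelled on the snippets quoted above.

```python
from bs4 import BeautifulSoup

# Fixed sample modelled on the markup quoted in the question and in the answer above.
SAMPLE_HTML = """
<div class="xpdxpnd vk_gy" data-mh="-1"><span>adjective: <b>subtle</b></span></div>
<div class="vk_gy">"his language expresses rich and subtle meanings"</div>
"""

def find_by_class(html, class_name):
    """Return the text of every <div> that carries class_name (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(" ", strip=True)
            for div in soup.find_all("div", class_=class_name)]

# class_ matches a div even when vk_gy is only one of several classes.
matches = find_by_class(SAMPLE_HTML, "vk_gy")
print(matches)

# A stand-in for a page that does not contain the element at all.
missing = find_by_class("<html><body><p>placeholder page</p></body></html>", "vk_gy")
print(missing)
```

Running the same helper on the text you actually downloaded (for example `response.text`) shows whether the element is present in what the script received, as opposed to what the browser displays.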

Votes: 0

Stack Overflow user

Answered 2021-09-19 10:03:15

You are looking for the .ubHt5c CSS selector, for example:

examples = soup.select('.ubHt5c')
for example in examples:
    print(example.text)  # or other code

# or
for example in soup.select('.ubHt5c'):
    print(example.text)  # or other code

# or a list comprehension
examples = [example.text for example in soup.select('.ubHt5c')]  # returns a list

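Make sure you send a user-agent header. The default requests user-agent is python-requests, so Google blocks the request because it recognizes a bot rather than a "real" user visit, and you receive different HTML containing some kind of error. Setting a user-agent in the HTTP request headers makes the request look like a visit from a real browser.

I wrote a blog post on how to reduce the chance of being blocked while web scraping search engines, which covers multiple solutions.

Pass the user-agent in the request headers: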

import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
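
To see what this header change amounts to, it can help to inspect the request before sending anything. The sketch below prints the user-agent that requests would send by default and then builds (but does not send) a request with the custom header. The query string `define subtle` is taken from the question; everything else is only an illustration.

```python
import requests

# What requests sends when no User-Agent is supplied, e.g. "python-requests/2.x.y".
default_ua = requests.utils.default_user_agent()

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Build the request without sending it, so the final URL and headers can be inspected.
prepared = requests.Request(
    "GET",
    "https://www.google.com/search",
    params={"q": "define subtle"},
    headers=headers,
).prepare()

print(default_ua)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Because nothing is sent over the network, this is safe to run repeatedly while adjusting headers or query parameters.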

Code and full example in the online IDE:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
  'User-agent':
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
  'q': 'swagger definition',
  'gl': 'us'
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

examples = [example.text for example in soup.select('.ubHt5c')]
print(examples)

# ['"he swaggered along the corridor"', '"they strolled around the camp with an exaggerated swagger"']
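
The two answers in this thread rely on different class names (vk_gy in 2016, .ubHt5c in 2021), which suggests the markup changes over time. A script that silently returns an empty list is then hard to debug. The sketch below shows one possible way to fail loudly instead; `extract_texts` is a hypothetical helper, and the sample HTML is hand-written to resemble the output shown above.

```python
from bs4 import BeautifulSoup

def extract_texts(html, selector):
    """Return the text of every element matching selector, or raise if none match."""
    soup = BeautifulSoup(html, "html.parser")
    texts = [element.get_text(strip=True) for element in soup.select(selector)]
    if not texts:
        raise LookupError(
            f"no elements matched {selector!r}; "
            "the markup may have changed or the request may have been blocked"
        )
    return texts

# Hand-written sample, not a real Google response.
sample = '<div class="ubHt5c">"he swaggered along the corridor"</div>'
print(extract_texts(sample, ".ubHt5c"))
```

In a real script the `html` argument would be the text of the downloaded response, and the raised error is a prompt to re-check both the selector and the request headers.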

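Alternatively, you can achieve the same thing using the Google Direct Answer Box API from SerpApi. It is a paid API with a free plan.

The difference in your case is that you do not have to figure out how to make the scraping work and then maintain it over time; instead, you only need to iterate over structured JSON to get the data you want quickly.

Code to integrate: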

from serpapi import GoogleSearch

params = {
  "api_key": "YOUR_API_KEY",
  "engine": "google",
  "q": "swagger definition",
  "gl": "us",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

examples = results['answer_box']['examples']
print(examples)

# ['"he swaggered along the corridor"', '"they strolled around the camp with an exaggerated swagger"']
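
The direct indexing above raises a KeyError if a response has no answer_box or no examples entry. Whether a given query returns one is not stated here, so a defensive lookup may be worth considering. The sketch below uses a hand-written dictionary as a stand-in for `search.get_dict()`; `extract_examples` and `sample_results` are illustrative names, and only the two keys used in the code above are assumed.

```python
def extract_examples(results):
    """Return the example sentences from a results dict, or [] if they are absent."""
    answer_box = results.get("answer_box") or {}
    return answer_box.get("examples") or []

# Hand-written stand-in for search.get_dict(); only the keys used above are included.
sample_results = {
    "answer_box": {
        "examples": [
            '"he swaggered along the corridor"',
            '"they strolled around the camp with an exaggerated swagger"',
        ]
    }
}

print(extract_examples(sample_results))
print(extract_examples({}))  # a response without an answer box
```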

Disclaimer: I work for SerpApi.

Votes: 0

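The original page content is provided by Stack Overflow. Translation support was provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: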

https://stackoverflow.com/questions/40796967
