Q: Getting the complete HTML from a website request with Python

Stack Overflow user

Asked 2019-03-26 04:45:04

Answers: 2 · Views: 507 · Followers: 0 · Votes: 3

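I am trying to send an HTTP request to a website (for example, Digikey) and read back the complete HTML. For example, I use this link: https://www.digikey.com/products/en?keywords=part_number to look up a part number, such as: https://www.digikey.com/products/en?keywords=511-8002-KIT. However, what I get back is not the complete HTML.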

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.digikey.com/products/en?keywords=511-8002-KIT')
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly; omitting it triggers a bs4 warning
print(soup.prettify())

Output:

<!DOCTYPE html>
<html>
 <head>
  <script>
   var i10cdone =(function(){ function pingBeacon(msg){ var i10cimg = document.createElement('script'); i10cimg.src='/i10c@p1/botox/file/nv-loaded.js?status='+window.encodeURIComponent(msg); i10cimg.onload = function(){ (document.head || document.documentElement).removeChild(i10cimg) }; i10cimg.onerror = function(){ (document.head || document.documentElement).removeChild(i10cimg) }; ( document.head || document.documentElement).appendChild(i10cimg) }; pingBeacon('loaded'); if(String(document.cookie).indexOf('i10c.bdddb=c2-f0103ZLNqAeI3BH6yYOfG7TZlRtCrMwqUo')>=0) { document.cookie = 'i10c.bdddb=;path=/';}; var error=''; function errorHandler(e) { if (e && e.error && e.error.stack ) { error=e.error.stack; } else if( e && e.message ) { error = e.message; } else { error = 'unknown';}} if(window.addEventListener) { window.addEventListener('error',errorHandler, false); } else { if ( window.attachEvent ){ window.attachEvent('onerror',errorHandler); }} return function(){ if (window.removeEventListener) {window.removeEventListener('error',errorHandler); } else { if (window.detachEvent) { window.detachEvent('onerror',errorHandler); }} if(error) { pingBeacon('error-' + String(error).substring(0,500)); document.cookie='i10c.bdddb=c2-f0103ZLNqAeI3BH6yYOfG7TZlRtCrMwqUo;path=/'; }}; })();
  </script>
  <script src="/i10c@p1/client/latest/auto/instart.js?i10c.nv.bucket=pci&amp;i10c.nv.host=www.digikey.com&amp;i10c.opts=botox&amp;bcb=1" type="text/javascript">
  </script>
  <script type="text/javascript">
   INSTART.Init({"apiDomain":"assets.insnw.net","correlation_id":"1553546232:4907a9bdc85fe4e8","custName":"digikey","devJsExtraFlags":"{\"disableQuerySelectorInterception\" :true,  'rumDataConfigKey':'/instartlogic/clientdatacollector/getconfig/monitorprod.json','custName':'digikey','propName':'northamerica'}","disableInjectionXhr":true,"disableInjectionXhrQueryParam":"instart_disable_injection","iframeCommunicationTimeout":3000,"nanovisorGlobalNameSpace":"I10C","partialImage":false,"propName":"northamerica","rId":"0","release":"latest","rum":false,"serveNanovisorSameDomain":true,"third_party":["IA://www.digikey.com/js/geotargeting.js"],"useIframeRpc":false,"useWrapper":false,"ver":"auto","virtualDomains":4,"virtualizeDomains":["^auth\\.digikey\\.com$","^authtest\\.digikey\\.com$","^blocked\\.digikey\\.com$","^dynatrace\\.digikey\\.com$","^search\\.digikey\\.com$","^www\\.digikey\\.ca$","^www\\.digikey\\.com$","^www\\.digikey\\.com\\.mx$"]}
);
  </script>
  <script>
   typeof i10cdone === 'function' && i10cdone();
  </script>
 </head>
 <body>
  <script>
   setTimeout(function(){document.cookie="i10c.eac23=1";window.location.reload(true);},30);
  </script>
 </body>
</html>

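The reason I need the complete HTML is to search for specific keywords, for example whether the term "Lead free" or "Through Hole" appears in the results for a particular part number. I am doing this not only for Digikey but for other websites as well.

Any help would be greatly appreciated!

Thanks!

Edit:

Thanks, everyone, for the suggestions and answers. For anyone else interested, more information is available here: Web-scraping JavaScript page with Python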

python-3.x

beautifulsoup

python-requests

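2 Answers

Stack Overflow user

Answered 2019-03-26 04:48:58

The part of the page you are looking for most likely contains content that is generated dynamically with JavaScript.

Open view-source:https://www.digikey.com/products/en?keywords=part_number in your browser and you will see that the request is fetching the full HTML; it is just not executing the JavaScript code.

If you right-click and choose Inspect (Chrome), you will see the final DOM that is created after the JavaScript code has run.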
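One way to confirm this from Python is to check whether the fetched document contains any visible text at all. In the output shown in the question, everything sits inside script elements and the body only sets a cookie and reloads. The following is a rough heuristic sketch using only the standard library; `looks_like_js_stub` is a hypothetical helper name, and an empty result is a hint that JavaScript rendering is involved, not proof.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that is not inside <script> or <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only runs between tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_like_js_stub(html):
    """Heuristic: True when the document has no visible text outside scripts."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return not parser.chunks
```

With the code from the question, `looks_like_js_stub(r.text)` returning `True` would suggest that the response is only a script-driven placeholder and that a browser is needed to obtain the real page.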

To get the rendered content, you need a full web driver such as Selenium, which can execute JavaScript to render the whole page.

Here is an example of how to do this with Selenium:

How can I parse a website using Selenium and Beautifulsoup in python?

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()  # launches a real Firefox instance that runs the page's JavaScript
driver.get('http://news.ycombinator.com')
html = driver.page_source  # the DOM serialized after scripts have executed
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('title'):
    print(tag.text)
driver.quit()  # close the browser when done

Output:

Hacker News
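Once the rendered page is available, the keyword check described in the question can be done on its text. The sketch below is one simple way to do that, assuming plain substring matching is good enough; `keyword_hits` is a hypothetical helper, and the sample text is made up for illustration rather than taken from a real Digikey page.

```python
import re


def keyword_hits(page_text, keywords):
    """Map each keyword to True/False depending on whether it occurs in page_text.

    Matching ignores case and treats any run of whitespace as a single space,
    so "Through   Hole" split across lines still matches "through hole".
    """
    normalized = re.sub(r"\s+", " ", page_text).casefold()
    return {
        kw: re.sub(r"\s+", " ", kw).strip().casefold() in normalized
        for kw in keywords
    }


# Illustrative text standing in for the text extracted from a rendered page.
sample_text = "Mounting Type\n  Through\n Hole\nRoHS Status: Lead free"
print(keyword_hits(sample_text, ["Lead free", "Through Hole", "Surface Mount"]))
# {'Lead free': True, 'Through Hole': True, 'Surface Mount': False}
```

With BeautifulSoup, the text could come from `soup.get_text(" ")` on a `soup` object built from the browser's `page_source`.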

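Votes: 2

Stack Overflow user

Answered 2019-03-26 04:56:33

The problem is probably that the page's JavaScript has not had time to run, so the necessary HTML elements have not been populated. One solution is to drive a browser with Selenium's webdriver: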

from selenium import webdriver
chrome = webdriver.Chrome()
chrome.get("https://www.digikey.com/products/en?keywords=511-8002-KIT")
source = chrome.page_source
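The snippet above reads `page_source` right after `get()` returns. In the output shown in the question, the first document served is a stub that sets a cookie and reloads itself, so one option is to poll until the source no longer looks like that stub. The following is a minimal sketch under that assumption; `wait_for_page_source`, `fetch_rendered_html`, and the `is_ready` predicate are hypothetical names, not part of Selenium, and Selenium's own `WebDriverWait` provides comparable polling.

```python
import time


def wait_for_page_source(driver, is_ready, timeout=15.0, poll_interval=0.5):
    """Poll driver.page_source until is_ready(html) is true or timeout elapses.

    `driver` is any object exposing a `page_source` attribute, such as a
    Selenium WebDriver. Raises TimeoutError if the page never looks ready.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"page not ready after {timeout} seconds")
        time.sleep(poll_interval)


def fetch_rendered_html(url, is_ready):
    """Load url in headless Chrome and return the HTML once is_ready accepts it."""
    # Imported here so the polling helper above runs without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return wait_for_page_source(driver, is_ready)
    finally:
        driver.quit()  # always close the browser, even on timeout
```

A call might look like `fetch_rendered_html("https://www.digikey.com/products/en?keywords=511-8002-KIT", lambda html: "window.location.reload" not in html)`. That readiness test is a guess based on the stub in the question; a check for a marker that only appears on the finished page would be more robust.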

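This is usually much less efficient, because you have to wait for the page to load completely. One way around that is to look for an API the website offers for accessing the data you want directly; I suggest investigating what those APIs might be.

Here is one potential API that could be used to get the data directly: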

https://api-portal.digikey.com/product

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/55346142
