
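Can't scrape product title from a webpage

Stack Overflow user
Asked on 2021-06-08 13:16:22
Answers: 2, Views: 3.2K, Followers: 0, Votes: 7

I'm trying to scrape the title of a product from this webpage using the requests module, but the script always throws AttributeError even though the product title is present in the page source (Ctrl + U).

I've tried with (throws AttributeError):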

import requests
from bs4 import BeautifulSoup

link = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}

res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text,"lxml")
try:
    product_title = soup.select_one("h1 > span").get_text(strip=True)
except AttributeError: product_title = ""
print(product_title)

Expected output:

Gigabyte GeForce RTX 3070 Aorus Master 8GB OC GPU

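How can I scrape the product title from that webpage?

PS: I've also tried the cloudscraper library, but with no luck.

Edit:

This is what I get when I run the following code: raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 403 Client Error: Forbidden for url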

import cfscrape

url = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}

token, agent = cfscrape.get_tokens(url, headers=headers)
print(token, agent)

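I know that if I could obtain the token values from the attempt above, I could use the cf_clearance value from the cookies to access the page content.

2 Answers

Stack Overflow user

Accepted answer

Posted on 2021-06-13 17:55:28

This answer is only a placeholder that may be useful to others researching this Cloudflare bypass problem.

Use case

Harvesting information from a website that uses either a Cloudflare CAPTCHA or a JavaScript challenge for enhanced protection.

Python requests

With standard Python requests, the Cloudflare service returns a 403 Forbidden error code.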

import requests

URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
      '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

response = requests.get(URL, headers=headers)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')
# output
# Status Code: 403
# Status Code Reason: Forbidden

If we look at response.headers, we can see that a Cloudflare server is proxying our request to the target URL.

# ...continued from the code above
for key, value in response.headers.items():
    print(f'KEY NAME: {key}')
    print(f'KEY VALUE: {value}')
    print('-----------------------')
# output
# KEY NAME: Date
# KEY VALUE: Sun, 13 Jun 2021 16:39:03 GMT
# -----------------------
# KEY NAME: Content-Type
# KEY VALUE: text/html; charset=UTF-8
# -----------------------
# KEY NAME: Transfer-Encoding
# KEY VALUE: chunked
# -----------------------
# KEY NAME: Connection
# KEY VALUE: close
# -----------------------
# KEY NAME: Permissions-Policy
# KEY VALUE: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
# -----------------------
# KEY NAME: Cache-Control
# KEY VALUE: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
# -----------------------
# KEY NAME: Expires
# KEY VALUE: Thu, 01 Jan 1970 00:00:01 GMT
# -----------------------
# KEY NAME: X-Frame-Options
# KEY VALUE: SAMEORIGIN
# -----------------------
# KEY NAME: cf-request-id
# KEY VALUE: 0aa7d6c7c4000007ff7201b000000001
# -----------------------
# KEY NAME: Expect-CT
# KEY VALUE: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
# -----------------------
# KEY NAME: Set-Cookie
# KEY VALUE: __cf_bm=72427e2af66c7177feeb88a847fae9c26b66c681-1623602343-1800-AZAmqDfaHZU8IXOH/i3BBVf8pGcws0Gc1Tln5yKUepe3utWlCpagxvALDW6wiHd2pli9Zl45Mg8gC/QSoUFhoes=; path=/; expires=Sun, 13-Jun-21 17:09:03 GMT; domain=.cclonline.com; HttpOnly; Secure; SameSite=None
# -----------------------
# KEY NAME: Vary
# KEY VALUE: Accept-Encoding
# -----------------------
# KEY NAME: Server
# KEY VALUE: cloudflare
# -----------------------
# KEY NAME: CF-RAY
# KEY VALUE: 65ecc0b9383b07ff-ATL
# -----------------------
# KEY NAME: Content-Encoding
# KEY VALUE: gzip
# -----------------------

If we look at the response.text associated with the Python request, we can see further evidence of Cloudflare protection.

# ...continued from the code above
print(response.text)
# output

truncated...

<title>Please Wait... | Cloudflare</title>
<meta name="captcha-bypass" id="captcha-bypass" />

truncated...

<form class="challenge-form managed-form" id="challenge-form" action="/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/?__cf_chl_managed_tk__=7d4597196bb14948881846ca16631b64c55f06d3-1623602854-0-AcX2yHJM2sCalL03Opq9RiFjASeYE0Xs0KG4XeG1lezzhzEyu-bL8xsdHuEjNIIKaJkWEmha4DhViRlqWEP_HREOdA8YAY7nnNkBAHbNMs6p_AWgYNLPnSNM13PO2I96hdABtoaaKjOzV4AyJQJ8f08XEW2flN97rPxIMeiR0tI1a3PiON2dN9E_YCyneAuCUfaYWUNGL0Bqd_rkYp3Ljb2zk_kGWizckr1fvhodSEjEB-ByYVK8ODNox2oZ4XPcmCYJ6UNDmbNc406BjMeTf3e72Z7vgdnt3V714VrGN4w_Y4VQ2X1V0OVKUKEH9B5Rxa_4fEZiMAAdxZ6idg69JYMKftuuLemr53n5WAwTwyX2G7N9jmjtarxEQcCqoj9oY7oSFwQTb3ZVb9i5EeavKaE1_67wxpyPybNidBDxhLazDEMefPZGDsV9mSziuIQ90nS5vn-7sUvC8BJATNWPbh6OduchXy-QcMeYhurtukUCm3oDQMP7r4g4qvDCWI3_-ku7u-B4G2XI2kwM_tLVEZiH5uHPjWpHE6eFWohiCTxd4p7vHg7z5ug9feRalYqu3GfInd82GZ-j-7nCqLDmPh2Sjlu6sJGfopqM3XlBrd1kgRZU3Z4uw6JIIqfH0M6K3_weTtem0-Z1zhDUBbVDvgJVeHNNh_bTxHGWbFB0f80tALBMbt67RftO5u1XBUZ-TRftteXBwJ8gmYzOZTo4lQOGQ_771urYXsTuW_sp8PwxvQpEyCnY8zD8dmVz0-waZhOet8MQMwduN2nfGUOrCMwUYO9McsBqzfsT5PJZVkDm-rYBBwqw0PIwvm1-N8ymAjrpSN6ps4FerqK1uQOo77FLiOq8JCOVqdETIZ9NO07A" method="POST" enctype="application/x-www-form-urlencoded">

truncated...

 <input type="hidden" name="r" value="d5db3eb87c9b42ec7f076916611c296abfd2c842-1623602854-0-AXz7+uyFGbpY1aOLgfZMm0oIiiepEo5I5QmdTnvMmL9fDUc4OMEa2CNYXsbHVjOzdYO+PqegjpNL8R3D9LhDc+Xo0y0ira1zO7foozPj0qdcUpNNr2ZOHqgUyKws6dVgeBNUdF+v9+eNFxSHxOhc4DWDLIw9guBqJg1GaBjG3QCQdZmyFbPxXUQtXTFmtVVuqch9qBFLa/u9deMBCxCWi5fyKoOINtyBtyT4p79ITb9T+6T7fl2epMXNHO6xBW2dPnDP1FmjUQ04CG3ydOaDS5qoSFMPr4InVbMcI2NbQYJYPfWjmncMaga6K+NMNvv8wtiyXpEeWsUgFFeQoDJEuvLI+wkI8mT+vXAnXd8LWy9TpEDVK6uxtLF2C75aU7qJxI9RKANGluWYUXeqE1tXgppgZraIGfRWNPVsQZzqd6SK+Zsg8x8UH7oRRD9blMMPMaekcFQ3zT8QQ5BzEc8wEQ68OhmKbFuAeV/YhhWshpm808gcVHIFH17I+0MEidfV/ny5wBSRZJyQUfOSU9iAv/minNWF6ZA21E/+Zebda2lVF6gyEHgrjecxuOxzY2I2qMm0RCEHO4oSk/X8EtMYirGCQ3FD8PzSvZYx+34QZutXFLVvqT3CR/UcsXybG6wllvIGvZ6j/gdoAwfcS27MyO4mXDMk6TfDqdi+NqlItwgWNdp461RQmPdChRp9kKEy3sTsIAGW9Ky1k/xYYcTvLDpCGFICBEm2JhDyp/FEF9UBYia7XJ4aUEncSUeViqaQ8bXpPk6kEPH5RYEcfaX3he0W5aZHHIGcjgOFZsuu45MWREvbHjO+RcPMib4L+lU1cKQoYx+w5b9e4AJiRnGog3a6E3i/L75bSnk7L3qA+DofeeccI/RPitqDb/lX31fkhwHfdRWoLt+OILsUfHNni/olGABEUDruwDVpR32xlieS7vekdmQL3oOu5BkAOXoObbb+2nzo6Dvgw7M7rb4muC7US4yCTK0BeGSfu2XvFta228IoGIGa8BjUcb09K6nRdWUwrCXLYS+vIJTegKMeyxlMKNXw7vIaPh9vht4zblhN0bqkN/m/opyXEtzLfhsLuEkHdQ0GhTUk2nYgHeKX0j6eW0uQhAD/9TLf6UgILCk0+nQvXfEffQCCe/hEfBfkAgiPhr1E3uyPB4vp6Fpy2nnkkzmGv/3P5wg6afKDmU2Ic32u3U47hOlghnc7NlbzFb5R8Tx6vWrkXMDYHdOaaudLtPp5N9y1ceXXaMNAFMVmoqaiHWuV4KN+2rLolSOGUEFNEoRN6Jw9mlq/zniK23gQ2lSy+wIHPRGvRCxhRr5DeskvLgyviAk7IhLH3zMpqxd7i05BIPV3sB8orBzVE4Rqmam3evpTVEMMFRDt/Ol6XUJi66QrLgJyusuv5xL4pKPWZrw/hn3a5j0zrrChUbvM3S94BeWiJS48hA35S9mXLfaKMAZTYZTMqhbW77qwUuquwW2lPEAgSPY7WvvnNRUPXsS1KCPpiuE0TuDFaZQi9UTqlzkQIq84wqVRjQZ0Y0m3PQeI2BbJZ8woKIKiABWbSOuV/kyy5H4L+RVL7Jmc2ndl3HaQ4XlnwDmTuK/gMbRvZe1taVHOyYsXmfEY4XkiaDUneGjBEGnWyiv49DtiG2TLmmIpP1UITmO677eDSoNLHpxp1guMjwL5m3XHKOFNtpLzuiVH4UJdgTjtnmbGHmKGtyy0k3GPZrwyVkZRyS+FZZ5WhTs05rhS+1sg3oDCyTbWeYX9T4VVswRjxq1HsyH8NdZTN4f9BTn9VU0+9JnVAkgLM4JCkV6wqwQf+QMK/MaYWvBwSjYgFUxdEdT7Rls85/M+4GxcaGsiNmsA5Q==">
  <input type="hidden" name="cf_captcha_kind" value="h">
  <input type="hidden" name="vc" value="4845a44c225a1fa6a61708e11b613971">

truncated...

 <script type="text/javascript">
    //<![CDATA[
    (function(){
        var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
        var trkjs = isIE ? new Image() : document.createElement('img');
        trkjs.setAttribute("src", "/cdn-cgi/images/trace/managed/js/transparent.gif?ray=65eccd326d61f331");
        trkjs.id = "trk_managed_js";
        trkjs.setAttribute("alt", "");
        document.body.appendChild(trkjs);
        var cpo=document.createElement('script');
        cpo.type='text/javascript';
        cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1?ray=65eccd326d61f331";
        document.getElementsByTagName('head')[0].appendChild(cpo);
    }());
    //]]>
    </script>

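The information above shows that the Python request sent to the target URL was intercepted by a Cloudflare server, which issued a challenge in response. This challenge must be solved before the initial request is allowed to proceed.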
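
The signals seen above (status code, Server header, challenge markup) can be combined into a rough check that tells a challenge page apart from the real product page before any parsing is attempted. This is a sketch under stated assumptions: the function name is hypothetical, the markers are taken from the excerpt shown in this answer, and treating 503 as a challenge status is an assumption beyond what was observed here.

```python
def looks_like_cloudflare_challenge(status_code, headers, body):
    """Heuristic: does this response resemble a Cloudflare challenge page?"""
    # Header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    served_by_cloudflare = lowered.get("server", "").lower().startswith("cloudflare")
    # Markers copied from the response.text excerpt above.
    has_challenge_markup = (
        'id="challenge-form"' in body
        or "/cdn-cgi/challenge-platform/" in body
    )
    # 403 is what this answer observed; 503 is included as an assumption.
    blocked_status = status_code in (403, 503)
    return blocked_status and served_by_cloudflare and has_challenge_markup


sample_headers = {"Server": "cloudflare", "CF-RAY": "65ecc0b9383b07ff-ATL"}
sample_body = (
    "<title>Please Wait... | Cloudflare</title>"
    '<form class="challenge-form managed-form" id="challenge-form" method="POST">'
)
print(looks_like_cloudflare_challenge(403, sample_headers, sample_body))
print(looks_like_cloudflare_challenge(200, sample_headers, "<h1><span>Title</span></h1>"))
```

With requests, the arguments would be response.status_code, response.headers, and response.text. A check like this explains why soup.select_one("h1 > span") returns None in the question: the parsed document is the challenge page, not the product page.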

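cfscrape package

The OP noted that they tried to use the cfscrape Python package to obtain token information from the Cloudflare server.

Standard cfscrape produces the same response as Python requests.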

import cfscrape

URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
      '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

scraper = cfscrape.create_scraper(delay=10)
response = scraper.get(URL, headers=headers)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')
# output
# Status Code: 403
# Status Code Reason: Forbidden

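The cfscrape package also supports the functions get_tokens and get_cookie_string, but both of these functions produced the 403 Forbidden error code.

From the cfscrape source code: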

def is_cloudflare_captcha_challenge(resp):
        return (
            resp.status_code == 403
            and resp.headers.get("Server", "").startswith("cloudflare")
            and b"/cdn-cgi/l/chk_captcha" in resp.content
        )


# the function above is called from this

def request(self, method, url, *args, **kwargs):
        resp = super(CloudflareScraper, self).request(method, url, *args, **kwargs)

        # Check if Cloudflare captcha challenge is presented
        if self.is_cloudflare_captcha_challenge(resp):
            self.handle_captcha_challenge(resp, url)

        # Check if Cloudflare anti-bot "I'm Under Attack Mode" is enabled
        if self.is_cloudflare_iuam_challenge(resp):
            resp = self.solve_cf_challenge(resp, **kwargs)

        return resp

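The handle_captcha_challenge function attempts to solve the Cloudflare JavaScript challenge. This part of the code is what fails. It is not yet clear which piece of it fails, so more research and testing is needed.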
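
One way to narrow this down is to evaluate the three conditions of is_cloudflare_captcha_challenge separately against a saved response. The sketch below restates those conditions as a standalone function. The name captcha_check_breakdown is hypothetical, and the sample body is only the script path from the truncated excerpt earlier in this answer, so the result says nothing certain about the full page.

```python
def captcha_check_breakdown(status_code, server_header, content):
    """Report each condition from cfscrape's captcha check on its own."""
    return {
        "status_is_403": status_code == 403,
        "server_is_cloudflare": server_header.startswith("cloudflare"),
        "has_chk_captcha_path": b"/cdn-cgi/l/chk_captcha" in content,
    }


# Sample bytes based on the truncated response.text excerpt, not the full page.
sample_content = (
    b'cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/managed/v1'
    b'?ray=65eccd326d61f331";'
)
result = captcha_check_breakdown(403, "cloudflare", sample_content)
for condition, holds in result.items():
    print(f"{condition}: {holds}")
print(f"all conditions hold: {all(result.values())}")
```

With requests or cfscrape, the arguments would be resp.status_code, resp.headers.get("Server", ""), and resp.content. If the third condition is False for the real page, the captcha branch is never entered, which would be worth ruling out before digging into handle_captcha_challenge itself.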

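Please note: according to the package's developer, this module is no longer supported.

cloudscraper package

The OP also noted that they tried to use the cloudscraper Python package to obtain token information from the Cloudflare server. cloudscraper was forked from cfscrape, so the syntax is similar.

cloudscraper gets the same 403 Forbidden error code as cfscrape.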

import cloudscraper

URL = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX' \
      '-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'}

scraper = cloudscraper.create_scraper()
response = scraper.get(URL)
print(f'Status Code: {response.status_code}')
print(f'Status Code Reason: {response.reason}')
# output
# Status Code: 403
# Status Code Reason: Forbidden

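The cloudscraper package also supports the functions get_tokens and get_cookie_string, but both of these functions produced the 403 Forbidden error code.

selenium package

The OP also noted that they tried to use the selenium package.

Special note: during my testing I used selenium with the web drivers from Google, Mozilla, and Microsoft.

Within the past 12 months, the following options could be used with selenium to bypass Cloudflare protection. Unfortunately, these options no longer work today.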

chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
# additional disable-blink-features are available in Chromium source code on Github
chrome_options.add_argument("--disable-blink-features=AutomationControlled")

Below is a selenium code sample that uses Chrome with the switches above.

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options)
URL = "https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934"

driver.get(URL)

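The code above opens a browser session that is met with the Cloudflare JavaScript challenge. During testing with the switches above, this challenge never completed. The Cloudflare Ray ID, which is a unique ID for each request, rotated many times before I manually killed the session.

selenium-wire is needed to obtain the status code

Below is a headless session, which also shows a 403 Forbidden error code for the target URL. The session also shows that anti-bot technology from hcaptcha.com is now in the mix.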

from seleniumwire import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--headless")
chrome_options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
chrome_options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver', options=chrome_options)
URL = "https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934"
driver.get(URL)

for request in driver.requests:
    print(f'Status Code: {request.response}')
    print(f'Host Name: {request.host}')
# output
# Status Code: 403
# Host Name: www.cclonline.com
# -----------------------
# Status Code: 200
# Host Name: www.cclonline.com
# -----------------------
# Status Code: 200
# Host Name: www.cclonline.com
# -----------------------
# Status Code: 200
# Host Name: www.cclonline.com
# -----------------------
# Status Code: 200
# Host Name: www.cclonline.com
# -----------------------
# Status Code: 200
# Host Name: www.cclonline.com
# -----------------------
# Status Code: 200
# Host Name: www.cclonline.com
# -----------------------
# Status Code: 200
# Host Name: www.cclonline.com
# -----------------------
# Status Code: 302
# Host Name: hcaptcha.com
# -----------------------
# Status Code: 200
# Host Name: newassets.hcaptcha.com
# -----------------------
driver.quit()

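A standard Chrome session with a UI displays an iFrame containing an "I am human" checkbox.

If that checkbox is clicked, either manually or through the selenium session, I am prompted with an image captcha, which adds to the complexity of bypassing the Cloudflare protection.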

cf_clearance cookie

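When a Cloudflare CAPTCHA or JavaScript challenge is solved, a cf_clearance cookie is set in the client browser. The cf_clearance cookie has a default lifetime of 30 minutes, which a Cloudflare customer can configure.

If you manually open the OP's target URL in Google Chrome, you can see the cf_clearance cookie using the developer tools.

Based on the UTC time when this session started and the expiration date set for the cookie, the cf_clearance cookie lifetime appears to be set to 60 minutes.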
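
The lifetime estimate is simply the cookie's expiry time minus the session start time. A minimal sketch of that arithmetic follows; the two timestamps are hypothetical stand-ins chosen to give a 60-minute gap, since the actual values read from the developer tools are not recorded in this answer.

```python
from datetime import datetime, timezone


def cookie_lifetime_minutes(session_start, cookie_expires):
    """Return the gap between session start and cookie expiry in minutes."""
    return (cookie_expires - session_start).total_seconds() / 60


# Hypothetical values; substitute the times shown in the developer tools.
session_start = datetime(2021, 6, 13, 16, 0, 0, tzinfo=timezone.utc)
cookie_expires = datetime(2021, 6, 13, 17, 0, 0, tzinfo=timezone.utc)

lifetime = cookie_lifetime_minutes(session_start, cookie_expires)
print(f"cf_clearance lifetime: {lifetime:.0f} minutes")
```

Both datetimes are timezone-aware and in UTC so that the subtraction is not skewed by a local offset.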

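So far I have not found a way to extract this cookie using Python.

Votes: 12

Stack Overflow user

Posted on 2021-06-12 06:59:32

Some things you need in the request headers!

  • Cookie "cf_clearance"
  • User-Agent

Example

Steps to get the cookie

  1. Open Chrome DevTools
  2. Switch to the "Network" tab
  3. Copy the request headers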

import requests
from bs4 import BeautifulSoup

link = 'https://www.cclonline.com/product/334427/GV-N3070AORUS-M-8GD-1-1/Graphics-Cards/Gigabyte-AORUS-GeForce-RTX-3070-MASTER-8GB-Overclocked-Graphics-Card-rev-1-1-/VGA5934/'
h = '''cookie: cf_clearance=718abb68f064be7612ee987ab9d8bc755016f3c2-1623437208-0-150
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4539.2 Safari/537.36'''
# Split on the first ': ' only, so values that contain ': ' stay intact.
h = dict(l.split(': ', 1) for l in h.split('\n') if ': ' in l)
res = requests.get(link, headers=h)
soup = BeautifulSoup(res.text, "lxml")
try:
    product_title = soup.select_one("h1 > span").get_text(strip=True)
except AttributeError:
    product_title = ""
print(product_title)
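
Header text copied from DevTools can include blank lines, values that themselves contain colons, and pseudo-headers such as :authority. As a hedged sketch, the dict(...) one-liner above could be replaced by a small helper that tolerates those cases. The name parse_raw_headers and the sample values are illustrative assumptions, not part of the original answer.

```python
def parse_raw_headers(raw):
    """Turn header text copied from browser DevTools into a dict."""
    headers = {}
    for line in raw.splitlines():
        # Split on the first colon only; values may contain colons.
        name, separator, value = line.partition(":")
        # Skip blank lines and pseudo-headers (":authority" has an empty name).
        if not separator or not name.strip():
            continue
        headers[name.strip()] = value.strip()
    return headers


# Made-up sample; a real cf_clearance value comes from your own browser session.
raw_headers = """:authority: www.cclonline.com
cookie: cf_clearance=EXAMPLE-0-150

user-agent: Mozilla/5.0 (Example: Test) AppleWebKit/537.36
"""
parsed = parse_raw_headers(raw_headers)
for header_name, header_value in parsed.items():
    print(f"{header_name} -> {header_value}")
```

The resulting dictionary can be passed as the headers argument of requests.get in place of h.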
Votes: -3

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/67887751
