
Python web scraping on Patreon with bs4

Stack Overflow user
Asked on 2020-07-18 23:21:35
Answers: 1 · Views: 468 · Followers: 0 · Votes: 1

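I have written a script that checks a few blogs to see whether a new post has been added. However, when I try to do this on Patreon, I cannot find the right element with bs4.

Let's take https://www.patreon.com/cubecoders as an example.

Say I want to get the number of exclusive posts shown under the "Become a patron" section, which is 25 as of now.

This code works fine: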

import requests
from bs4 import BeautifulSoup

plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("div", class_="sc-AxjAm fXpRSH").text
print(text_of_newest_post)

Output: 25

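Now I want to get the title of the latest post, which is "New in AMP 2.0.2 - Integrated SCP/SFTP server!" as of now. I inspected the title in my browser and saw that it is contained in a span tag with the class 'sc-1di2uql-1 vYcWR'.

However, when I run this code, I cannot fetch the element: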

import requests
from bs4 import BeautifulSoup

plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("span", class_="sc-1di2uql-1 vYcWR")
print(text_of_newest_post)

Output: None

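I have already tried to get the element with an XPath or a CSS selector, but could not. I think this may be because the site is rendered with JavaScript first, so I cannot access the elements before they have been rendered. When I render the site with Selenium first, I can see the title when I print out all the div tags on the page, but when I want only the first title, I cannot access it.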
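One way to test the JavaScript-rendering hypothesis is to check whether the class token appears anywhere in the markup the server actually returns. The sketch below uses only the standard library and a hypothetical stand-in for the server response (`sample_html` is invented for illustration and is not Patreon's real markup); the helper names are likewise assumptions.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class token that appears on any tag in the markup."""

    def __init__(self):
        super().__init__()
        self.tokens = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.tokens.update(value.split())


def classes_in_static_html(html):
    """Return the set of class tokens present in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    return collector.tokens


# Hypothetical stand-in for a server response: the counter is in the markup,
# the post title is not (a script would add it later in the browser).
sample_html = """
<div class="sc-AxjAm fXpRSH">25</div>
<div id="posts"></div>
<script>/* posts are fetched and rendered client-side */</script>
"""

tokens = classes_in_static_html(sample_html)
print("fXpRSH" in tokens)  # True: the counter class is in the static markup
print("vYcWR" in tokens)   # False: the title class is absent until scripts run
```

To apply the same check to the real page, pass `requests.get(url).text` to `classes_in_static_html`. If the title's class is missing from that set, no parser or selector will find it in the raw response.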

Do you know of any workaround? Thanks in advance!

Edit: In Selenium I can do this:

from selenium import webdriver
browser = webdriver.Chrome(r"C:\webdrivers\chromedriver.exe")  # raw string keeps the backslashes literal
browser.get("https://www.patreon.com/cubecoders")
divs = browser.find_elements_by_tag_name("div")


def find_text(divs):
    for div in divs:
        for span in div.find_elements_by_tag_name("span"):
            if span.get_attribute("class") == "sc-1di2uql-1 vYcWR":
                return span.text

            
print(find_text(divs))
browser.close()

Output: New in AMP 2.0.2 - Integrated SCP/SFTP server!

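When I search for the span with the class 'sc-1di2uql-1 vYcWR' directly from the start, it gives no result. Could it be that the find_elements method does not look deeper into nested tags?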
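Selenium's element lookups do search nested descendants, so two other causes seem more likely here, though neither is confirmed by the thread. The first is that 'sc-1di2uql-1 vYcWR' is two class tokens, and a class-name lookup expects a single token. The second is that the span may not exist yet at the moment of the lookup. The sketch below builds a CSS selector that requires both tokens and pairs it with an explicit wait. The helper names `css_selector` and `wait_for_text` are hypothetical, and the Selenium imports are kept inside the function so the selector helper runs without Selenium installed.

```python
def css_selector(tag, class_attr):
    """Build a CSS selector that requires every class token in class_attr."""
    return tag + "".join("." + token for token in class_attr.split())


def wait_for_text(browser, selector, timeout=10):
    """Wait until an element matching selector exists, then return its text.

    Selenium is imported locally so css_selector() works without it.
    """
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    element = WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
    return element.text


selector = css_selector("span", "sc-1di2uql-1 vYcWR")
print(selector)  # span.sc-1di2uql-1.vYcWR
```

With a running browser, `wait_for_text(browser, selector)` would replace the nested loops above. Note that this simple helper does not escape class tokens that begin with a digit, and that generated class names like these can change whenever the site is redeployed.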


1 Answer

Stack Overflow user

Answered on 2020-07-18 23:51:46

The data you see is loaded via Ajax from their API. You can use the requests module to load the data.

For example:

import re
import json
import requests
from bs4 import BeautifulSoup


url = 'https://www.patreon.com/cubecoders'
api_url = 'https://www.patreon.com/api/posts'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': url
}


with requests.Session() as s:
    # Fetch the creator page once; the campaign ID is embedded in its HTML.
    html_text = s.get(url, headers=headers).text
    campaign_id = re.search(r'https://www\.patreon\.com/api/campaigns/(\d+)', html_text).group(1)

    # Ask the posts API for this campaign's posts, newest first.
    params = {
        'filter[campaign_id]': campaign_id,
        'filter[contains_exclusive_posts]': 'true',
        'sort': '-published_at',
    }
    data = s.get(api_url, headers=headers, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some information to screen:
    for d in data['data']:
        print('{:<70} {}'.format(d['attributes']['title'], d['attributes']['published_at']))

Prints:

New in AMP 2.0.2 - Integrated SCP/SFTP server!                         2020-07-17T13:28:49.000+00:00
AMP Enterprise Pricing Reveal!                                         2020-07-07T10:02:02.000+00:00
AMP Enterprise Edition Waiting List                                    2020-07-03T13:25:35.000+00:00
Upcoming changes to the user system                                    2020-05-29T10:53:43.000+00:00
More video tutorials! What do you want to see?                         2020-05-21T12:20:53.000+00:00
Third AMP tutorial - Windows installation!                             2020-05-21T12:19:23.000+00:00
Another day, another video tutorial!                                   2020-05-08T22:56:45.000+00:00
AMP Video Tutorial - Out takes!                                        2020-05-05T23:01:57.000+00:00
AMP Video Tutorials - Installing AMP on Linux                          2020-05-05T23:01:46.000+00:00
What is the AMP Console Assistant (AMPCA), and why does it exist?      2020-05-04T01:14:39.000+00:00
Well that was unexpected...                                            2020-05-01T11:21:09.000+00:00
New Goal - MariaDB/MySQL Support!                                      2020-04-22T13:41:51.000+00:00
Testing out AMP Enterprise Features                                    2020-03-31T18:55:42.000+00:00
Temporary feature unlock for all Patreon backers!                      2020-03-11T14:53:31.000+00:00
Preparing for Enterprise                                               2020-03-11T13:09:40.000+00:00
Aarch64/ARM64 and Raspberry Pi is here!                                2020-03-06T19:07:09.000+00:00
Aarch64/ARM64 and Raspberry Pi progress!                               2020-02-26T17:53:53.000+00:00
Wallpaper!                                                             2020-02-13T11:04:39.000+00:00
Instance Templating - Make once, deploy many.                          2020-02-06T15:26:09.000+00:00
Time for a new module!                                                 2020-01-07T13:41:17.000+00:00
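Since the original goal was to notice when a new post appears, the JSON response can be compared against the timestamp of the last post already seen. The following is a minimal sketch, assuming the payload has the shape used above (a `data` list whose items carry `attributes.title` and `attributes.published_at`). The function names and `sample_payload` are hypothetical; the sample values are copied from the output listed above.

```python
from datetime import datetime


def newest_post(payload):
    """Return (title, published_at) of the most recent post, or None if empty."""
    posts = payload.get("data", [])
    if not posts:
        return None
    latest = max(
        posts,
        key=lambda post: datetime.fromisoformat(post["attributes"]["published_at"]),
    )
    return latest["attributes"]["title"], latest["attributes"]["published_at"]


def has_new_post(payload, last_seen_published_at):
    """True if the payload holds a post newer than the stored timestamp."""
    latest = newest_post(payload)
    if latest is None:
        return False
    if last_seen_published_at is None:
        return True  # nothing stored yet, so any post counts as new
    return datetime.fromisoformat(latest[1]) > datetime.fromisoformat(
        last_seen_published_at
    )


# Two entries copied from the printed output, in the assumed payload shape.
sample_payload = {
    "data": [
        {"attributes": {"title": "AMP Enterprise Pricing Reveal!",
                        "published_at": "2020-07-07T10:02:02.000+00:00"}},
        {"attributes": {"title": "New in AMP 2.0.2 - Integrated SCP/SFTP server!",
                        "published_at": "2020-07-17T13:28:49.000+00:00"}},
    ]
}

print(newest_post(sample_payload)[0])
print(has_new_post(sample_payload, "2020-07-07T10:02:02.000+00:00"))
```

In a real script, `last_seen_published_at` would be read from and written back to a small file or database between runs, which this sketch leaves out.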
Votes: 1
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/62970262
