Q: BeautifulSoup does not return Twitch.tv viewer counts

Stack Overflow user

Asked 2018-10-06 17:17:23

Answers: 1, Views: 929, Followers: 0, Votes: 3

I am trying to scrape the viewer counts from the www.twitch.tv/directory page using Python. I first tried a basic BeautifulSoup script:

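from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.twitch.tv/directory'
html = urlopen(url)
soup = BeautifulSoup(html, "html5lib")  # parse the response, not the URL; also tried html.parser, lxml
print(soup.prettify())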

This gives me the HTML, but without the actual viewer numbers.

Then I tried passing param data, following this thread:

param = {"action": "getcategory",
        "br": "f21",
        "category": "dress",
        "pageno": "",
        "pagesize": "",
        "sort": "",
        "fsize": "",
        "fcolor": "",
        "fprice": "",
        "fattr": ""}

import requests

url = "https://www.twitch.tv/directory"
# Also tried with the headers parameter headers={"User-Agent":"Mozilla/5.0...
js = requests.get(url, params=param).json()

But I get a JSONDecodeError: Expecting value: line 1 column 1 (char 0) error.

After that, I moved on to Selenium.

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Edge()
url = 'https://www.twitch.tv/directory'
driver.get(url)
# Also tried driver.execute_script("return document.documentElement.outerHTML") and innerHTML
html = driver.page_source
driver.close()
soup = BeautifulSoup(html, "lxml")

This gave the same results as the plain BeautifulSoup call.

Any help with scraping the viewer counts would be greatly appreciated.

javascript

python

web-scraping

beautifulsoup

twitch

1 Answer

Stack Overflow user

Accepted answer

Answered 2018-10-06 17:44:53

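The stats are not in the page when it first loads. The page makes a GraphQL request to https://gql.twitch.tv/gql to fetch the game data. When the user is not logged in, that request asks for the AnonFrontPage_TopChannels query.

Here is a working request in Python: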

import requests
import json

resp = requests.post(
    "https://gql.twitch.tv/gql",
    json.dumps(
        {
            "operationName": "AnonFrontPage_TopChannels",
            "variables": {"platformType": "all", "isTagsExperiment": True},
            "extensions": {
                "persistedQuery": {
                    "version": 1,
                    "sha256Hash": "d94b2fd8ad1d2c2ea82c187d65ebf3810144b4436fbf2a1dc3af0983d9bd69e9",
                }
            },
        }
    ),
    headers={"Client-Id": "kimne78kx3ncx6brgo4mv6wki5h1ko"},
)

print(json.loads(resp.content))

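I included the Client-Id in the request. The ID does not seem to be unique to a session, but I expect it expires eventually, so this probably won't work forever. You will have to inspect future GraphQL requests and grab a new Client-Id, or figure out how to scrape one from the page programmatically.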
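One way to approach the programmatic option is a small helper that searches already-fetched page source for something that looks like a Client-Id. This is only a sketch: the name `find_client_id`, the `clientId="..."` pattern, and the sample HTML are all assumptions made for illustration, not a documented Twitch format, so check the real page source (or the scripts it loads) and adjust the pattern accordingly.

```python
import re
from typing import Optional

# Hypothetical pattern: assumes the ID appears as clientId="<lowercase alphanumerics>"
# (or with ':' instead of '='). This is an assumption, not a documented format.
CLIENT_ID_PATTERN = re.compile(r'clientId\s*[:=]\s*"([a-z0-9]+)"')


def find_client_id(page_source: str) -> Optional[str]:
    """Return the first Client-Id-looking value in page_source, or None."""
    match = CLIENT_ID_PATTERN.search(page_source)
    return match.group(1) if match else None


# Made-up sample input, so the helper can be tried without touching the network.
sample = '<script>window.config = {clientId="abc123def456"};</script>'
print(find_client_id(sample))
```

Returning `None` instead of raising lets the caller fall back to a manually copied ID when the pattern stops matching.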

This request actually appears to be for the top live channels section. Here is how to get the viewer counts and titles:

edges = json.loads(resp.content)["data"]["streams"]["edges"]
games = [(f["node"]["title"], f["node"]["viewersCount"]) for f in edges]

# games:
[
    ("Let us GAME", 78250),
    ("(REBROADCAST) Worlds Play-In Knockouts: Cloud9 vs. Gambit Esports", 36783),
    ("RuneFest 2018 - OSRS Reveals !schedule", 35042),
    (None, 25237),
    ("Front Page of TWITCH + Fortnite FALL SKIRMISH Training!", 22380),
    ("Reckful - 3v3 with barry and a german", 20399),
]
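The `None` title in the output above shows that some fields can be missing. A slightly more defensive version of the extraction might look like the sketch below. It assumes only the response shape used in this answer (`data` → `streams` → `edges` → `node`, with `title` and `viewersCount`); the function name `top_streams`, the "(untitled)" placeholder, and the sample payload are made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def top_streams(payload: Dict[str, Any]) -> List[Tuple[str, int]]:
    """Return (title, viewers) pairs sorted by viewer count, highest first."""
    # Each level falls back to an empty container if it is missing or None.
    data = payload.get("data") or {}
    streams = data.get("streams") or {}
    edges = streams.get("edges") or []
    result = []
    for edge in edges:
        node = edge.get("node") or {}
        title = node.get("title") or "(untitled)"
        viewers = node.get("viewersCount") or 0
        result.append((title, viewers))
    return sorted(result, key=lambda pair: pair[1], reverse=True)


# Hand-written sample in the same shape as the response parsed above.
sample_payload = {
    "data": {
        "streams": {
            "edges": [
                {"node": {"title": None, "viewersCount": 25237}},
                {"node": {"title": "Let us GAME", "viewersCount": 78250}},
            ]
        }
    }
}
print(top_streams(sample_payload))
```

With a live response you would call it as `top_streams(resp.json())`.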

To get more data, you will need to look at the Chrome network inspector and work out the structure of the other requests.
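Once you have copied another operation's name, variables, and hash from the network inspector, a small helper can build the request body so the nested structure is not retyped each time. This is a sketch; `build_persisted_query` is a hypothetical name, and the body layout is assumed from the request shown earlier in this answer rather than from any Twitch documentation.

```python
import json
from typing import Any, Dict


def build_persisted_query(
    operation_name: str, variables: Dict[str, Any], sha256_hash: str
) -> str:
    """Serialize a persisted-query body in the same layout as the request above."""
    body = {
        "operationName": operation_name,
        "variables": variables,
        "extensions": {
            "persistedQuery": {"version": 1, "sha256Hash": sha256_hash}
        },
    }
    return json.dumps(body)


# Values copied from the request earlier in this answer.
payload = build_persisted_query(
    "AnonFrontPage_TopChannels",
    {"platformType": "all", "isTagsExperiment": True},
    "d94b2fd8ad1d2c2ea82c187d65ebf3810144b4436fbf2a1dc3af0983d9bd69e9",
)
print(payload)
```

The returned string can be passed as the second argument to `requests.post`, together with the same `Client-Id` header as before.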

Here is an example for the directory page:

import requests
import json

resp = requests.post(
    "https://gql.twitch.tv/gql",
    json.dumps(
        {
            "operationName": "BrowsePage_AllDirectories",
            "variables": {
                "limit": 30,
                "directoryFilters": ["GAMES"],
                "isTagsExperiment": True,
                "tags": [],
            },
            "extensions": {
                "persistedQuery": {
                    "version": 1,
                    "sha256Hash": "75fb8eaa6e61d995a4d679dcb78b0d5e485778d1384a6232cba301418923d6b7",
                }
            },
        }
    ),
    headers={"Client-Id": "kimne78kx3ncx6brgo4mv6wki5h1ko"},
)

edges = json.loads(resp.content)["data"]["directoriesWithTags"]["edges"]
games = [f["node"] for f in edges]
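The field names inside each directory node are not listed here, so rather than guess them, one option is to print whichever keys the nodes actually contain before writing any extraction code. A minimal sketch: `node_keys` is a hypothetical helper, and the sample edges (including their field names) are invented purely to show the output format.

```python
from typing import Any, Dict, Iterable, List


def node_keys(edges: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the sorted union of keys found across all edge nodes."""
    keys = set()
    for edge in edges:
        node = edge.get("node") or {}  # tolerate a missing or None node
        keys.update(node.keys())
    return sorted(keys)


# Invented sample data; real nodes may use different field names.
sample_edges = [
    {"node": {"id": "1", "name": "Example Game"}},
    {"node": {"id": "2", "viewersCount": 10}},
    {"node": None},
]
print(node_keys(sample_edges))
```

Running `node_keys(edges)` on the real `edges` list shows which fields are available to pull into `games`.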

Votes: 4

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/52681511
