Question: Python / BeautifulSoup meta content not matching source

Stack Overflow user

Asked 2022-09-22 20:45:18

Answers: 2 · Views: 56 · Followers: 0 · Votes: 0

I am trying to get the meta tag content from a website. Here is the code:

import requests
from bs4 import BeautifulSoup

url = "https://discord.com/invite/midjourney"
result = requests.get(url=url)
soup = BeautifulSoup(result.content, 'html5lib')

target = soup.find("meta", property="og:description")
print(target)

This returns:

<meta content="Discord is the easiest way to communicate over voice, video, and text.  Chat, hang out, and stay close with your friends and communities." property="og:description"/>

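However, when I view the page source, the content is different and includes the number of members. The member count is what I am after.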

<meta property="og:description" content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,472,611 members" />

Is there some kind of script that changes the meta content dynamically? Any ideas on how to get the actual data from the meta tag?

python

html

web-scraping

beautifulsoup

2 Answers

Stack Overflow user

Answered 2022-09-22 20:51:46

Try:

import requests
from bs4 import BeautifulSoup

url = "https://discord.com/invite/midjourney"  # same URL as in the question

# Send a browser-like User-Agent instead of the requests default
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(url, timeout=30, headers=headers)     # print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')

#1. extract all meta tags from the page, return list of tags
print(soup.select('meta'))

[<meta charset="utf-8"/>,
 <meta content="width=device-width, initial-scale=1.0, maximum-scale=3.0" name="viewport"/>,
 <meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" name="description"/>,
 <meta content="summary_large_image" name="twitter:card"/>,
 <meta content="@discord" name="twitter:site"/>,
 <meta content="Join the Midjourney Discord Server!" name="twitter:title"/>,
 <meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" name="twitter:description"/>,
 <meta content="Join the Midjourney Discord Server!" property="og:title"/>,
 <meta content="https://discord.com/invite/midjourney" property="og:url"/>,
 <meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" property="og:description"/>,
 <meta content="Discord" property="og:site_name"/>,
 <meta content="https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512" property="og:image"/>,
 <meta content="image/jpeg" property="og:image:type"/>,
 <meta content="512" property="og:image:width"/>,
 <meta content="512" property="og:image:height"/>,
 <meta content="https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512" name="twitter:image"/>]

#2. extract all content of the meta tags, return list of text
content_only = [i.get('content') for i in soup.select('meta') if i.get('content')]

print(content_only)

['width=device-width, initial-scale=1.0, maximum-scale=3.0',
 'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
 'summary_large_image',
 '@discord',
 'Join the Midjourney Discord Server!',
 'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
 'Join the Midjourney Discord Server!',
 'https://discord.com/invite/midjourney',
 'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
 'Discord',
 'https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512',
 'image/jpeg',
 '512',
 '512',
 'https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512']

#3. extract the members data that you need
members_content_only = list(set([i.get('content') for i in soup.select('meta') if i.get('content') and 'members' in i.get('content')]))

print(members_content_only)

['The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members']
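If the goal is the number itself rather than the whole description, the string from step 3 can be reduced to an integer. The following is only a sketch: `parse_member_count` is a hypothetical helper name, and it assumes the description keeps the "<number> members" wording seen in the output above.

```python
import re

def parse_member_count(description):
    """Return the member count in a description string as an int, or None."""
    # Assumption: the count appears as digits with optional thousands
    # separators, directly followed by the word "members".
    match = re.search(r"(\d[\d,]*)\s+members\b", description, flags=re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))

# Sample string copied from the output of step 3 above.
sample = ("The official server for Midjourney, a text-to-image AI where your "
          "imagination is the only limit. | 2,473,729 members")
print(parse_member_count(sample))  # prints 2473729
```

With the code above, `parse_member_count(members_content_only[0])` should give the same kind of result for the live page. Returning None instead of raising makes it easy to detect the generic Discord description from the question, which carries no count.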

Votes: 1

Stack Overflow user

Answered 2022-09-23 04:37:31

There is indeed JavaScript at work underneath. I found a different way to extract this information using selenium and bs4.

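from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.support.ui import WebDriverWait

options = FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)

url = "https://discord.com/invite/midjourney"
driver.get(url)

# Constructing WebDriverWait alone does not wait; until() polls (up to 15 s)
# until the rendered page contains the "Members" label used below.
WebDriverWait(driver, 15).until(lambda d: "Members" in d.page_source)

page = driver.page_source
driver.quit()  # release the headless browser once the HTML is captured
html = bs(page, 'html.parser') #print(html)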

for script in html(["script", "style"]):
    script.extract()
text = html.get_text() 

lines = (line.strip() for line in text.splitlines())
text = '\n'.join(line for line in lines if line)

final_string = text.replace(",","")
start = final_string.find("Online")+6
end = final_string.find("Members")-1
subs = final_string[start:end]
subs_final = int(subs)
print(subs_final)

Output:

This is a roundabout way of getting what I want. If there is a more efficient way to do this, I would like to hear it.

Votes: 1
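The `find()`-based slicing above breaks silently if either label is missing, since `find()` returns -1 in that case. A regular expression is one possible alternative. This is a sketch only: `counts_from_text` is a hypothetical helper, and the sample text is made up to mimic the "<number> Online / <number> Members" layout that the slicing code assumes.

```python
import re

def counts_from_text(text):
    """Map each label ("online", "members") to the integer that precedes it."""
    # Assumption: each count is written as digits with optional commas,
    # followed by the label "Online" or "Members".
    pairs = re.findall(r"(\d[\d,]*)\s*(Online|Members)\b", text)
    return {label.lower(): int(number.replace(",", "")) for number, label in pairs}

# Hypothetical page text, not real data from the site.
sample_text = "Midjourney\n1,234 Online\n5,678 Members\nAccept Invite"
print(counts_from_text(sample_text))  # prints {'online': 1234, 'members': 5678}
```

Applied to the `text` variable built above, a missing label simply results in a missing key, which is easier to check than a wrong slice.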

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/73820492
