Python / BeautifulSoup: meta content does not match page source

Stack Overflow user
Asked 2022-09-22 20:45:18
Answers: 2, Views: 56, Followers: 0, Votes: 0

I am trying to get the meta tag content from a website. Here is the code:

import requests
from bs4 import BeautifulSoup

url = "https://discord.com/invite/midjourney"
result = requests.get(url=url)
soup = BeautifulSoup(result.content, 'html5lib')

target = soup.find("meta", property="og:description")
print(target)

This returns:

<meta content="Discord is the easiest way to communicate over voice, video, and text.  Chat, hang out, and stay close with your friends and communities." property="og:description"/>

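However, when I view the page source, the content is different: it includes the number of members. The member count is what I am after.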

<meta property="og:description" content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,472,611 members" />

Is some kind of script changing the meta content dynamically? Any ideas on how to get the actual data in the meta tag?


2 Answers

Stack Overflow user

Answered 2022-09-22 20:51:46

Try:

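import requests
from bs4 import BeautifulSoup

# Same URL as in the question (it was not defined in this snippet)
url = "https://discord.com/invite/midjourney"

# Send a browser-like User-Agent; the question's request, sent without one,
# got back the generic Discord description
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(url, timeout=30, headers=headers)     # print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')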

#1. extract all meta tags from the page, return list of tags
print(soup.select('meta'))

[<meta charset="utf-8"/>,
 <meta content="width=device-width, initial-scale=1.0, maximum-scale=3.0" name="viewport"/>,
 <meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" name="description"/>,
 <meta content="summary_large_image" name="twitter:card"/>,
 <meta content="@discord" name="twitter:site"/>,
 <meta content="Join the Midjourney Discord Server!" name="twitter:title"/>,
 <meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" name="twitter:description"/>,
 <meta content="Join the Midjourney Discord Server!" property="og:title"/>,
 <meta content="https://discord.com/invite/midjourney" property="og:url"/>,
 <meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" property="og:description"/>,
 <meta content="Discord" property="og:site_name"/>,
 <meta content="https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512" property="og:image"/>,
 <meta content="image/jpeg" property="og:image:type"/>,
 <meta content="512" property="og:image:width"/>,
 <meta content="512" property="og:image:height"/>,
 <meta content="https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512" name="twitter:image"/>]

#2. extract all content of the meta tags, return list of text
content_only = [i.get('content') for i in soup.select('meta') if i.get('content')]

print(content_only)

['width=device-width, initial-scale=1.0, maximum-scale=3.0',
 'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
 'summary_large_image',
 '@discord',
 'Join the Midjourney Discord Server!',
 'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
 'Join the Midjourney Discord Server!',
 'https://discord.com/invite/midjourney',
 'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
 'Discord',
 'https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512',
 'image/jpeg',
 '512',
 '512',
 'https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512']

#3. extract the members data that you need
members_content_only = list(set([i.get('content') for i in soup.select('meta') if i.get('content') and 'members' in i.get('content')]))

print(members_content_only)

['The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members']
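
If only the number is needed, the description string above can be reduced to an integer. This is a minimal sketch, not part of the original answer: the helper name `parse_member_count` is hypothetical, and it assumes the description keeps the "| 2,473,729 members" shape shown in the output.

```python
import re
from typing import Optional


def parse_member_count(description: str) -> Optional[int]:
    """Return the number that precedes the word 'members', or None if absent.

    Hypothetical helper: assumes text shaped like '... | 2,473,729 members'.
    """
    match = re.search(r"(\d[\d,]*)\s+members\b", description, flags=re.IGNORECASE)
    if match is None:
        return None
    # Drop the thousands separators before converting to int
    return int(match.group(1).replace(",", ""))
```

With the list from step 3 this could be called as `parse_member_count(members_content_only[0])`. It returns `None` for the generic Discord description, which gives a simple way to tell that the wrong page variant came back.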
Votes: 1

Stack Overflow user

Answered 2022-09-23 04:37:31

There is indeed JavaScript at work underneath. I found a different way to extract this information, using selenium and bs4.

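from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.support.ui import WebDriverWait

options = FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)

url = "https://discord.com/invite/midjourney"
driver.get(url)

# WebDriverWait does nothing until .until() is called; wait up to 15 s for the
# rendered page text to contain "Members" (the marker used for slicing below)
WebDriverWait(driver, 15).until(
    lambda d: "Members" in d.find_element(By.TAG_NAME, "body").text
)

page = driver.page_source
driver.quit()  # close the browser once the HTML has been captured
html = bs(page, 'html.parser') #print(html)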

for script in html(["script", "style"]):
    script.extract()
text = html.get_text() 

lines = (line.strip() for line in text.splitlines())
text = '\n'.join(line for line in lines if line)

final_string = text.replace(",","")
start = final_string.find("Online")+6
end = final_string.find("Members")-1
subs = final_string[start:end]
subs_final = int(subs)
print(subs_final)

Output:

2496142

This is a roundabout way of getting what I want. If there is a more efficient way to do this, I would like to hear it.
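
One possibly less roundabout option, sketched under assumptions: read the `og:description` meta tag straight from the HTML instead of slicing the visible text between "Online" and "Members". The sketch below uses only the standard library's `html.parser` so it runs without extra packages; with bs4 the equivalent lookup is the `soup.find("meta", property="og:description")` call from the question. The names `MetaContentParser` and `og_description` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaContentParser(HTMLParser):
    """Keep the content of the first <meta> tag with the wanted property."""

    def __init__(self, wanted_property: str) -> None:
        super().__init__()
        self.wanted_property = wanted_property
        self.content: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />
        if tag != "meta" or self.content is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == self.wanted_property:
            self.content = attributes.get("content")


def og_description(html_text: str) -> Optional[str]:
    """Return the og:description content of an HTML document, or None."""
    parser = MetaContentParser("og:description")
    parser.feed(html_text)
    return parser.content
```

Assuming the HTML in hand contains the member count in that tag, as the page source quoted in the question does, this could be called as `og_description(page)` on the Selenium page source or `og_description(r.text)` on a `requests` response.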

Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/73820492
