I am trying to get the metadata content from a website. Here is the code:
import requests
from bs4 import BeautifulSoup
url = "https://discord.com/invite/midjourney"
result = requests.get(url=url)
soup = BeautifulSoup(result.content, 'html5lib')
target = soup.find("meta", property="og:description")
print(target)
This returns:
<meta content="Discord is the easiest way to communicate over voice, video, and text. Chat, hang out, and stay close with your friends and communities." property="og:description"/>
However, when I view the page source, the content is different: it includes the number of members. The member count is what I am looking for.
<meta property="og:description" content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,472,611 members" />
Is there some kind of script that changes the meta content dynamically? Any ideas on how to get the actual data from the meta tag?
Posted on 2022-09-22 20:51:46
Try:
import requests
from bs4 import BeautifulSoup
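url = "https://discord.com/invite/midjourney"  # same URL as in the question
# Send a browser-like User-Agent; this appears to be what makes the server return the server-specific meta tags
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(url, timeout=30, headers=headers)  # print(r.status_code)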
soup = BeautifulSoup(r.content, 'html.parser')
#1. extract all meta tags from the page, return list of tags
print(soup.select('meta'))
[<meta charset="utf-8"/>,
<meta content="width=device-width, initial-scale=1.0, maximum-scale=3.0" name="viewport"/>,
<meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" name="description"/>,
<meta content="summary_large_image" name="twitter:card"/>,
<meta content="@discord" name="twitter:site"/>,
<meta content="Join the Midjourney Discord Server!" name="twitter:title"/>,
<meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" name="twitter:description"/>,
<meta content="Join the Midjourney Discord Server!" property="og:title"/>,
<meta content="https://discord.com/invite/midjourney" property="og:url"/>,
<meta content="The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members" property="og:description"/>,
<meta content="Discord" property="og:site_name"/>,
<meta content="https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512" property="og:image"/>,
<meta content="image/jpeg" property="og:image:type"/>,
<meta content="512" property="og:image:width"/>,
<meta content="512" property="og:image:height"/>,
<meta content="https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512" name="twitter:image"/>]
#2. extract all content of the meta tags, return list of text
content_only = [i.get('content') for i in soup.select('meta') if i.get('content')]
print(content_only)
['width=device-width, initial-scale=1.0, maximum-scale=3.0',
'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
'summary_large_image',
'@discord',
'Join the Midjourney Discord Server!',
'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
'Join the Midjourney Discord Server!',
'https://discord.com/invite/midjourney',
'The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members',
'Discord',
'https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512',
'image/jpeg',
'512',
'512',
'https://cdn.discordapp.com/splashes/662267976984297473/4798759e115d2500fef16347d578729a.jpg?size=512']
#3. extract the members data that you need
members_content_only = list(set([i.get('content') for i in soup.select('meta') if i.get('content') and 'members' in i.get('content')]))
print(members_content_only)
['The official server for Midjourney, a text-to-image AI where your imagination is the only limit. | 2,473,729 members']
Posted on 2022-09-23 04:37:31
There is indeed JavaScript at work behind the page. I found a different way to extract this information, using selenium and bs4.
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import requests
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.support.ui import WebDriverWait
options = FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
url = "https://discord.com/invite/midjourney"
driver.get(url)
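# WebDriverWait only waits when .until() is called; allow up to 15 s for the member count text to render
WebDriverWait(driver, 15).until(lambda d: "Members" in d.page_source)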
page = driver.page_source
html = bs(page, 'html.parser') #print(html)
# Drop script and style elements so that only the visible text remains
for script in html(["script", "style"]):
    script.extract()
text = html.get_text()
lines = (line.strip() for line in text.splitlines())
text = '\n'.join(line for line in lines if line)
final_string = text.replace(",","")
start = final_string.find("Online")+6
end = final_string.find("Members")-1
subs = final_string[start:end]
subs_final = int(subs)
print(subs_final)
Output:
2496142
This is a roundabout way of getting what I wanted. If there is a more efficient way to do this, I would be glad to hear it.
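One possibly lighter route, building on the first answer: once the og:description text contains the member count, the number can be pulled out with a regular expression instead of slicing the rendered page text. The sketch below is only an illustration. The names `parse_member_count` and `MEMBERS_RE` are hypothetical, and the "| N members" wording is assumed from the meta tags quoted in this thread, so it may break if Discord changes that text.

```python
import re
from typing import Optional

# Assumed format, taken from the tags quoted above: "... | 2,472,611 members"
MEMBERS_RE = re.compile(r"(\d[\d,]*)\s+members", re.IGNORECASE)


def parse_member_count(description: str) -> Optional[int]:
    """Return the member count found in a description string, or None if absent."""
    match = MEMBERS_RE.search(description)
    if match is None:
        return None
    # Strip the thousands separators before converting to int
    return int(match.group(1).replace(",", ""))


# Offline check with the two descriptions quoted in the question
with_count = ("The official server for Midjourney, a text-to-image AI where "
              "your imagination is the only limit. | 2,472,611 members")
generic = "Discord is the easiest way to communicate over voice, video, and text."

print(parse_member_count(with_count))  # 2472611
print(parse_member_count(generic))     # None
```

With the first answer's code, the input could be an element of `members_content_only`; with the question's code, it could be `target["content"]`, provided the request returned the server-specific tag.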
https://stackoverflow.com/questions/73820492