Q: Unable to access HTML child tags with BeautifulSoup

Stack Overflow user

Asked 2021-08-02 05:31:20

Answers: 1, Views: 85, Followers: 0, Votes: 0

I'm trying to scrape article metadata from CNN's website. Their "top stories" section sits under a tag that begins like this:

<section class="zn zn-homepage1-zone-1....

Under that section, each article sits in a tag like this:

<article class="cd cd--card cd--article....

On similar sites, I can access the "top stories" section like this:

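import requests
from bs4 import BeautifulSoup

# Assumed placeholder: the question does not show how headers was defined
headers = {"User-Agent": "Mozilla/5.0"}

cnnUrl = "https://www.cnn.com"
cnnSoup = BeautifulSoup(requests.get(cnnUrl, headers=headers).content, "html.parser")

homepageZone1 = '[class*="zn zn-homepage1-zone-1"]'

for item in cnnSoup.select(homepageZone1):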

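...and the for loop lets me access the child tags, where I can collect the data I need. Once I have item, I can usually do something like this for a CNN headline (the format sometimes varies):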

headline = item.find('h2').get_text()

where headline would be (for now):

-- A petri dish for the country

In this case, however, I get None for the homepageZone1 tag. I tried going up to the parent div of homepageZone1:

cnnEverything = '[class*="pg-no-rail pg-wrapper"]'

for item in cnnSoup.select(cnnEverything):

Here item gives me the following child tags, but none of them is the child tag I'm trying to reach:

<div class="pg-no-rail pg-wrapper"><div class="pg__background__image_wrapper"></div><div class="l-container"></div><section class="zn--idx-0 zn-empty"> </section><section class="zn--idx-1 zn-empty"> </section><section class="zn--idx-2 zn-empty"> </section><section class="zn--idx-3 zn-empty"> </section><section class="zn--idx-4 zn-empty"> </section><section class="zn--idx-5 zn-empty"> </section><section class="zn--idx-6 zn-empty"> </section><section class="zn--idx-7 zn-empty"> </section><section class="zn--idx-8 zn-empty"> </section><section class="zn--idx-9 zn-empty"> </section><section class="zn--idx-10 zn-empty"> </section><div class="ad ad--epic ad--all t-dark"><div class="ad-ad_bnr_btf_02 ad-refresh-adbody" data-ad-id="ad_bnr_btf_02" id="ad_bnr_btf_02"></div></div></div>

What am I missing?

python

html

web-scraping

beautifulsoup

1 Answer

Stack Overflow user

Answered 2021-08-13 11:08:17

I think the HTML you need is fetched in a separate request and then added to the main HTML with JavaScript (which is why you don't see it).
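One way to check this, as a minimal sketch, is to parse the static markup and see whether the zone sections contain anything at all. The snippet below works on a shortened copy of the markup quoted in the question rather than a live request; `STATIC_HTML` and `find_empty_zones` are illustrative names, not part of any library.

```python
from bs4 import BeautifulSoup

# Shortened copy of the static markup quoted in the question (assumption:
# it is representative of what requests receives before any JavaScript runs).
STATIC_HTML = (
    '<div class="pg-no-rail pg-wrapper">'
    '<section class="zn--idx-0 zn-empty"> </section>'
    '<section class="zn--idx-1 zn-empty"> </section>'
    '</div>'
)


def find_empty_zones(html):
    """Return the class lists of <section> tags that contain no child tags."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        section.get("class", [])
        for section in soup.find_all("section")
        if section.find(True) is None  # no nested tags at all
    ]


soup = BeautifulSoup(STATIC_HTML, "html.parser")
# The selector from the question matches nothing in the static markup.
print(soup.select('[class*="zn zn-homepage1-zone-1"]'))
# Each zone placeholder is empty until the page's scripts fill it in.
print(find_empty_zones(STATIC_HTML))
```

If every zone comes back empty like this, the content is most likely loaded by a later request, and no selector applied to the initial response will find it.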

The following shows how to request the international version and pull the HTML out of the JSON that is returned:

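from bs4 import BeautifulSoup
import requests

# International version: the zone's markup is served from a separate endpoint
r = requests.get("https://edition.cnn.com/data/ocs/section/index.html:intl_homepage1-zone-1/views/zones/common/zone-manager.izl")
json_data = r.json()
# The HTML fragment for the zone sits under the 'html' key of the JSON response
html = json_data['html'].replace(r'\"', '"')
cnnSoup = BeautifulSoup(html, 'html.parser')

# Print the text of every <h2> and <h3> heading in the fragment
for heading in cnnSoup.find_all(['h2', 'h3']):
    print(heading.text)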

This gives you the following headlines:

Kandahar falls to Taliban
Militants take control of Afghanistan's second-largest city during an unrelenting sweep of the country, weeks before US troops are due to complete withdrawal
LIVE: UK defense chief worried about potential return of al Qaeda
Video allegedly shows Taliban celebrating after Kandahar gain
Afghanistan's quick unraveling threatens to stain Biden's legacy
...

The URL was found by looking at the requests the browser makes while loading the page.
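The parsing step can also be separated from the network call, so it can be tried without hitting CNN. The sketch below assumes the endpoint returns a JSON object with an `html` key, as the answer's code implies; `extract_headings` and `SAMPLE_PAYLOAD` are hypothetical names, and the sample payload is made up for illustration.

```python
import json

from bs4 import BeautifulSoup


def extract_headings(payload_text):
    """Parse a JSON payload of the form {"html": "..."} and return heading texts."""
    data = json.loads(payload_text)
    soup = BeautifulSoup(data["html"], "html.parser")
    return [h.get_text(strip=True) for h in soup.find_all(["h2", "h3"])]


# Made-up payload shaped like the response described in the answer.
SAMPLE_PAYLOAD = json.dumps(
    {"html": '<article class="cd"><h2>First headline</h2><h3 class="x">Second headline</h3></article>'}
)

print(extract_headings(SAMPLE_PAYLOAD))
```

With a live response, the same function could be called as `extract_headings(r.text)`. Since JSON decoding already turns `\"` escapes into plain quotes, an extra replace step should only matter if the payload turns out to be double-escaped.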

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/68616752
