Q: Removing tags with bs4

Stack Overflow user

Asked 2018-11-03 05:46:07

Answers 1 · Views 99 · Followers 0 · Votes 0

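I'm trying to retrieve the information inside the p tags, and I don't want anything else. How can I do this? This is what I have so far. I'm getting additional information that I don't need.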

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.theguardian.com/world/2016/jun/30/mexican-woman-117-years-old-dies-birth-certificate')
soup = BeautifulSoup(page.text, 'html.parser')
#soup.i.decompose()

content_list = soup.find('body')
# Collect every <p> tag inside <body>
content_list_items = content_list.find_all('p')

for content_list in content_list_items:
    # prettify() prints the markup itself, tags included
    print(content_list.prettify())

python

python-3.x

beautifulsoup

1 Answer

Stack Overflow user

Answered 2018-11-03 06:23:55

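I'm not sure what you mean by the "additional information" that you get but don't need. You can get the plain text without any HTML markup by using the text attribute, e.g. content_list.text. If that is not what you want, please elaborate on your question: what result do you expect?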

import requests
from bs4 import BeautifulSoup, NavigableString

page = requests.get('https://www.theguardian.com/world/2016/jun/30/mexican-woman-117-years-old-dies-birth-certificate')
soup = BeautifulSoup(page.text, 'html.parser')

content_list_items = soup.body.find_all('p')    

for content_list in content_list_items:
    txt = content_list if type(content_list) == NavigableString else content_list.text
    print(txt)
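To see what the text attribute does without fetching the live page, here is a minimal sketch on an inline HTML string. The sample markup and the name paragraph_texts are made up for illustration; they are not taken from the Guardian article.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the real page.
html = (
    "<body>"
    "<p>Hello <a href='#'>linked</a> world</p>"
    "<div>skip me</div>"
    "<p>Second <b>bold</b> paragraph</p>"
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

# .text (equivalent to get_text()) concatenates every string inside the tag,
# including strings in nested tags, and leaves out the markup itself.
paragraph_texts = [p.text for p in soup.body.find_all("p")]

for text in paragraph_texts:
    print(text)
```

Only the two p elements contribute here; the div is skipped because find_all('p') never selects it.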

Edit

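Based on this solution (How to remove content in nested tags with BeautifulSoup?), you can iterate over the children and select only those of type NavigableString. However, for your specific example this will also remove the link text inside anchor tags. For example, you get the sentence "A 117-year-old woman in City has finally received her birth certificate...", whereas the original sentence is "A 117-year-old woman in Mexico City has finally received her birth certificate...".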

content_list_items = soup.body.find_all('p')

for content_list in content_list_items:
    for child in content_list.children:
        if type(child) == NavigableString:
            print(child.strip())

Votes 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/53126188
