Q: Cutting/slicing an HTML document into pieces with BeautifulSoup?

Stack Overflow user

Asked on 2016-03-23 21:52:09

Answers: 2, Views: 3.7K, Followers: 0, Votes: 2

I have an HTML document like this:

<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2> 
<p>Html I do not want...</p>

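I don't need the references in the article, so I want to slice the document at the second h2 tag.

Obviously, I can get a list of the h2 tags like this: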

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
soupset = soup.find_all('h2')
soupset[1]  # this gets the h2 heading 'References', but not what comes before it

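I don't want a list of h2 tags. I want to slice the document at the second h2 tag and save everything above it in a new variable. Basically, the output I want is: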

<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>

What is the best way to do this kind of "slicing"/cutting of the document, as opposed to simply finding tags and outputting the tags themselves?

html-parsing

python

html

beautifulsoup

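2 Answers

Stack Overflow user

Accepted answer

Answered on 2016-03-23 21:59:55

You can remove/extract every sibling element that follows the "References" element, and then the element itself: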

import re
from bs4 import BeautifulSoup

data = """
<div>
    <h1> Name of Article </h1>
    <p>First Paragraph I want</p>
    <p>More Html I'm interested in</p>
    <h2> Subheading in the article I also want </h2>
    <p>Even more Html i want to pull out of the document.</p>
    <h2> References </h2>
    <p>Html I do not want...</p>
</div>
"""
soup = BeautifulSoup(data, "lxml")

references = soup.find("h2", string=re.compile("References"))  # string= replaces the deprecated text= argument
for elm in references.find_next_siblings():
    elm.extract()
references.extract()

print(soup)

Prints:

<div>
    <h1> Name of Article</h1>
    <p>First Paragraph I want</p>
    <p>More Html I'm interested in</p>
    <h2> Subheading in the article I also want </h2>
    <p>Even more Html i want to pull out of the document.</p>
</div>
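
If the original soup needs to stay intact, or the kept part should end up in a new variable as the question asks, a non-destructive variant is possible. The following is only a minimal sketch: the helper name `slice_before_heading` is hypothetical, and it assumes the built-in `html.parser` backend and that the wanted content consists of the siblings preceding the heading.

```python
import re
from bs4 import BeautifulSoup

html = """<div>
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
</div>"""


def slice_before_heading(markup, heading_text):
    """Return the markup of all siblings that precede the matching <h2>.

    Hypothetical helper for illustration; the parsed tree is not modified.
    """
    soup = BeautifulSoup(markup, "html.parser")
    heading = soup.find("h2", string=re.compile(heading_text))
    if heading is None:
        # Assumed fallback: nothing to cut, so return the input unchanged.
        return markup
    # previous_siblings walks backwards from the heading,
    # so reverse it to restore document order.
    kept = reversed(list(heading.previous_siblings))
    return "".join(str(node) for node in kept).strip()


sliced = slice_before_heading(html, "References")
print(sliced)
```

Here `sliced` holds the headings and paragraphs above "References" as a string, without the wrapping `<div>`, and nothing is removed from the parsed document.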

Votes: 1

Stack Overflow user

Answered on 2016-03-23 22:21:29

You can find the position of the h2 in the string, then take the substring up to it:

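last_h2_tag = str(soup.find_all("h2")[-1])
# Slice up to the start of the last <h2> so the 'References' heading itself is excluded.
# Note: this relies on str(tag) matching the raw markup exactly.
sliced = html[:html.rfind(last_h2_tag)]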
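
The string-slicing idea can also be sketched without depending on the parser's serialization matching the raw markup. The example below is an assumption-laden illustration, not part of the original answer: the helper name `cut_at_heading` is hypothetical, it locates the heading with a regular expression on the raw string, and it only handles simple `<h2>` tags whose content is plain text.

```python
import re

html = """<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want to pull out of the document.</p>
<h2> References </h2>
<p>Html I do not want...</p>
"""


def cut_at_heading(markup, heading_text):
    """Return everything in `markup` before the <h2> containing `heading_text`.

    Hypothetical helper; regex matching on HTML is fragile and only
    suitable for simple, well-known markup like this sample.
    """
    pattern = re.compile(
        r"<h2\b[^>]*>\s*" + re.escape(heading_text) + r"\s*</h2>",
        re.IGNORECASE,
    )
    match = pattern.search(markup)
    if match is None:
        # Assumed fallback: heading not found, return the input unchanged.
        return markup
    # Keep only the text before the heading's opening tag.
    return markup[:match.start()].rstrip()


result = cut_at_heading(html, "References")
print(result)
```

For anything beyond simple markup, working on the parsed tree is likely the more robust choice.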

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/36189381
