Q: Beautiful Soup: get the text (including HTML tags) between two different tags (</h3> and <h2>)

Stack Overflow user

Asked 2020-11-10 19:53:55

Answers: 1 · Views: 95 · Followers: 0 · Votes: 0

I am trying to scrape an HTML file like the one below using Beautiful Soup. Basically, each unit is structured as follows:

<h2></h2>

<h3></h3>

one or more <p></p>

For example:

<h2>January, 2020</h2>
<h3>facility</h3>
<p>text1-1</p>
<p>text1-2</p>

<h2>April, 2020</h2>
<h3>scientists</h3>
<p>text2-1</p>
<p>text2-2</p>

<h2>June, 2020</h2>
<h3>lawyers</h3>
<p>text3-1</p>

<h2>.....

I want to get the text, including the <p> tags, between each </h3> and the next <h2>. The result should be:

Row 1:

<p>text1-1</p>
<p>text1-2</p>

Row 2:

<p>text2-1</p>
<p>text2-2</p>

Row 3:

<p>text3-1</p>

Here is what I have tried so far:

num_h2 = len(soup.find_all('h2'))

for i in range(0,num_h2):
    print('---------')
    print(i) 

    p_string = ''
    sibling = soup.find_all('h3')[i].find_next_sibling('p').getText()

    if sibling:
        p_string += sibling
    else:
        break

    print(p_string)

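The problem with this solution is that it only shows the content of the first <p> under each unit. I don't know how to find out how many <p> tags there are so that I can loop over them. Also, is there a better way to do this than using find_next_sibling()?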

beautifulsoup

1 Answer

Stack Overflow user

Accepted answer

Posted 2020-11-10 21:23:17

Perhaps a CSS selector can help:

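for h3 in soup.select('h3'):
    # find_next_siblings() is the current name for the legacy fetchNextSiblings() alias
    for sibling in h3.find_next_siblings():
        # Stop as soon as the next unit begins
        if sibling.name == "h2":
            break
        if sibling.name == "p":
            print(sibling)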

Output:

<p>text1-1</p>
<p>text1-2</p>
<p>text2-1</p>
<p>text2-2</p>
<p>text3-1</p>
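
The loop above prints every <p> in one flat list, whereas the question asks for one row per unit. A minimal sketch of one way to group them is shown below. It assumes the sample HTML from the question and the built-in "html.parser"; the helper name paragraphs_per_unit and the variables html and rows are made up for illustration and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (the elided trailing unit is left out).
html = """
<h2>January, 2020</h2>
<h3>facility</h3>
<p>text1-1</p>
<p>text1-2</p>

<h2>April, 2020</h2>
<h3>scientists</h3>
<p>text2-1</p>
<p>text2-2</p>

<h2>June, 2020</h2>
<h3>lawyers</h3>
<p>text3-1</p>
"""

soup = BeautifulSoup(html, "html.parser")


def paragraphs_per_unit(soup):
    """Hypothetical helper: return one string per <h3>, holding the <p> tags
    that follow it up to (but not including) the next <h2>."""
    rows = []
    for h3 in soup.find_all("h3"):
        block = []
        # With no arguments, find_next_siblings() yields only Tag siblings.
        for sibling in h3.find_next_siblings():
            if sibling.name == "h2":
                break  # the next unit starts here
            if sibling.name == "p":
                block.append(str(sibling))  # str() keeps the <p> markup
        rows.append("\n".join(block))
    return rows


rows = paragraphs_per_unit(soup)
for number, row in enumerate(rows, start=1):
    print(f"Row {number}:")
    print(row)
```

Collecting str(sibling) rather than sibling.get_text() keeps the <p> tags in the result, which is what the question asked for. Because the inner loop stops at the first <h2>, there is no need to know in advance how many <p> tags a unit contains.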

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/64775864
