Extracting content between two tags at the same level

Stack Overflow user
Asked 2018-05-21 12:00:50
Answers: 2, Views: 641, Followers: 0, Votes: 0

I am trying to collect the content between two tags at the same level, in this case the content between the following two h2 tags:

<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will be able to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>

Ideally, I would like the output to look like this (the text in the <th> would preferably be ignored, but I don't mind if it stays):

Plan for and be active in your own learning...
Reflect on your knowledge of teaching and yourself...
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience

This is what I have so far:

from bs4 import BeautifulSoup

# `text` holds the HTML shown above
soup = BeautifulSoup(text, "html.parser")
output = ""
unitLO = soup.find(id="learning-outcomes")
if unitLO:
    # read the tag name only after confirming that the tag was found
    tagBreak = unitLO.name
    # we will loop until we hit the next tag with the same name as the
    # matched tag. eg if unitLO matches an H3, then all content up till the
    # next H3 is captured.
    for tag in unitLO.next_siblings:
        if tag.name == tagBreak:
            break
        output += str(tag)

print(output)

It gives the following output, which is a string:

>>> type(output)
<class 'str'>
>>>


<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will be able to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>

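This is not quite what I want: it is the markup, not just the text.

The only solution I have come up with is to push output through another round of BeautifulSoup parsing: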

>>> moresoup = BeautifulSoup(output, "html.parser")
>>> for s in moresoup.strings:  # `s` avoids shadowing the built-in str
...     print(s)
...






On successful completion of this unit, you will be able to:












Plan for and be active in your own learning...


Reflect on your knowledge of yourself....


Articulate your informed understanding of the foundations...


Demonstrate information literacy skills


Communicate in writing for an academic audience










>>>

This is really inelegant and produces a lot of whitespace (which is, of course, easy to clean up).

Is there a better way?

Many thanks!


2 Answers

Stack Overflow user

Accepted answer

Posted 2018-05-21 12:06:02

Try using soup.find_all to get all the p tags.

Example:

from bs4 import BeautifulSoup
s = """<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will be able to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>"""

soup = BeautifulSoup(s, "html.parser")
for p in soup.find(id="learning-outcomes").find_next("table").find_all("p"):
    print(p.text)

Output:

Plan for and be active in your own learning...
Reflect on your knowledge of yourself....
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience
Votes: 2
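The accepted answer relies on a table directly following the heading. As a hedged sketch that keeps the question's "stop at the next tag with the same name" rule while still returning only the p text, the sibling walk and find_all can be combined. The names `paragraphs_between` and `demo_html` are hypothetical and used only for illustration; the markup is a shortened stand-in for the HTML in the question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical stand-in for the markup in the question.
demo_html = """<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead><tr><th>You will be able to:</th></tr></thead>
<tbody><tr><td><ol>
<li><p>Plan your learning</p></li>
<li><p>Write for an academic audience</p></li>
</ol></td></tr></tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>
<p>Not part of the section</p>"""


def paragraphs_between(soup, start_id):
    """Return the text of every <p> found between the element with
    id=start_id and the next sibling that has the same tag name."""
    start = soup.find(id=start_id)
    if start is None:
        return []
    lines = []
    for sibling in start.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip whitespace and other bare text nodes
        if sibling.name == start.name:
            break  # reached the next heading of the same level
        if sibling.name == "p":
            lines.append(sibling.get_text(strip=True))
        # find_all only searches descendants, hence the check above
        lines.extend(p.get_text(strip=True) for p in sibling.find_all("p"))
    return lines


soup = BeautifulSoup(demo_html, "html.parser")
for line in paragraphs_between(soup, "learning-outcomes"):
    print(line)
```

Because only p elements are collected, the th text is ignored, and nothing after the second h2 is visited.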

Stack Overflow user

Posted 2018-05-21 12:05:02

Change the loop as follows so that whitespace-only siblings are skipped:

if unitLO:
    # we will loop until we hit the next tag with the same name as the
    # matched tag. eg if unitLO matches an H3, then all content up till the
    # next H3 is captured.
    for tag in unitLO.next_siblings:
        if tag.name == tagBreak:
            break
        else:
            if str(tag).strip() != "":
                output += str(tag)

print(output)
Votes: 0
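The loop above still appends markup, because str(tag) serializes each sibling as HTML. A possible variation, sketched here under the assumption that plain text lines are the goal, collects each sibling's stripped_strings instead, which avoids both the second parse and the blank lines. `text_between` and `demo_html` are hypothetical names, and the markup is a shortened stand-in for the HTML in the question.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the markup in the question.
demo_html = """<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead><tr><th>You will be able to:</th></tr></thead>
<tbody><tr><td><ol>
<li><p>Plan your learning</p></li>
<li><p>Write for an academic audience</p></li>
</ol></td></tr></tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>"""


def text_between(soup, start_id):
    """Return the non-empty text lines between the element with
    id=start_id and the next sibling that has the same tag name."""
    start = soup.find(id=start_id)
    if start is None:
        return []
    lines = []
    for sibling in start.next_siblings:
        name = getattr(sibling, "name", None)
        if name == start.name:
            break  # stop at the next tag with the same name
        if name is None:
            # bare text node: keep it only if it is not just whitespace
            text = str(sibling).strip()
            if text:
                lines.append(text)
        else:
            # tag: stripped_strings yields its text pieces, whitespace removed
            lines.extend(sibling.stripped_strings)
    return lines


soup = BeautifulSoup(demo_html, "html.parser")
print("\n".join(text_between(soup, "learning-outcomes")))
```

Unlike a p-only search, this keeps the th text as the first line, which the question says is acceptable.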
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/50448447
