文章/答案/技术大牛

发布

社区首页 >问答首页 >提取同级两个标记之间的内容

问提取同级两个标记之间的内容
EN

Stack Overflow用户

提问于 2018-05-21 12:00:50

回答 2查看 641关注 0票数 0

我试图在同一级别收集两个标记之间的内容，在本例中，以下两个h2标记之间的内容：

<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will beable to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>

理想情况下，我希望输出如下(即，理想情况下，<th>中的文本将被忽略，但我不介意它停留在这里)：

Plan for and be active in your own learning...
Reflect on your knowledge of teaching and yourself...
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience

到目前为止，这就是我所拥有的；

soup = BeautifulSoup(text)
output = ""
unitLO = soup.find(id="learning-outcomes")
tagBreak = unitLO.name
if unitLO:
    # we will loop until we hit the next tag with the same name as the
    # matched tag. eg if unitLO matches an H3, then all content up till the
    # next H3 is captured.
    for tag in unitLO.next_siblings:
        if tag.name == tagBreak:
            break
        else:
            output += str(tag)

print(output)

它提供以下输出，这是一个字符串；

>>> type(output)
<class 'str'>
>>>


<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will beable to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>

这和我想要的有点不同..。

我想出的唯一解决方案是推动output通过另一轮BeautifulSoup解析：

>>> moresoup = BeautifulSoup(output)
>>> for str in moresoup.strings:
...     print(str)
...






On successful completion of this unit, you will beableto:












Plan for and be active in your own learning...


Reflect on your knowledge of yourself....


Articulate your informed understanding of the foundations...


Demonstrate information literacy skills


Communicate in writing for an academic audience










>>>

这是真正的不雅，并导致大量的空白(当然，这是很容易清理)。

有什么更好的方法吗？

非常感谢！

python

beautifulsoup

回答 2

Stack Overflow用户

回答已采纳

发布于 2018-05-21 12:06:02

尝试使用soup.find_all获取所有p标记

Ex:

from bs4 import BeautifulSoup
s = """<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will beable to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>"""

soup = BeautifulSoup(s, "html.parser")
for p in soup.find(id="learning-outcomes").findNext("table").find_all("p"):
    print(p.text)

输出：

Plan for and be active in your own learning...
Reflect on your knowledge of yourself....
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience

票数 2

Stack Overflow用户

发布于 2018-05-21 12:05:02

更改以下代码

if unitLO:
    # we will loop until we hit the next tag with the same name as the
    # matched tag. eg if unitLO matches an H3, then all content up till the
    # next H3 is captured.
    for tag in unitLO.next_siblings:
        if tag.name == tagBreak:
            break
        else:
            if str(tag).strip() != "":
                output += str(tag)

print(output)

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/50448447

复制

相似问题

问提取同级两个标记之间的内容
EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问提取同级两个标记之间的内容EN

回答 2

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问提取同级两个标记之间的内容
EN