I'm trying to collect the content between two tags at the same level, in this case the content between the following two h2 tags:
<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will beable to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>
Ideally, I'd like the output to look like this (ideally the text in the <th> would be ignored, but I don't mind if it stays):
Plan for and be active in your own learning...
Reflect on your knowledge of teaching and yourself...
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience
This is what I have so far:
soup = BeautifulSoup(text)
output = ""
unitLO = soup.find(id="learning-outcomes")
tagBreak = unitLO.name
if unitLO:
    # we will loop until we hit the next tag with the same name as the
    # matched tag. eg if unitLO matches an H3, then all content up till the
    # next H3 is captured.
    for tag in unitLO.next_siblings:
        if tag.name == tagBreak:
            break
        else:
            output += str(tag)
print(output)
It gives the following output, which is a string:
>>> type(output)
<class 'str'>
>>>
<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will beable to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>
That's a bit different from what I want...
The only solution I've come up with is to push output through another round of BeautifulSoup parsing:
>>> moresoup = BeautifulSoup(output)
>>> for string in moresoup.strings:
...     print(string)
...
On successful completion of this unit, you will beable to:
Plan for and be active in your own learning...
Reflect on your knowledge of yourself....
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience
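>>>
This is really inelegant, and it produces a lot of whitespace (which is easy enough to clean up, of course).
Is there a better way to do this?
Thanks very much!
Posted on 2018-05-21 12:06:02
Try using soup.find_all to get all the p tags inside the table that follows the heading.
Example: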
from bs4 import BeautifulSoup
s = """<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will beable to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>"""
soup = BeautifulSoup(s, "html.parser")
for p in soup.find(id="learning-outcomes").find_next("table").find_all("p"):
    print(p.text)
Output:
Plan for and be active in your own learning...
Reflect on your knowledge of yourself....
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience
Posted on 2018-05-21 12:05:02
Change the code as follows, so that whitespace-only siblings are skipped:
if unitLO:
    # we will loop until we hit the next tag with the same name as the
    # matched tag. eg if unitLO matches an H3, then all content up till the
    # next H3 is captured.
    for tag in unitLO.next_siblings:
        if tag.name == tagBreak:
            break
        else:
            if str(tag).strip() != "":
                output += str(tag)
print(output)
https://stackoverflow.com/questions/50448447
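For comparison, here is a minimal sketch that keeps the sibling walk from the question but reads the text directly from each sibling tag, so the collected markup never has to be parsed a second time. The function name text_between and its parameters are hypothetical, and the sketch assumes the wanted text sits in li elements, as in the sample HTML (shortened here).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def text_between(html, start_id, item_tag="li"):
    """Collect the text of every `item_tag` element found between the element
    with id `start_id` and its next sibling that has the same tag name.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find(id=start_id)
    if start is None:
        return []

    lines = []
    for sibling in start.next_siblings:
        # Whitespace between tags appears as NavigableString nodes; skip them.
        if not isinstance(sibling, Tag):
            continue
        # Stop at the next heading of the same kind (e.g. the next h2).
        if sibling.name == start.name:
            break
        # get_text(strip=True) removes the surrounding whitespace.
        for item in sibling.find_all(item_tag):
            lines.append(item.get_text(strip=True))
    return lines


sample = """<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead><tr class="header"><th>On successful completion...</th></tr></thead>
<tbody><tr class="odd"><td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
</ol></td></tr></tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>
<ul><li>Not part of the learning outcomes</li></ul>"""

for line in text_between(sample, "learning-outcomes"):
    print(line)
```

Because only li elements are read, the <th> text is ignored, and the loop stops at the next h2, so the list under "Prior knowledge" is left out.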