Q: Using Python and BeautifulSoup, select only text nodes that are not wrapped in <a>

Stack Overflow user

Asked 2015-10-03 19:07:41

Answers: 1 | Views: 1.1K | Followers: 0 | Votes: 2

I am trying to parse some text so that I can urlize (wrap in <a> tags) the links that are not already formatted. Here is some sample text:

text = '<p>This is a <a href="https://google.com">link</a>, this is also a link where the text is the same as the link: <a href="https://google.com">https://google.com</a>, and this is a link too but not formatted: https://google.com</p>'

Here is what I have so far:

from django.utils.html import urlize
from bs4 import BeautifulSoup

...

def urlize_html(text):

    soup = BeautifulSoup(text, "html.parser")

    textNodes = soup.findAll(text=True)
    for textNode in textNodes:
        urlizedText = urlize(textNode)
        textNode.replaceWith(urlizedText)

    return str(soup)

However, this also catches the middle link in the example, so it ends up double-wrapped in <a> tags. The result looks like this:

<p>This is a <a href="https://djangosnippets.org/snippets/2072/" target="_blank">link</a>, this is also a link where the test is the same as the link: <a href="https://djangosnippets.org/snippets/2072/" target="_blank">&lt;a href="https://djangosnippets.org/snippets/2072/"&gt;https://djangosnippets.org/snippets/2072/&lt;/a&gt;</a>, and this is a link too but not formatted: &lt;a href="https://djangosnippets.org/snippets/2072/"&gt;https://djangosnippets.org/snippets/2072/&lt;/a&gt;</p>

What can I do to textNodes = soup.findAll(text=True) so that it only includes text nodes that are not already wrapped in <a> tags?
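
A side note on the output above, offered as a minimal sketch rather than a diagnosis of the asker's exact setup: the `&lt;a ...&gt;` sequences are consistent with how BeautifulSoup handles a plain string passed to `replace_with`. The string becomes a text node, so any markup inside it is escaped on output. Parsing the string into a fragment first keeps it as real tags. The sample markup below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, only for illustration.
soup = BeautifulSoup("<p>hello</p>", "html.parser")

# A plain str is inserted as text, so its angle brackets are escaped on output.
soup.p.string.replace_with("<b>hi</b>")
escaped = str(soup)

soup = BeautifulSoup("<p>hello</p>", "html.parser")
# Parsing the replacement first inserts real elements instead of text.
fragment = BeautifulSoup("<b>hi</b>", "html.parser")
soup.p.string.replace_with(fragment)
parsed = str(soup)

print(escaped)
print(parsed)
```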

python

beautifulsoup

1 Answer

Stack Overflow user

Accepted answer

Posted 2015-10-03 19:09:53

Text nodes keep a reference to their parent, so you can simply test for the a tag:

for textNode in textNodes:
    if textNode.parent and getattr(textNode.parent, 'name') == 'a':
        continue  # skip links
    urlizedText = urlize(textNode)
    textNode.replaceWith(urlizedText)
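
For readers who want to try the idea end to end, here is a minimal, self-contained sketch under a few assumptions. It does not use Django: `URL_RE` is a hypothetical, deliberately simple stand-in for `urlize` that only recognises bare `http(s)` URLs and does not trim trailing punctuation. It also uses `find_parent("a")` instead of checking only the direct parent, so text nested deeper inside a link (for example inside `<b>` within `<a>`) is skipped too, and it builds real `<a>` tags rather than inserting a string, so the new markup is not escaped.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for django.utils.html.urlize: matches bare http(s) URLs.
# Simplistic on purpose; trailing punctuation is not stripped.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def urlize_html(html):
    soup = BeautifulSoup(html, "html.parser")
    # Copy the result list first, because the tree is modified inside the loop.
    for node in list(soup.find_all(string=True)):
        # Skip text that already sits anywhere inside an <a> element.
        if node.find_parent("a") is not None:
            continue
        text = str(node)
        if not URL_RE.search(text):
            continue
        # Split the text into plain pieces and newly built <a> tags.
        pieces = []
        pos = 0
        for match in URL_RE.finditer(text):
            if match.start() > pos:
                pieces.append(NavigableString(text[pos:match.start()]))
            link = soup.new_tag("a", href=match.group(0))
            link.string = match.group(0)
            pieces.append(link)
            pos = match.end()
        if pos < len(text):
            pieces.append(NavigableString(text[pos:]))
        # Swap the original text node for the pieces, keeping their order.
        node.replace_with(pieces[0])
        previous = pieces[0]
        for piece in pieces[1:]:
            previous.insert_after(piece)
            previous = piece
    return str(soup)

sample = (
    '<p>This is a <a href="https://google.com">link</a>, this is also a link '
    'where the text is the same as the link: '
    '<a href="https://google.com">https://google.com</a>, and this is a link '
    'too but not formatted: https://google.com</p>'
)
print(urlize_html(sample))
```

With the sample text from the question, only the last, unformatted URL should gain an `<a>` wrapper; the two existing links are left untouched.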

Votes: 5

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/32926395
