Q: Extract text content from nested HTML while excluding some specific tags

Stack Overflow user

Asked 2019-11-17 18:52:20

Answers: 1, Views: 645, Followers: 0, Votes: 1

I am trying to extract the text content of an HTML tag that has nested content. I took this example from another related question, which can be seen here.

>>> from parsel import Selector
>>> sel = Selector(text='''
    <p>
        Senator <a href="/people/senator_whats_their_name">What&#39s-their-name</a> is <em>furious</em> about politics!
    </p>''')
>>>
>>> # Using XPath
... sel.xpath('normalize-space(//p)').extract_first()
"Senator What's-their-name is furious about politics!"
>>>
>>> # Using CSS
... "".join(sel.css("p *::text").extract())
"Senator What's-their-name is furious about politics!"

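This is close to what I want. However, I would like to exclude some specific tags. For example, I want to exclude the content of the a tag from the resulting string. I want: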

Senator is furious about politics!

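How can I achieve this? I would prefer to keep using Scrapy / Parsel to get the result, but if there is no solution with them, I am open to any other third-party library. Any help would be appreciated. Thanks!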

Tags: beautifulsoup, scrapy, python, css, xpath

1 Answer

Stack Overflow user

Accepted answer

Answered 2019-11-17 19:16:06

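Here is a working solution using beautifulsoup. You can probably find similar functions in scrapy or parsel and apply the same approach.

The idea is to replace the tags you want to ignore with '', so that their content no longer appears in the extracted text. Here is an example.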

from bs4 import BeautifulSoup as bsp

soup = bsp(''' <p>
        Senator <a href="/people/senator_whats_their_name">What&#39s-their-name</a> is <em>furious</em> about politics!
        <h1> I want to be ignored</h1>
        <h2> I should not be ignored</h2>.
    </p>''', 'html.parser')

for tag in soup.find_all(['a', 'h1']): # give the list of tags you want to ignore here.
    tag.replace_with('')

print(soup.text)

Output:

  Senator  is furious about politics!

 I should not be ignored.

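The code above removes all the tags you want to ignore from the text.

The following loop only changes the string (text) and keeps the tags. Run it on a freshly parsed soup, since the first loop has already removed those tags.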

for tag in soup.find_all(['a', 'h1']):
    tag.string.replace_with('')
print(soup)

Output:

 <p>
        Senator <a href="/people/senator_whats_their_name"></a> is <em>furious</em> about politics!
        <h1></h1>
<h2> I should not be ignored</h2>.
    </p>
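
One caveat with `tag.string.replace_with('')`: in BeautifulSoup, `tag.string` is `None` when a tag has more than one child, so the call would raise an `AttributeError` for tags with mixed content. A minimal sketch of an alternative, assuming you only need to empty the tag while keeping it in the tree, is `tag.clear()`. The sample HTML below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample: the <a> tag holds two children (a <b> tag and a text node).
html = '<p>Senator <a href="/x"><b>What</b>-name</a> is <em>furious</em>!</p>'
soup = BeautifulSoup(html, "html.parser")

# With more than one child, .string is None, so .string.replace_with() would fail.
string_before = soup.a.string

for tag in soup.find_all(["a"]):
    # clear() removes every child but keeps the tag and its attributes.
    tag.clear()

result = str(soup)
print(result)
```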

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/58904013
