I am trying to extract the text content of an HTML tag that has nested content. I took the example below from another related question.
>>> from parsel import Selector
>>> sel = Selector(text='''
<p>
Senator <a href="/people/senator_whats_their_name">What's-their-name</a> is <em>furious</em> about politics!
</p>''')
>>>
>>> # Using XPath
... sel.xpath('normalize-space(//p)').extract_first()
"Senator What's-their-name is furious about politics!"
>>>
>>> # Using CSS
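... "".join(sel.css("p *::text").extract())
"Senator What's-their-name is furious about politics!"

This is close to what I want. However, I would like to exclude certain tags. For example, I want to leave the contents of the a tag out of the resulting string. I want: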
Senator is furious about politics!
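How can I get this result? I would prefer to keep using Scrapy / Parsel, but if that is not possible I am open to any other third-party library. Any help would be appreciated. Thanks!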
Posted on 2019-11-17 19:16:06

Here is a solution using BeautifulSoup.
from bs4 import BeautifulSoup as bsp
soup = bsp(''' <p>
Senator <a href="/people/senator_whats_their_name">What's-their-name</a> is <em>furious</em> about politics!
<h1> I want to be ignored</h1>
<h2> I should not be ignored</h2>.
</p>''', 'html.parser')
for tag in soup.find_all(['a', 'h1']):  # list the tags you want to ignore here
    tag.replace_with('')
print(soup.text)

Output:
Senator is furious about politics!
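I should not be ignored.

The code above removes the listed tags from the text entirely. If you want to keep the tags and only clear their string (text):

# tag.string is None when a tag has more than one child,
# so this assumes each listed tag contains a single string.
for tag in soup.find_all(['a', 'h1']):
    tag.string.replace_with('')
print(soup)

Output: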
<p>
Senator <a href="/people/senator_whats_their_name"></a> is <em>furious</em> about politics!
<h1></h1>
<h2> I should not be ignored</h2>.
</p>
https://stackoverflow.com/questions/58904013