
Extracting text from <p></p> with BeautifulSoup

Stack Overflow user
Asked on 2017-08-16 01:46:42
Answers: 1, Views: 92, Followers: 0, Votes: 0

I am trying to scrape a news article from this link. My code is:

import requests
from bs4 import BeautifulSoup

def get_news_details(news_url):
    source = requests.get(news_url)
    plain_text = source.text
    soup = BeautifulSoup(plain_text, "html.parser")
    content = soup.findAll('div', {'class' : 'big-img-box'})
    print(content[0].findAll('p'))

The output is:

[<p></p>, <p></p>, <p></p>, <p></p>, <p></p>, <p></p>]

The value of content is:

<div class="big-img-box">
<div class="left-imgs">
<figure>
<img alt="iOS developer hints possibility of 4K Apple TV" class="img-responsive" src="http://www.aninews.in/contentimages/detail/appletv.jpg"/>
<figcaption><span class="heading-inner-span"></span></figcaption>
</figure>
<div class="mb10"></div>
</div>
<p></p>      New York [USA], August 6 <a class="highlights" href="http://aninews.in/" target="_blank">(ANI)</a>: The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/4k-apple-tv.html"> 4K Apple TV</a></span> with high dynamic range (HDR)  support for both <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/hdr10.html"> HDR10  </a></span> and <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/dolby-vision.html"> Dolby Vision</a></span>.<p></p>      While the current range of Apple's TV set-top box is incompatible to 4K technology, <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/ios.html">iOS</a></span> developer <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/guilherme-rambo.html"> Guilherme Rambo</a></span> revealed that the company is hinting an adoption of the ultra high-definition format, reports <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/the-verge.html">The Verge</a></span>.<p></p>      Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year.<p></p>      It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/netflix.html"> Netflix</a></span> and <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/amazon.html"> Amazon</a></span> support the two high-definition formats.<p></p>      Last month, iTunes started listing movies as supporting 4K and <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/hdr.html"> HDR</a></span> in users' purchase histories, thus providing more thrust to 
the speculations of the 4K <span class="highlights"><a href="http://aninews.in/keysearch/keyword-search/apple.html"> Apple</a></span> TV. <a class="highlights" href="http://aninews.in/" target="_blank">(ANI)</a><p></p>
</div>

I can get a somewhat clumsy version of the article from content[0].text, but I cannot format it.

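When I inspect the page in Chrome, the article appears to be written inside <p>article_text</p> tags. In content, however, it shows up as <p></p>article_text, with the text outside the tags. If the former structure were in the soup, I could get the output I want. What should I do?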
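To illustrate the structure described in the question: when the markup is `<p></p>text`, the parsed tree holds the text as a sibling that follows each empty `<p>`, not as its child, which is why `findAll('p')` returns only empty tags. A minimal sketch, assuming a shortened, hypothetical snippet in place of the real page:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the real page markup.
html = (
    '<div class="big-img-box">'
    '<p></p> First paragraph.<p></p> Second paragraph.'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", {"class": "big-img-box"})

# Each <p> is empty; the article text is the node that follows it.
siblings = [str(p.next_sibling).strip() for p in box.find_all("p")]

for p, text in zip(box.find_all("p"), siblings):
    print(repr(p.get_text()), "->", repr(text))
```

This only demonstrates where the text lives in the tree; the real page may contain inline tags between the markers, so a single `next_sibling` would not capture a whole paragraph there.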


1 Answer

Stack Overflow user

Answered on 2017-08-16 02:22:48

It depends on what you mean by formatting. You can make the text "tidier" in a fairly simple way.

>>> import bs4
>>> import requests
>>> page = requests.get('http://www.aninews.in/newsdetail-Nw/MzI4NDIy/ios-developer-hints-possibility-of-4k-apple-tv.html').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> big_img_box = soup.select('.big-img-box')

Get all the text and strip the leading and trailing whitespace.

>>> big_img_box[0].text.strip()
"New York [USA], August 6 (ANI): The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a  4K Apple TV with high dynamic range (HDR)  support for both  HDR10   and  Dolby Vision.      While the current range of Apple's TV set-top box is incompatible to 4K technology, iOS developer  Guilherme Rambo revealed that the company is hinting an adoption of the ultra high-definition format, reports The Verge.      Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year.      It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like  Netflix and  Amazon support the two high-definition formats.      Last month, iTunes started listing movies as supporting 4K and  HDR in users' purchase histories, thus providing more thrust to the speculations of the 4K  Apple TV. (ANI)"

Building on that, collapse every run of two or more internal whitespace characters into a single space.

>>> import re
>>> re.sub(r'\s{2,}', ' ', big_img_box[0].text.strip())
"New York [USA], August 6 (ANI): The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a 4K Apple TV with high dynamic range (HDR) support for both HDR10 and Dolby Vision. While the current range of Apple's TV set-top box is incompatible to 4K technology, iOS developer Guilherme Rambo revealed that the company is hinting an adoption of the ultra high-definition format, reports The Verge. Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year. It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like Netflix and Amazon support the two high-definition formats. Last month, iTunes started listing movies as supporting 4K and HDR in users' purchase histories, thus providing more thrust to the speculations of the 4K Apple TV. (ANI)"
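The regex above yields one long string, so the paragraph breaks are lost. If separate paragraphs are wanted, one possible approach is to treat each empty `<p>` as a boundary marker and group the nodes between markers. The sketch below is not part of the original answer; `split_paragraphs` is a hypothetical helper, the HTML is a shortened stand-in for the page, and it assumes the `<p>` markers are direct children of the container, as in the `content` dump in the question.

```python
import re
from bs4 import BeautifulSoup, Tag

def split_paragraphs(container):
    """Return one cleaned string per run of content between <p> markers."""
    paragraphs, current = [], []
    for node in container.children:
        if isinstance(node, Tag) and node.name == "p":
            # An empty <p> marks a paragraph boundary: flush what we have.
            paragraphs.append("".join(current))
            current = []
        elif isinstance(node, Tag):
            current.append(node.get_text())   # inline tag such as <a> or <span>
        else:
            current.append(str(node))         # plain text node
    paragraphs.append("".join(current))
    # Collapse whitespace runs, then drop paragraphs that end up empty.
    cleaned = (re.sub(r"\s+", " ", p).strip() for p in paragraphs)
    return [p for p in cleaned if p]

# Shortened, hypothetical stand-in for the real page markup.
html = (
    '<div class="big-img-box"><div class="left-imgs"></div>'
    '<p></p>      New York, August 6 <a href="#">(ANI)</a>: first  paragraph.'
    '<p></p>      Second <span><a href="#"> paragraph</a></span>.<p></p></div>'
)

soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", class_="big-img-box")
paragraphs = split_paragraphs(box)
print("\n\n".join(paragraphs))
```

Joining the result with blank lines gives a readable article body. If the site's markup differs from the dump shown in the question, the grouping logic would need to be adjusted.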
Votes: 2
Original content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/45698505
