I wrapped the process of extracting text from urls into a function:
def text(link):
    article = Article(link)
    article.download()
    article.parse()
    return article

I plan to apply this function to a pandas column:
df['text'] = df['links'].apply(text)

However, some of the links in the links column are broken (i.e. HTTPError: HTTP Error 404: Not Found). So my question is: how can I put NaN in the cells for the broken urls and skip over them? I tried:
from newspaper import Article
import numpy as np
import requests
def text(link):
    article = Article(link)
    try:
        article.download()
        article.parse()
    except requests.exceptions.HTTPError:
        return np.nan
    return article

df['text'] = df['links'].apply(text)

However, I don't know whether apply() can be handled this way, so that NaN values end up in the cells whose links are broken.
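As a side note on the apply() mechanics: whatever the function returns, including np.nan, is written into the corresponding cell, so returning NaN from the except branch is all that is needed. A minimal sketch with a toy stand-in for the network call (the fetch function and the sample links are invented for illustration):

```python
import numpy as np
import pandas as pd

def fetch(link):
    # Toy stand-in for the real text() function: treat links
    # ending in "404" as broken and return NaN for them.
    if link.endswith("404"):
        return np.nan  # broken link -> missing value in the column
    return "article text for " + link

df = pd.DataFrame({"links": ["http://ok/1", "http://bad/404"]})
df["text"] = df["links"].apply(fetch)
print(df["text"].isna().tolist())  # [False, True]
```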
Update

I tried to handle it with ArticleException, as follows:
df:
title Link
Inside tiny tubes, water turns solid when it should be boiling http://news.mit.edu/2016/carbon-nanotubes-water-solid-boiling-1128
Four MIT students named 2017 Marshall Scholars http://news.mit.edu/2016/four-mit-students-marshall-scholars-11282
Saharan dust in the wind http://news.mit.edu/2016/saharan-dust-monsoons-11231
The science of friction on graphene http://news.mit.edu/2016/sliding-flexible-graphene-surfaces-1123

In:
import numpy as np
from newspaper import Article, ArticleException
import requests
def text_extractor2(link):
    article = Article(link)
    try:
        article.download()
        article.parse()
    except ArticleException:
        return np.nan
    return article

df['text'] = df['Link'].apply(text_extractor2)
df

Out:
title Link text
0 Inside tiny tubes, water turns solid when it s... http://news.mit.edu/2016/carbon-nanotubes-wate... <newspaper.article.Article object at 0x10c8a0320>
1 Four MIT students named 2017 Marshall Scholars http://news.mit.edu/2016/four-mit-students-mar... <newspaper.article.Article object at 0x1070df0f0>
2 Saharan dust in the wind http://news.mit.edu/2016/saharan-dust-monsoons... <newspaper.article.Article object at 0x107b035c0>
3 The science of friction on graphene http://news.mit.edu/2016/sliding-flexible-grap... <newspaper.article.Article object at 0x10c8bf8d0>

Posted on 2016-11-28 18:31:46
As I understand it, you want the rows corresponding to broken links to have a NaN value in the text column. If you haven't already, we can first add the numpy import:

import numpy as np

I'll assume the exception thrown is HTTPError, and will use NumPy's NaN as the missing value:
from requests.exceptions import HTTPError

def text(link):
    article = Article(link)
    try:
        article.download()
    except HTTPError:
        return np.nan
    article.parse()
    return article.text  # return the extracted text, not the Article object

Then apply it with pandas:

df['text'] = df['links'].apply(text)

The text column should then contain missing values for the broken links and the article text for the valid links.
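Once the column is populated this way, the NaN markers make the broken links easy to isolate with pandas' standard missing-data tools. A minimal sketch with stand-in data (the links and texts below are invented, representing the state of df after the apply step):

```python
import numpy as np
import pandas as pd

# Stand-in result of the apply step: two articles fetched, one broken link.
df = pd.DataFrame({
    "links": ["http://a", "http://b", "http://c"],
    "text": ["alpha", np.nan, "gamma"],
})

broken = df[df["text"].isna()]       # rows whose download failed
valid = df.dropna(subset=["text"])   # rows with extracted text
print(broken["links"].tolist())  # ['http://b']
print(len(valid))                # 2
```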
Without using newspaper, you could modify the function to catch the exception raised by ur.urlopen(url).read():
import urllib.request as ur
import numpy as np
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize

def text_extractor(url):
    try:
        html = ur.urlopen(url).read()
    except ur.HTTPError:
        return np.nan
    soup = BeautifulSoup(html, 'lxml')
    # Drop script and style elements before extracting visible text.
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    # Normalize whitespace: strip each line, split multi-space runs
    # into phrases, and rejoin the non-empty chunks with single spaces.
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = ' '.join(chunk for chunk in chunks if chunk)
    sentences = ', '.join(sent_tokenize(str(text.strip('\'"'))))
    return sentences

https://stackoverflow.com/questions/40850841
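The whitespace-normalization pipeline in the middle of text_extractor works on plain strings, so it can be demonstrated without any network access or BeautifulSoup (the sample string below is invented for illustration):

```python
# The same strip/split/rejoin pipeline as in text_extractor,
# applied to a raw string with messy line breaks and spacing.
raw = "  Title  \n\n  First sentence.  Second  sentence.  \n"
lines = (line.strip() for line in raw.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = " ".join(chunk for chunk in chunks if chunk)
print(text)  # Title First sentence. Second sentence.
```

Empty lines and repeated spaces collapse away because empty chunks are filtered out before the final join.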