I wrapped the process of extracting text from urls into a function:
def text(link):
    article = Article(link)
    article.download()
    article.parse()
    return article

I plan to apply this function to a pandas column:
df['text'] = df['links'].apply(text)

However, some of the links in the links column are broken (i.e. HTTPError: HTTP Error 404: Not Found). So my question is: how can I put NaN in the cells for the broken urls and skip over them? I tried:
from newspaper import Article
import numpy as np
import requests
def text(link):
    article = Article(link)
    try:
        article.download()
        article.parse()
    except requests.exceptions.HTTPError:
        return np.nan
    return article

df['text'] = df['links'].apply(text)

However, I don't know whether apply() can be handled this way, so that NaN values end up in the cells whose links are broken.
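As a side note on the apply() mechanics: whatever the function returns, including np.nan, is written into the corresponding cell, so returning NaN from the except branch is all that is needed. A minimal sketch with a toy stand-in for the network call (the fetch function and the sample links are invented for illustration):

```python
import numpy as np
import pandas as pd

def fetch(link):
    # Toy stand-in for the real text() function: treat links
    # ending in "404" as broken and return NaN for them.
    if link.endswith("404"):
        return np.nan  # broken link -> missing value in the column
    return "article text for " + link

df = pd.DataFrame({"links": ["http://ok/1", "http://bad/404"]})
df["text"] = df["links"].apply(fetch)
print(df["text"].isna().tolist())  # [False, True]
```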
Update

I tried to handle it with ArticleException, as follows:
df:
title Link
Inside tiny tubes, water turns solid when it should be boiling http://news.mit.edu/2016/carbon-nanotubes-water-solid-boiling-1128
Four MIT students named 2017 Marshall Scholars http://news.mit.edu/2016/four-mit-students-marshall-scholars-11282
Saharan dust in the wind http://news.mit.edu/2016/saharan-dust-monsoons-11231
The science of friction on graphene http://news.mit.edu/2016/sliding-flexible-graphene-surfaces-1123

In:
import numpy as np
from newspaper import Article, ArticleException
import requests
def text_extractor2(link):
    article = Article(link)
    try:
        article.download()
        article.parse()
    except ArticleException:
        return np.nan
    return article

df['text'] = df['Link'].apply(text_extractor2)
df

Out:
title Link text
0 Inside tiny tubes, water turns solid when it s... http://news.mit.edu/2016/carbon-nanotubes-wate... <newspaper.article.Article object at 0x10c8a0320>
1 Four MIT students named 2017 Marshall Scholars http://news.mit.edu/2016/four-mit-students-mar... <newspaper.article.Article object at 0x1070df0f0>
2 Saharan dust in the wind http://news.mit.edu/2016/saharan-dust-monsoons... <newspaper.article.Article object at 0x107b035c0>
3 The science of friction on graphene http://news.mit.edu/2016/sliding-flexible-grap... <newspaper.article.Article object at 0x10c8bf8d0>

Posted on 2016-11-28 18:31:46
As I understand it, you want the rows corresponding to broken links to have a NaN value in the text column. If you haven't already, we can first add the numpy import:

import numpy as np

I'll assume the exception thrown is HTTPError, and will use NumPy's NaN as the missing value:
from requests.exceptions import HTTPError

def text(link):
    article = Article(link)
    try:
        article.download()
    except HTTPError:
        return np.nan
    article.parse()
    return article.text  # return the extracted text, not the Article object

Then apply it with pandas:

df['text'] = df['links'].apply(text)

The text column should then contain missing values for the broken links and the article text for the valid links.
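Once the column is populated this way, the NaN markers make the broken links easy to isolate with pandas' standard missing-data tools. A minimal sketch with stand-in data (the links and texts below are invented, representing the state of df after the apply step):

```python
import numpy as np
import pandas as pd

# Stand-in result of the apply step: two articles fetched, one broken link.
df = pd.DataFrame({
    "links": ["http://a", "http://b", "http://c"],
    "text": ["alpha", np.nan, "gamma"],
})

broken = df[df["text"].isna()]       # rows whose download failed
valid = df.dropna(subset=["text"])   # rows with extracted text
print(broken["links"].tolist())  # ['http://b']
print(len(valid))                # 2
```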
Without using newspaper, you could modify the function to catch the exception raised by ur.urlopen(url).read():
import urllib.request as ur
import numpy as np
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize

def text_extractor(url):
    try:
        html = ur.urlopen(url).read()
    except ur.HTTPError:
        return np.nan
    soup = BeautifulSoup(html, 'lxml')
    # Drop script and style elements before extracting visible text.
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    # Normalize whitespace: strip each line, split multi-space runs
    # into phrases, and rejoin the non-empty chunks with single spaces.
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = ' '.join(chunk for chunk in chunks if chunk)
    sentences = ', '.join(sent_tokenize(str(text.strip('\'"'))))
    return sentences

https://stackoverflow.com/questions/40850841
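The whitespace-normalization pipeline in the middle of text_extractor works on plain strings, so it can be demonstrated without any network access or BeautifulSoup (the sample string below is invented for illustration):

```python
# The same strip/split/rejoin pipeline as in text_extractor,
# applied to a raw string with messy line breaks and spacing.
raw = "  Title  \n\n  First sentence.  Second  sentence.  \n"
lines = (line.strip() for line in raw.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = " ".join(chunk for chunk in chunks if chunk)
print(text)  # Title First sentence. Second sentence.
```

Empty lines and repeated spaces collapse away because empty chunks are filtered out before the final join.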