Can't find publish_date with newspaper3k

Stack Overflow user
Asked 2022-10-20 15:29:34
Answers: 1, Views: 53, Followers: 0, Votes: 0

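I want to scrape an article from a website with the newspaper library (newspaper3k). However, it cannot find the article's publish_date. In the page source the date is in div.source-date, and the author (or rather the source) is in div.delfi-source-name. How can I scrape the date and the author/source?

Example site/URL: https://www.delfi.lt/en/politics/foreign-ministry-tsikhanouskayas-consultation-needed-for-treating-belarusians-in-lithuania.d?id=91531501

My code: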

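import pandas as pd
from newspaper import Article

# Example URL from the question
url = "https://www.delfi.lt/en/politics/foreign-ministry-tsikhanouskayas-consultation-needed-for-treating-belarusians-in-lithuania.d?id=91531501"

article = Article(url)
article.download()
article.parse()
article.nlp()  # needs the NLTK 'punkt' data; only fills article.keywords and article.summary

df = pd.DataFrame([{'Title': article.title, 'Author': article.authors, 'Text': article.text,
                    'published_date': article.publish_date, 'Source': article.source_url}])

df.to_excel('Delfi-1.xlsx')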

Any suggestions?

1 Answer

Stack Overflow user

Answered 2022-10-22 12:26:20

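The date appears in two places in the page source. The one you see on the page, Wednesday, October 19, 2022, sits in a div tag that newspaper3k cannot parse on its own; extracting it requires BeautifulSoup.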
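For the visible date, a minimal sketch with BeautifulSoup might look like the following. It is not part of the original answer. It assumes the HTML has already been fetched (for example from `article.html` after `article.download()`), takes the `div.source-date` selector from the question, and treats `div.delfi-source-name` as an assumed class name for the source element. The function name `extract_date_and_source` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_date_and_source(html, date_selector="div.source-date",
                            source_selector="div.delfi-source-name"):
    """Return (date_text, source_text); each is None if its element is absent."""
    soup = BeautifulSoup(html, "html.parser")
    date_tag = soup.select_one(date_selector)
    source_tag = soup.select_one(source_selector)
    date_text = date_tag.get_text(strip=True) if date_tag is not None else None
    source_text = source_tag.get_text(strip=True) if source_tag is not None else None
    return date_text, source_text


# Hypothetical markup for illustration only; the real page may be structured differently.
sample_html = """
<div class="delfi-source-name">Example Source</div>
<div class="source-date">Wednesday, October 19, 2022</div>
"""

date_text, source_text = extract_date_and_source(sample_html)
print(date_text)
print(source_text)
```

If either selector does not match the live page, the corresponding value is simply `None`, so the selectors can be adjusted without touching the rest of the code.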

The second date is hidden in the meta tags, which newspaper3k can parse with a little extra code.

from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException

# Send a browser-like user agent and stop waiting after 10 seconds
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.delfi.lt/en/politics/foreign-ministry-tsikhanouskayas-consultation-needed-for-treating-belarusians-in-lithuania.d?id=91531501'
try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()
    # meta_data holds the page's meta tags as a nested dictionary
    article_meta_data = article.meta_data

    # Title from the Open Graph ('og') tags
    article_title = [value['title'] for (key, value) in article_meta_data.items() if key == 'og']
    print(article_title)

    # Publish time from the 'cXenseParse' tags
    article_published_date = [value['recs']['publishtime'] for key, value in article_meta_data.items()
                              if key == 'cXenseParse']
    print(article_published_date)

    # Description from the Open Graph ('og') tags
    article_description = [value['description'] for (key, value) in article_meta_data.items() if key == 'og']
    print(article_description)

except ArticleException as error:
    print(error)

Output

["Foreign Ministry: Tsikhanouskaya's consultation needed for treating Belarusians in Lithuania"]
['2022-10-19T11:38:07+0300']
["As Belorus, a Belarus-owned sanatorium in Lithuania's southern resort of Druskininkai, complaints over the fact that Lithuania fails to issue visas to Belarusian citizens, forcing the sanatorium to fire a quarter of its staff, Lithuania's Foreign Ministry suggests coordinating the list of arrivals with Belarusian opposition leaders Sviatlana Tsikhanouskaya's office in Vilnius."]
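The published time comes back as a string inside a nested dictionary. As a sketch that is not part of the original answer, the lookup can be made tolerant of missing keys and the string turned into a `datetime`. The helper names `get_nested` and `parse_publish_time` are hypothetical, the `meta_data` dictionary below is a hand-written stand-in for `article.meta_data`, and the format string assumes the timestamp always looks like the one in the output above.

```python
from datetime import datetime


def get_nested(mapping, *keys):
    """Walk nested dicts; return None as soon as a key is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def parse_publish_time(raw):
    """Parse a timestamp such as '2022-10-19T11:38:07+0300' into an aware datetime."""
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S%z")


# Stand-in for article.meta_data, reduced to the keys needed for the date.
meta_data = {
    "cXenseParse": {"recs": {"publishtime": "2022-10-19T11:38:07+0300"}},
}

raw = get_nested(meta_data, "cXenseParse", "recs", "publishtime")
published = parse_publish_time(raw) if raw else None
print(published.isoformat())  # 2022-10-19T11:38:07+03:00
```

If the value is going into Excel as in the question, note that, as far as I know, pandas refuses to write timezone-aware datetimes with `to_excel`, so something like `published.replace(tzinfo=None)` may be needed first.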

P.S. Newspaper3k has several ways to extract the publish date from an article. Take a look at this document I wrote on how to use Newspaper3k.

Votes: 0

Page content originally provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/74142518
