
How to use newspaper3k in Python with offline files

Stack Overflow user
Asked 2022-10-26 07:21:12
Answers: 1 | Views: 33 | Followers: 0 | Votes: 1

I need to extract articles/news from HTML files, and the best solution I have found is newspaper3k in Python. I get an empty result. I have tried many solutions, but I am stuck here.

from newspaper import Article
with open("index.html", 'r', encoding='utf-8') as f:
    article = Article('', language='en')
    article.download(input_html=f.read())
    article.parse()
    print(article.title)

Result: ""

It should print the text from the article tag in the HTML file.


1 Answer

Stack Overflow user

Accepted answer

Posted 2022-10-26 11:56:57

Your code looks correct.

I assume the problem is your source. What is in index.html? Can you provide the file, or the URL it was extracted from?
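One way to check the source before involving newspaper3k is to look for the places a headline usually lives. The sketch below uses only the standard library. The names `TitleProbe` and `probe_html` are hypothetical helpers, not part of newspaper3k, and it rests on the assumption that newspaper3k builds `article.title` mainly from the `<title>` element, so a page without one would yield an empty title.

```python
from html.parser import HTMLParser


class TitleProbe(HTMLParser):
    """Hypothetical helper: records where a headline could come from."""

    def __init__(self):
        super().__init__()
        self.title_text = ""          # text found inside <title>
        self.og_title = None          # content of <meta property="og:title">
        self.has_article_tag = False  # True if an <article> element exists
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("property") == "og:title":
            self.og_title = attrs.get("content")
        elif tag == "article":
            self.has_article_tag = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def probe_html(html):
    """Return a small report about title-related markup in an HTML string."""
    probe = TitleProbe()
    probe.feed(html)
    return {
        "title": probe.title_text.strip(),
        "og_title": probe.og_title,
        "has_article_tag": probe.has_article_tag,
    }


# A page that has body text in <article> but no <title> element at all
sample = "<html><head></head><body><article><p>Body only</p></article></body></html>"
print(probe_html(sample))
# {'title': '', 'og_title': None, 'has_article_tag': True}
```

If the report for index.html shows an empty `title` but `has_article_tag` is true, the body text may still be available through `article.text` after `article.parse()`, which is probably closer to "the text from the article tag" than `article.title` is.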

By the way, here is a code example for reading offline content with newspaper3k. It comes from my overview document on using newspaper3k: https://github.com/johnbumgarner/newspaper3_usage_overview

from newspaper import Config
from newspaper import Article

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.cnn.com/2020/10/12/health/johnson-coronavirus-vaccine-pause-bn/index.html'
article = Article(base_url, config=config)
article.download()
article.parse()
# Save the downloaded HTML; an explicit encoding avoids platform-dependent defaults
with open('cnn.html', 'w', encoding='utf-8') as fileout:
    fileout.write(article.html)


# Read the HTML file created above
with open("cnn.html", 'r', encoding='utf-8') as f:
    # note the empty URL string
    article = Article('', language='en')
    article.download(input_html=f.read())
    article.parse()

    print(article.title)
    # Johnson & Johnson pauses Covid-19 vaccine trial after 'unexplained illness'

    article_meta_data = article.meta_data

    article_published_date = {value for (key, value) in article_meta_data.items() if key == 'pubdate'}
    print(article_published_date)
    # {'2020-10-13T01:31:25Z'}

    article_author = {value for (key, value) in article_meta_data.items() if key == 'author'}
    print(article_author)
    # {'Maggie Fox, CNN'}

    article_summary = {value for (key, value) in article_meta_data.items() if key == 'description'}
    print(article_summary)
    # {'Johnson&Johnson said its Janssen arm had paused its coronavirus vaccine trial  after an "unexplained illness" in one
    # of the volunteers testing its experimental Covid-19 shot.'}

    article_keywords = {value for (key, value) in article_meta_data.items() if key == 'keywords'}
    print(article_keywords)
    # {"health, Johnson & Johnson pauses Covid-19 vaccine trial after 'unexplained illness' - CNN"}
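Each set comprehension above selects a single key, so the result is a one-element set. Assuming `article.meta_data` behaves like an ordinary dictionary, a direct lookup may be simpler and returns the value as a plain string. The sketch below uses a stand-in dictionary filled with values printed above, and `meta_value` is a hypothetical helper, not a newspaper3k function.

```python
# Stand-in for article.meta_data, filled with values printed in the example above
meta_data = {
    "pubdate": "2020-10-13T01:31:25Z",
    "author": "Maggie Fox, CNN",
}


def meta_value(meta, key):
    """Return the metadata value for key, or None if it is absent or empty."""
    value = meta.get(key)
    # Treat empty placeholders ('' or {}) the same as a missing key
    return value if value else None


print(meta_value(meta_data, "pubdate"))   # 2020-10-13T01:31:25Z
print(meta_value(meta_data, "keywords"))  # None
```

Returning `None` for a missing key makes it easy to tell "the page did not declare this field" apart from a real value, which an empty set does less obviously.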
Votes: 1
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/74204036
