Question: How can I use the Newspaper3k library without downloading the article?

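Stack Overflow user

Asked 2019-06-20 08:50:43

3 answers, 2.7K views, 0 followers, 7 votes

Suppose I have local copies of news articles. How can I run newspaper on those articles? According to the documentation, normal use of the newspaper library looks something like this: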

from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
# ...

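In my case, I do not need to download the article from a web page, because I already have a local copy of the page. How can I use newspaper on a local copy of a web page?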

python-newspaper

python

3 Answers

Stack Overflow user

Answered 2019-06-20 09:23:49

You can, it's just a bit hacky. As an example:

import requests
from newspaper import Article

url = 'https://www.cnn.com/2019/06/19/india/chennai-water-crisis-intl-hnk/index.html'

# get sample html
r = requests.get(url)

# save to file
with open('file.html', 'wb') as fh:
    fh.write(r.content)

a = Article(url)

# set html manually
with open("file.html", 'rb') as fh:
    a.html = fh.read()

# need to set download_state to 2 for this to work
a.download_state = 2

a.parse()

# Now the article should be populated
a.text

# 'New Delhi (CNN) The floor...'

Where download_state comes from, as seen in this snippet from newspaper/article.py:

# /path/to/site-packages/newspaper/article.py
class ArticleDownloadState(object):
    NOT_STARTED = 0
    FAILED_RESPONSE = 1
    SUCCESS = 2

~snip~

# This is why you need to set that variable
class Article:
    def __init__(...):
        ~snip~
        # Keep state for downloads and parsing
        self.is_parsed = False
        self.download_state = ArticleDownloadState.NOT_STARTED
        self.download_exception_msg = None

    def parse(self):
        # will throw exception if download_state isn't 2
        self.throw_if_not_downloaded_verbose()

        self.doc = self.config.get_parser().fromstring(self.html)
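
To make the guard described above concrete, here is a minimal standard-library sketch of the same idea. `DownloadState` and `ToyArticle` are hypothetical stand-ins written for illustration, not newspaper's actual classes, and the real check in `throw_if_not_downloaded_verbose()` may differ in its details. The point is only that setting `html` alone is not enough: `parse()` looks at the state flag.

```python
from enum import IntEnum


class DownloadState(IntEnum):
    """Hypothetical stand-in mirroring the values of ArticleDownloadState."""
    NOT_STARTED = 0
    FAILED_RESPONSE = 1
    SUCCESS = 2


class ToyArticle:
    """Simplified model of the guard in Article.parse(); not the library's code."""

    def __init__(self):
        self.html = ""
        self.download_state = DownloadState.NOT_STARTED
        self.is_parsed = False

    def parse(self):
        # Refuse to parse unless the state says the HTML has been obtained.
        if self.download_state != DownloadState.SUCCESS:
            raise RuntimeError("set the HTML and mark it downloaded before parse()")
        self.is_parsed = True


article = ToyArticle()
article.html = "<html><body><p>local copy</p></body></html>"

try:
    article.parse()  # html is set, but the state is still NOT_STARTED
except RuntimeError as exc:
    first_error = str(exc)

article.download_state = DownloadState.SUCCESS  # same value as the literal 2
article.parse()  # now passes the guard
```

With the real library, assigning a named constant (assuming `ArticleDownloadState` is importable from `newspaper.article`, as the file path in the snippet suggests) would read more clearly than the bare `2`.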

Alternatively, you could subclass Article and override parse so that it loads the local file for you:

from newspaper import Article
import io

class localArticle(Article):
    def __init__(self, url, **kwargs):
        # set url to be file_name in __init__ if it's a file handle
        super().__init__(url if isinstance(url, str) else url.name, **kwargs)
        # set standalone _url attr so that parse will work as expected
        self._url = url

    def parse(self):

        # sets html and things for you
        if isinstance(self._url, str):
            with open(self._url, 'rb') as fh:
                self.html = fh.read()

        elif isinstance(self._url, (io.TextIOWrapper, io.BufferedReader)):
            self.html = self._url.read()

        else:
            raise TypeError(f"Expected file path or file-like object, got {self._url.__class__}")

        self.download_state = 2
        # now parse will continue on with the proper params set
        super(localArticle, self).parse()


a = localArticle('file.html') # pass your file name here
a.parse()

a.text[:10]
# 'New Delhi '

# or you can give it a file handle
with open("file.html", 'rb') as fh:
    a = localArticle(fh)
    a.parse()

a.text[:10]
# 'New Delhi '

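Votes: 6

Stack Overflow user

Answered 2020-09-08 22:25:07

As has been mentioned elsewhere, there is indeed an official way to solve this.

Once the HTML is loaded into your program, you can use the set_html() method to set it as article.html: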

import newspaper
with open("file.html", 'rb') as fh:
    ht = fh.read()
article = newspaper.Article(url = ' ')
article.set_html(ht)
article.parse()
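
If there is a whole folder of saved pages, the same set_html() approach can be wrapped in a small helper. The sketch below is only one way to arrange it: `iter_html_files`, `parse_local`, and `titles_in` are hypothetical names introduced here, the empty URL is a placeholder (as in the snippets in this thread), and it assumes the pages are saved with an `.html` extension.

```python
from pathlib import Path


def iter_html_files(directory):
    """Yield the .html files directly inside `directory`, in sorted order."""
    yield from sorted(Path(directory).glob("*.html"))


def parse_local(path):
    """Parse one saved page with newspaper, without any network access."""
    # Imported here so iter_html_files() works without the third-party package.
    from newspaper import Article

    article = Article(url="")  # placeholder URL; nothing is fetched
    article.set_html(Path(path).read_bytes())
    article.parse()
    return article


def titles_in(directory):
    """Map each saved file name to the title newspaper extracts from it."""
    return {p.name: parse_local(p).title for p in iter_html_files(directory)}
```

Calling `titles_in("saved_pages")` on a folder of saved articles would then return a dictionary of file names to extracted titles; the folder name is just an example.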

Votes: 5

Stack Overflow user

Answered 2020-10-13 10:56:16

I'm sure you have already solved this, but Newspaper is able to process locally stored HTML files.

from newspaper import Article

# Downloading the HTML for the article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
with open('fox13no.html', 'w') as fileout:
   fileout.write(article.html)

# Read the locally stored HTML with Newspaper
with open("fox13no.html", 'r') as f:
   # note the URL string is empty
   article = Article('', language='en')
   article.download(input_html=f.read())
   article.parse()
   print(article.title)
   # New Year, new laws: Obamacare, pot, guns and drones

Votes: 2

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/56677636
