文章/答案/技术大牛

发布

社区首页 >问答首页 >用Newspaper3k进行网络抓取，只有50篇文章

问用Newspaper3k进行网络抓取，只有50篇文章
EN

Stack Overflow用户

提问于 2020-09-07 18:47:38

回答 1查看 1.6K关注 0票数 2

我想刮数据在一个法语网站与newspaper3k和结果将只有50篇文章。这个网站有50多篇文章。我哪里错了？

我的目标是刮掉这个网站上所有的文章。

我试过这个：

import newspaper

legorafi_paper = newspaper.build('http://www.legorafi.fr/', memoize_articles=False)

# Empty list to put all urls
papers = []

for article in legorafi_paper.articles:
    papers.append(article.url)

print(legorafi_paper.size())

这次印刷的结果是50篇文章。

我不明白为什么newspaper3k只会在上刮 50篇文章，而不会有更多文章。

我试过的最新情况：

def Foo(firstTime = []):
    if firstTime == []:
        WebDriverWait(driver, 30).until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR,"div#appconsent>iframe")))
        firstTime.append('Not Empty')
    else:
        print('Cookies already accepted')


%%time


categories = ['societe', 'politique']


import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

import newspaper
import requests
from newspaper.utils import BeautifulSoup
from newspaper import Article

categories = ['people', 'sports']
papers = []


driver = webdriver.Chrome(executable_path="/Users/name/Downloads/chromedriver 4")
driver.get('http://www.legorafi.fr/')


for category in categories:
    url = 'http://www.legorafi.fr/category/' + category
    #WebDriverWait(self.driver, 10)
    driver.get(url)
    Foo()
    WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.button--filled>span.baseText"))).click()

    pagesToGet = 2

    title = []
    content = []
    for page in range(1, pagesToGet+1):
        print('Processing page :', page)
        #url = 'http://www.legorafi.fr/category/france/politique/page/'+str(page)
        print(driver.current_url)
        #print(url)

        time.sleep(3)

        raw_html = requests.get(url)
        soup = BeautifulSoup(raw_html.text, 'html.parser')
        for articles_tags in soup.findAll('div', {'class': 'articles'}):
            for article_href in articles_tags.find_all('a', href=True):
                if not str(article_href['href']).endswith('#commentaires'):
                    urls_set.add(article_href['href'])
                    papers.append(article_href['href'])


        for url in papers:
            article = Article(url)
            article.download()
            article.parse()
            if article.title not in title:
                title.append(article.title)
            if article.text not in content:
                content.append(article.text)
            #print(article.title,article.text)

        time.sleep(5)
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        driver.find_element_by_xpath("//a[contains(text(),'Suivant')]").click()
        time.sleep(10)

python

newspaper3k

回答 1

Stack Overflow用户

回答已采纳

发布于 2020-09-19 20:19:54

更新09-21-2020年

我重新检查了您的代码，代码运行正常，因为它正在提取勒戈拉菲主页上的所有文章。这一页上的文章是类别页面中的高亮部分，如兴业、体育等。

下面的示例来自主页的源代码。这些文章中的每一篇也被列在类别页面上--体育。

<div class="cat sports">
    <a href="http://www.legorafi.fr/category/sports/">
       <h4>Sports</h4>
          <ul>
              <li>
                 <a href="http://www.legorafi.fr/2020/07/24/chaque-annee-25-des-lutteurs-doivent-etre-operes-pour-defaire-les-noeuds-avec-leur-bras/" title="Voir l'article 'Chaque année, 25% des lutteurs doivent être opérés pour défaire les nœuds avec leur bras'">
                  Chaque année, 25% des lutteurs doivent être opérés pour défaire les nœuds avec leur bras</a>
              </li>
               <li>
                <a href="http://www.legorafi.fr/2020/07/09/frank-mccourt-lom-nest-pas-a-vendre-sauf-contre-beaucoup-dargent/" title="Voir l'article 'Frank McCourt « L'OM n'est pas à vendre sauf contre beaucoup d'argent »'">
                  Frank McCourt « L'OM n'est pas à vendre sauf contre beaucoup d'argent </a>
              </li>
              <li>
                <a href="http://www.legorafi.fr/2020/06/10/euphorique-un-parieur-appelle-son-fils-betclic/" title="Voir l'article 'Euphorique, un parieur appelle son fils Betclic'">
                  Euphorique, un parieur appelle son fils Betclic                 </a>
              </li>
           </ul>
               <img src="http://www.legorafi.fr/wp-content/uploads/2015/08/rubrique_sport1-300x165.jpg"></a>
        </div>
              </div>

主页面上似乎有35条独特的文章条目。

import newspaper

legorafi_paper = newspaper.build('http://www.legorafi.fr', memoize_articles=False)

papers = []
urls_set = set()
for article in legorafi_paper.articles:
   # check to see if the article url is not within the urls_set
   if article.url not in urls_set:
     # add the unique article url to the set
     urls_set.add(article.url)
     # remove all links for article commentaires
     if not str(article.url).endswith('#commentaires'):
        papers.append(article.url)

 print(len(papers)) 
 # output
 35

如果我将上面代码中的URL更改为：http://www.legorafi.fr/category/sports，它将返回与http://www.legorafi.fr相同数量的文章。在查看了GitHub上报纸的源代码之后，该模块似乎使用了乌尔帕斯，它似乎使用的是the解析的netloc部分。netloc是www.legorafi.fr。我注意到这是一个已知的基于这个开放的问题。的报纸问题。

为了获得所有文章，它变得更加复杂，因为您必须使用一些额外的模块，包括请求和BeautifulSoup。后者可以从报纸上调用。可以对下面的代码进行改进，以便使用请求和BeautifulSoup在主页和类别页上获得源代码中的所有文章。

import newspaper
import requests
from newspaper.utils import BeautifulSoup

papers = []
urls_set = set()

legorafi_paper = newspaper.build('http://www.legorafi.fr', 
fetch_images=False, memoize_articles=False)
for article in legorafi_paper.articles:
   if article.url not in urls_set:
     urls_set.add(article.url)
     if not str(article.url).endswith('#commentaires'):
       papers.append(article.url)

 
legorafi_urls = {'monde-libre': 'http://www.legorafi.fr/category/monde-libre',
             'politique': 'http://www.legorafi.fr/category/france/politique',
             'societe': 'http://www.legorafi.fr/category/france/societe',
             'economie': 'http://www.legorafi.fr/category/france/economie',
             'culture': 'http://www.legorafi.fr/category/culture',
             'people': 'http://www.legorafi.fr/category/people',
             'sports': 'http://www.legorafi.fr/category/sports',
             'hi-tech': 'http://www.legorafi.fr/category/hi-tech',
             'sciences': 'http://www.legorafi.fr/category/sciences',
             'ledito': 'http://www.legorafi.fr/category/ledito/'
             }


for category, url in legorafi_urls.items():
   raw_html = requests.get(url)
   soup = BeautifulSoup(raw_html.text, 'html.parser')
   for articles_tags in soup.findAll('div', {'class': 'articles'}):
      for article_href in articles_tags.find_all('a', href=True):
         if not str(article_href['href']).endswith('#commentaires'):
           urls_set.add(article_href['href'])
           papers.append(article_href['href'])

   print(len(papers))
   # output
   155

如果您需要获取类别页面的子页面中列出的文章( listed目前有120个子页面)，那么您必须使用Selenium之类的方法来单击链接。

希望这段代码能帮助您更接近于实现目标。

票数 4

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/63782791

复制

相似问题

问用Newspaper3k进行网络抓取，只有50篇文章
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用Newspaper3k进行网络抓取，只有50篇文章EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问用Newspaper3k进行网络抓取，只有50篇文章
EN