Q: How to iterate over pages and get the link and title of each news article

Stack Overflow user

Asked 2020-05-18 15:09:21

Answers: 1 · Views: 310 · Followers: 0 · Votes: 0

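I am scraping 10 pages from this site: https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance (page 1 and the pages that follow it).

I expected a total of 100 links and titles to be stored in pagelinks and title. However, only 10 links and 10 titles are saved.

How can I scrape all 10 pages and store the article links and titles?

Any help would be much appreciated!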

def scrape(url):
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    request = 0
    urls = [f"{url}{x}" for x in range(1,11)]
    params = {
       "orderby": "relevance",
    }
    for page in urls:
        response = requests.get(url=page,
                                headers=user_agent,
                                params=params) 
        # controlling the crawl-rate
        start_time = time() 
        #pause the loop
        sleep(randint(8,15))
        #monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request/elapsed_time))
        clear_output(wait = True)
    
        #throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(request, response.status_code))

        #Break the loop if the number of requests is greater than expected
        if request > 72:
            warn('Number of request was greater than expected.')
            break
        
        
        #parse the content
        soup_page = bs(response.text) 
        #select all the articles for a single page
        containers = soup_page.findAll("li", {'class': 'article'})
        
        #scrape the links of the articles
        pagelinks = []
        for link in containers:
            url = link.find('a')
            pagelinks.append(url.get('href'))
        
        #scrape the titles of the articles
        title = []
        for link in containers:
            atitle = link.find(class_ = 'entry-heading').find('a')
            thetitle = atitle.get_text()
            title.append(thetitle)

    print(pagelinks)
    print(title)

python

loops

web-scraping

beautifulsoup

web-crawler

1 Answer

Stack Overflow user

Accepted answer

Answered 2020-05-18 15:11:02

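Move pagelinks = [] out of the for page in urls: loop.

Because it sits inside the for page in urls: loop, the pagelinks list is re-created on every page iteration, so in the end you only get the 10 links from the last page. The same applies to title = [].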
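The overwrite can be reproduced without any network access. The following is a minimal sketch, assuming a hypothetical fake_pages list (3 pages with 2 links each) in place of the real responses; the function names are illustrative and do not come from the question.

```python
# Hypothetical stand-in for the scraped pages: 3 pages with 2 links each.
fake_pages = [[f"/article-{p}-{i}" for i in range(2)] for p in range(3)]


def collect_reset_inside(pages):
    """Mirrors the question: the list is re-created on every page."""
    for page in pages:
        links = []  # reset on each iteration, earlier pages are lost
        for href in page:
            links.append(href)
    return links  # holds only the last page


def collect_init_outside(pages):
    """Mirrors the answer: the list is created once, before the loop."""
    links = []
    for page in pages:
        for href in page:
            links.append(href)
    return links  # holds every page


print(len(collect_reset_inside(fake_pages)))  # 2
print(len(collect_init_outside(fake_pages)))  # 6
```

With 10 real pages of 10 articles each, the same difference shows up as 10 links versus 100.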

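# imports inferred from the names used below
import requests
from random import randint
from time import sleep, time
from warnings import warn

from bs4 import BeautifulSoup as bs
from IPython.display import clear_output  # assumes a Jupyter/IPython session

def scrape(url):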
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    request = 0
    urls = [f"{url}{x}" for x in range(1,11)]
    params = {
       "orderby": "relevance",
    }
    pagelinks = []
    title = []
    # start the clock once, before the first request, so the frequency
    # printed below is an average over the whole crawl
    start_time = time()
    for page in urls:
        response = requests.get(url=page,
                                headers=user_agent,
                                params=params)
        # control the crawl rate: pause the loop
        sleep(randint(8,15))
        # monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request/elapsed_time))
        clear_output(wait = True)

        #throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(request, response.status_code))

        #Break the loop if the number of requests is greater than expected
        if request > 72:
            warn('Number of request was greater than expected.')
            break


        #parse the content
        soup_page = bs(response.text, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
        #select all the articles for a single page
        containers = soup_page.findAll("li", {'class': 'article'})

        #scrape the links of the articles
        for link in containers:
            anchor = link.find('a')  # renamed from url to avoid shadowing the function argument
            pagelinks.append(anchor.get('href'))

        #scrape the titles of the articles
        for link in containers:
            atitle = link.find(class_ = 'entry-heading').find('a')
            thetitle = atitle.get_text()
            title.append(thetitle)
    print(title)
    print(pagelinks)
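One possible refinement, offered as a sketch rather than as part of the accepted answer, is to separate parsing from fetching and to collect each link together with its title in a single pass, so the two lists cannot drift out of step. The helper name extract_articles and the sample markup are hypothetical. The markup only mimics the li.article and entry-heading selectors used above and is not copied from the real site. Unlike the code above, this sketch takes the href from the heading anchor instead of the first a tag in the container.

```python
from bs4 import BeautifulSoup


def extract_articles(html):
    """Return a list of (href, title) tuples, one per li.article in html."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for container in soup.find_all("li", class_="article"):
        heading = container.find(class_="entry-heading")
        anchor = heading.find("a") if heading else None
        if anchor is None or not anchor.get("href"):
            continue  # skip containers without a usable heading link
        results.append((anchor.get("href"), anchor.get_text(strip=True)))
    return results


# Hypothetical markup standing in for two downloaded result pages.
sample_pages = [
    '<ul><li class="article"><h3 class="entry-heading">'
    '<a href="/a1"> First </a></h3></li>'
    '<li class="article"><p>no heading here</p></li></ul>',
    '<ul><li class="article"><h3 class="entry-heading">'
    '<a href="/a2">Second</a></h3></li></ul>',
]

articles = []  # initialised once, outside the page loop
for html in sample_pages:
    articles.extend(extract_articles(html))

print(articles)
```

Because the parsing step takes a plain HTML string, it can be checked against saved pages without sending any requests, and the crawl loop only has to call it once per response.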

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/61872723
