
Q: Condition-based scraping with Selenium and Python

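Stack Overflow user

Asked 2021-08-10 07:10:33

Answers: 1 · Views: 55 · Followers: 0 · Votes: 0

I want to scrape the dates and the associated news headlines/articles for the last 6 days only. For example, if the Python script runs today, it should scrape the headlines/articles from today (August 10) back to August 4. I can already scrape the dates and headlines/URLs for all dates from the page. The code for that is below: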

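    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Chrome()  # assumption: the original does not show how `browser` is created

    websites = ['https://www.thespiritsbusiness.com/tag/rum/']
    for spirits in websites:
        browser.get(spirits)
        time.sleep(1)  # crude wait for the page to load

        # Each headline link sits in an <h3>, each date in a <small>
        news_links = browser.find_elements(By.XPATH, '//*[@id="archivewrapper"]/div/div[2]/h3')
        n_links = [ele.find_element(By.TAG_NAME, 'a').get_attribute('href') for ele in news_links]
        dates = browser.find_elements(By.XPATH, '//*[@id="archivewrapper"]/div/div[2]/small')
        n_dates = [ele.text for ele in dates]
        print(n_links)
        print(n_dates)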

But how do I restrict this to the last 6 days, counting back from today? Any ideas?

selenium

selenium-webdriver

web-scraping

python

1 Answer

Stack Overflow user

Answered 2021-08-10 07:15:28

Look at the URL of page 2:

https://www.thespiritsbusiness.com/tag/rum/page/2/

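This means that for each further page you need to append /page/N/ to the URL, where N is the page number (/page/2/ for the second page).

You can set the websites list to: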

websites = ['https://www.thespiritsbusiness.com/tag/rum/', 'https://www.thespiritsbusiness.com/tag/rum/page/2/', 'https://www.thespiritsbusiness.com/tag/rum/page/3/']

and so on, to achieve this.

Alternatively, you can build the URLs programmatically:

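base_url = 'https://www.thespiritsbusiness.com/tag/rum/'
max_pages = 3  # assumption: how many pages to visit; adjust as needed
for page_number in range(1, max_pages + 1):
    # Page 1 is the base URL itself; later pages append page/N/
    url = base_url if page_number == 1 else f"{base_url}page/{page_number}/"
    browser.get(url)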
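
Pagination alone does not limit the results to the last 6 days, so the scraped dates still need to be compared against a cutoff. The following is only a sketch of one way to do that with the standard library. The date format `"%d %B %Y"` (for example "10 August 2021") is an assumption, since the actual text of the `<small>` elements is not shown in the question, and `is_recent` and `filter_recent` are hypothetical helper names.

```python
from datetime import date, datetime, timedelta

# ASSUMPTION: the scraped date text looks like "10 August 2021".
# Change DATE_FORMAT to match what the <small> elements really contain.
DATE_FORMAT = "%d %B %Y"


def is_recent(date_text, today, days=6, date_format=DATE_FORMAT):
    """Return True if date_text lies between (today - days) and today, inclusive."""
    published = datetime.strptime(date_text.strip(), date_format).date()
    return today - timedelta(days=days) <= published <= today


def filter_recent(dates, links, today, days=6, date_format=DATE_FORMAT):
    """Pair each date with its link and keep only the pairs inside the window."""
    return [
        (d, link)
        for d, link in zip(dates, links)
        if is_recent(d, today, days, date_format)
    ]
```

With the lists from the question, this could be called as `filter_recent(n_dates, n_links, date.today())`. If the site lists articles newest first (also an assumption), the page loop can stop as soon as a page returns fewer recent items than it contains.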

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/68722400
