Q: Crawling through web links with Beautiful Soup

Stack Overflow user

Asked 2019-08-11 06:53:22

2 answers · 294 views · 0 followers · 1 vote

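I am trying to scrape the blog "https://blog.feedspot.com/ai_rss_feeds/" and crawl through all the links on it, looking for information related to artificial intelligence on each crawled page.

The blog follows a pattern: it lists multiple RSS feeds, and each feed has an attribute called "Site" in the UI. I need to get all the links in the "Site" attribute, for example aitrends.com, sciencedaily.com/... In the markup, the main div has a class called "rss-block", which contains a nested element with the class "data". Each "data" element holds several child tags, and those tags contain <a> elements. The href value of each <a> gives the link to crawl. In every link found by scraping this structure, we need to look for AI-related articles.

I have tried several variations of the following code, but none of them helped much.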

import requests
from bs4 import BeautifulSoup

page = requests.get('https://blog.feedspot.com/ai_rss_feeds/')
soup = BeautifulSoup(page.text, 'html.parser')

class_name='data'

dataSoup = soup.find(class_=class_name)
print(dataSoup)
artist_name_list_items = dataSoup.find('a', href=True)
print(artist_name_list_items)

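I am struggling even to get the links from that page, let alone to crawl through each link and scrape the AI-related articles in it.

If you could help me with both parts of the problem, it would be a great learning experience for me. For the HTML structure, please refer to the source of https://blog.feedspot.com/ai_rss_feeds/. Thanks in advance!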

beautifulsoup

python

web-scraping

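2 Answers

Stack Overflow user

Accepted answer

Posted 2019-08-11 08:31:59

The first 20 results are stored in the HTML, as shown on the page. The others are pulled from a script tag, and you can extract them with a regular expression to build the full list of 67. Then loop over that list and issue a request to each link to get further information. I offer two different selectors for populating the initial list (the second one, commented out, uses :contains, which is available with bs4 4.7.1+).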

from bs4 import BeautifulSoup as bs
import requests, re

p = re.compile(r'feed_domain":"(.*?)",')

with requests.Session() as s:
    r = s.get('https://blog.feedspot.com/ai_rss_feeds/')
    soup = bs(r.content, 'lxml')
    results = [i['href'] for i in soup.select('.data [rel="noopener nofollow"]:last-child')]
    # Alternative selector; needs bs4 4.7.1+. Newer Soup Sieve releases spell :contains as :-soup-contains.
    #results = [i['href'] for i in soup.select('strong:contains(Site) + a')]
    results+= [re.sub(r'\n\s+','',i.replace('\\','')) for i in p.findall(r.text)]

    for link in results:
        #do something e.g.
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # extract info from indiv page
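
To see what the regular expression step does in isolation, here is a minimal offline sketch. It reuses the pattern and the two clean-up calls from the code above, but the `sample` string and the `extract_feed_domains` helper are made up for illustration and are not taken from the real page, whose script content may look different.

```python
import re

# Same pattern as in the answer: capture the value of each "feed_domain" key.
FEED_DOMAIN = re.compile(r'feed_domain":"(.*?)",')


def extract_feed_domains(script_text):
    """Return cleaned URLs found in JSON-like script text (hypothetical helper)."""
    cleaned = []
    for raw in FEED_DOMAIN.findall(script_text):
        url = raw.replace('\\', '')      # drop backslash escaping: https:\/\/ -> https://
        url = re.sub(r'\n\s+', '', url)  # remove a newline followed by whitespace, if present
        cleaned.append(url)
    return cleaned


# Hypothetical sample, invented for this sketch; not copied from the page.
sample = r'[{"feed_domain":"https:\/\/example.com\/blog\/","id":1},{"feed_domain":"http:\/\/example.org\/","id":2}]'

urls = extract_feed_domains(sample)
print(urls)
```

The lazy `(.*?)` stops at the first `",` after each key, so every match holds one URL. On the live page you would pass `r.text` instead of `sample`.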

1 vote

Stack Overflow user

Posted 2019-08-11 09:16:42

To get all the child links of each block, you can use soup.find_all:

from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://blog.feedspot.com/ai_rss_feeds/').text, 'html.parser')
results = [[i['href'] for i in c.find('div', {'class':'data'}).find_all('a')] for c in d.find_all('div', {'class':'rss-block'})]

Output:

[['http://aitrends.com/feed', 'https://www.feedspot.com/?followfeedid=4611684', 'http://aitrends.com/'], ['https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml', 'https://www.feedspot.com/?followfeedid=4611682', 'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/'], ['http://machinelearningmastery.com/blog/feed', 'https://www.feedspot.com/?followfeedid=4575009', 'http://machinelearningmastery.com/blog/'], ['http://news.mit.edu/rss/topic/artificial-intelligence2', 'https://www.feedspot.com/?followfeedid=4611685', 'http://news.mit.edu/topic/artificial-intelligence2'], ['https://www.reddit.com/r/artificial/.rss', 'https://www.feedspot.com/?followfeedid=4434110', 'https://www.reddit.com/r/artificial/'], ['https://chatbotsmagazine.com/feed', 'https://www.feedspot.com/?followfeedid=4470814', 'https://chatbotsmagazine.com/'], ['https://chatbotslife.com/feed', 'https://www.feedspot.com/?followfeedid=4504512', 'https://chatbotslife.com/'], ['https://aws.amazon.com/blogs/ai/feed', 'https://www.feedspot.com/?followfeedid=4611538', 'https://aws.amazon.com/blogs/ai/'], ['https://developer.ibm.com/patterns/category/artificial-intelligence/feed', 'https://www.feedspot.com/?followfeedid=4954414', 'https://developer.ibm.com/patterns/category/artificial-intelligence/'], ['https://lexfridman.com/category/ai/feed', 'https://www.feedspot.com/?followfeedid=4968322', 'https://lexfridman.com/ai/'], ['https://medium.com/feed/@Francesco_AI', 'https://www.feedspot.com/?followfeedid=4756982', 'https://medium.com/@Francesco_AI'], ['https://blog.netcoresmartech.com/rss.xml', 'https://www.feedspot.com/?followfeedid=4998378', 'https://blog.netcoresmartech.com/'], ['https://www.aitimejournal.com/feed', 'https://www.feedspot.com/?followfeedid=4979214', 'https://www.aitimejournal.com/'], ['https://blogs.nvidia.com/feed', 'https://www.feedspot.com/?followfeedid=4611576', 'https://blogs.nvidia.com/'], ['http://feeds.feedburner.com/AIInTheNews', 
'https://www.feedspot.com/?followfeedid=623918', 'http://aitopics.org/whats-new'], ['https://blogs.technet.microsoft.com/machinelearning/feed', 'https://www.feedspot.com/?followfeedid=4431827', 'https://blogs.technet.microsoft.com/machinelearning/'], ['https://machinelearnings.co/feed', 'https://www.feedspot.com/?followfeedid=4611235', 'https://machinelearnings.co/'], ['https://www.artificial-intelligence.blog/news?format=RSS', 'https://www.feedspot.com/?followfeedid=4611100', 'https://www.artificial-intelligence.blog/news/'], ['https://news.google.com/news?cf=all&hl=en&pz=1&ned=us&q=artificial+intelligence&output=rss', 'https://www.feedspot.com/?followfeedid=4611157', 'https://news.google.com/news/section?q=artificial%20intelligence&tbm=nws&*'], ['https://www.youtube.com/feeds/videos.xml?channel_id=UCEqgmyWChwvt6MFGGlmUQCQ', 'https://www.feedspot.com/?followfeedid=4611505', 'https://www.youtube.com/channel/UCEqgmyWChwvt6MFGGlmUQCQ/videos']]
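
Each inner list in the output above appears to follow the order feed URL, Feedspot follow link, site URL. Assuming that order holds for every block (an assumption drawn only from this output), a small sketch for keeping just the "Site" link could look like this. The `site_links` helper and its `skip_host` parameter are hypothetical names introduced for this example.

```python
from urllib.parse import urlparse


def site_links(blocks, skip_host="www.feedspot.com"):
    """Pick the last non-Feedspot href from each block (hypothetical helper)."""
    sites = []
    for hrefs in blocks:
        # Ignore Feedspot's own follow links, then take the last remaining href.
        candidates = [h for h in hrefs if urlparse(h).netloc != skip_host]
        if candidates:
            sites.append(candidates[-1])
    return sites


# Two blocks copied from the output above.
blocks = [
    ['http://aitrends.com/feed', 'https://www.feedspot.com/?followfeedid=4611684', 'http://aitrends.com/'],
    ['https://chatbotslife.com/feed', 'https://www.feedspot.com/?followfeedid=4504512', 'https://chatbotslife.com/'],
]

print(site_links(blocks))
```

Passing the full `results` list from the code above instead of `blocks` would give one site URL per block, which can then be requested in a loop.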

1 vote

The original content of this page is provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/57446218
