Q: How do I get all external links on a page using BeautifulSoup?

Stack Overflow user

Asked 2018-09-23 18:19:17

2 answers, 1.4K views, 0 followers, 1 vote

I am reading the book "Web Scraping with Python", which has the following function to retrieve the external links on a page:

#Retrieves a list of all external links found on a page
def getExternalLinks(bs, excludeUrl):
    externalLinks = []
    #Finds all links that start with "http" that do
    #not contain the current URL
    for link in bs.find_all('a', {'href' : re.compile('^(http|www)((?!'+excludeUrl+').)*$')}):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

The problem is that it does not work the way it should. When I run it with the URL http://www.oreilly.com, it returns the following:

bs = makeSoup('https://www.oreilly.com') # Makes a BeautifulSoup Object
getExternalLinks(bs, 'https://www.oreilly.com')

Output:

['https://www.oreilly.com',
 'https://oreilly.com/sign-in.html',
 'https://oreilly.com/online-learning/try-now.html',
 'https://oreilly.com/online-learning/index.html',
 'https://oreilly.com/online-learning/individuals.html',
 'https://oreilly.com/online-learning/teams.html',
 'https://oreilly.com/online-learning/enterprise.html',
 'https://oreilly.com/online-learning/government.html',
 'https://oreilly.com/online-learning/academic.html',
 'https://oreilly.com/online-learning/pricing.html',
 'https://www.oreilly.com/partner/reseller-program.html',
 'https://oreilly.com/conferences/',
 'https://oreilly.com/ideas/',
 'https://oreilly.com/about/approach.html',
 'https://www.oreilly.com/conferences/',
 'https://conferences.oreilly.com/velocity/vl-ny',
 'https://conferences.oreilly.com/artificial-intelligence/ai-eu',
 'https://www.safaribooksonline.com/public/free-trial/',
 'https://www.safaribooksonline.com/team-setup/',
 'https://www.oreilly.com/online-learning/enterprise.html',
 'https://www.oreilly.com/about/approach.html',
 'https://conferences.oreilly.com/software-architecture/sa-eu',
 'https://conferences.oreilly.com/velocity/vl-eu',
 'https://conferences.oreilly.com/software-architecture/sa-ny',
 'https://conferences.oreilly.com/strata/strata-ca',
 'http://shop.oreilly.com/category/customer-service.do',
 'https://twitter.com/oreillymedia',
 'https://www.facebook.com/OReilly/',
 'https://www.linkedin.com/company/oreilly-media',
 'https://www.youtube.com/user/OreillyMedia',
 'https://www.oreilly.com/emails/newsletters/',
 'https://itunes.apple.com/us/app/safari-to-go/id881697395',
 'https://play.google.com/store/apps/details?id=com.safariflow.queue']

Question:

Why are the first 16-17 entries considered "external links"? They belong to the same domain as http://www.oreilly.com.

beautifulsoup

python

web-scraping

2 Answers

Stack Overflow user

Accepted answer

Posted 2021-08-05 13:58:01

import re
from urllib.parse import urlsplit
from urllib.request import urlopen

from bs4 import BeautifulSoup

ext = set()  # external links collected so far

def getExt(url):
    # Split the page URL so its host (netloc) can be compared against each link
    o = urlsplit(url)
    html = urlopen(url)
    bs = BeautifulSoup(html, 'html.parser')
    # Only consider absolute http:// or https:// links
    for link in bs.find_all('a', href=re.compile('^https?://')):
        # Substring test: any link containing the page's host is skipped,
        # which also skips subdomains such as conferences.oreilly.com
        if o.netloc in link.attrs['href']:
            continue
        ext.add(link.attrs['href'])

getExt('https://oreilly.com/')
for i in ext:
    print(i)
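
The substring test above also skips any link that merely mentions the host somewhere in its path or query string. A minimal sketch of a stricter check, comparing parsed hostnames instead, might look like the following. The helper names `normalize_host` and `is_external` are hypothetical, and treating `www.` and subdomains as part of the same site is an assumption rather than something the answer states.

```python
from urllib.parse import urljoin, urlsplit


def normalize_host(host):
    """Lower-case a hostname and drop a leading 'www.' (a simplifying assumption)."""
    host = (host or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_external(href, page_url):
    """Return True if href points to a host outside page_url's site."""
    parts = urlsplit(urljoin(page_url, href))  # resolve relative links first
    if parts.scheme not in ("http", "https"):
        return False  # mailto:, javascript:, etc. are not counted as external links
    base = normalize_host(urlsplit(page_url).hostname)
    host = normalize_host(parts.hostname)
    # The same host, or any subdomain of it, counts as internal
    return not (host == base or host.endswith("." + base))


page = "https://www.oreilly.com"
hrefs = [
    "https://oreilly.com/sign-in.html",
    "https://conferences.oreilly.com/velocity/vl-ny",
    "http://shop.oreilly.com/category/customer-service.do",
    "https://twitter.com/oreillymedia",
    "/about/approach.html",
]
external = [h for h in hrefs if is_external(h, page)]
print(external)
```

In the loop above, `is_external(link.attrs['href'], url)` could stand in for the `o.netloc in ...` test. Whether subdomains should count as internal depends on what "external" means for the crawl at hand.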

Score: 0

Stack Overflow user

Posted 2018-09-25 07:49:38

There is a difference between these two:

http://www.oreilly.com
https://www.oreilly.com

Hope you see what I mean.

Score: 1
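
One way to read this hint: the pattern in the question only rejects links that contain the exact `excludeUrl` string, so a different scheme or a missing `www.` slips through. The sketch below rebuilds that pattern with plain `re` (no network access) to show this. It also shows a second effect that is not mentioned in the answer: the leading `(http|www)` group consumes characters before the first lookahead runs, so even the excluded URL itself matches. The `fixed_pattern` variant is an illustrative assumption, not the book's code.

```python
import re

exclude = "https://www.oreilly.com"

# Pattern built the same way as in the question
book_pattern = re.compile('^(http|www)((?!' + exclude + ').)*$')

# Different scheme: the excluded string never occurs, so the link matches
scheme_differs = bool(book_pattern.search("http://www.oreilly.com/ideas/"))
# No "www.": again the excluded string never occurs
no_www = bool(book_pattern.search("https://oreilly.com/sign-in.html"))
# The excluded URL itself: "http" is consumed first, so no lookahead
# ever starts at position 0 and the full string is never seen
exact = bool(book_pattern.search("https://www.oreilly.com"))
print(scheme_differs, no_www, exact)

# Illustrative variant: test the prefix with a lookahead instead of consuming
# it, and escape the URL so that "." is matched literally
fixed_pattern = re.compile('^(?=http|www)((?!' + re.escape(exclude) + ').)*$')
fixed_exact = bool(fixed_pattern.search("https://www.oreilly.com/conferences/"))
fixed_other = bool(fixed_pattern.search("https://twitter.com/oreillymedia"))
fixed_scheme = bool(fixed_pattern.search("http://www.oreilly.com/ideas/"))
print(fixed_exact, fixed_other, fixed_scheme)
```

Even the variant still accepts `http://www.oreilly.com/ideas/` and `https://oreilly.com/...`, so string matching on the full URL remains fragile. Comparing parsed hostnames is likely the more robust route.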

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/52468995
