First, I scrape the page with Beautiful Soup:
answers = soup.select('body > div.chg-body.no-nav.no-subnav.header-nav > div.chg-container.center-content > div.chg-container-content > div.chg-global-content > div > div.parent-container.question-headline > div.main-content.question-page > div.dialog-question > div.answers-wrap > ul > li > div > div.txt-body.answer-body > div.answer-given-body.ugc-base')
After scraping I get the data, but some of the links in it look like this:
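src="//d2vlcm61l7u1fs.cloudfront.net/media%2F54b%2F54b505c2-d4e1-4745-8ab3-572866550500%2FphpvfFCYU.png"
After I save the scraped data as HTML, the images do not display on the page because the URL starts with //. How can I check whether a URL has https and add https: when it does not?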
This is how the HTML document looks (image not shown).
Please help me check whether the URLs in the scraped data start with https and, if they do not, prepend "https:".
Answered 2020-07-24 03:51:06
You can use str.startswith() to check whether the src attribute starts with "//" and, if it does, prepend "https:".
For example:
from bs4 import BeautifulSoup

html_text = '''
<div>
<img src="//d2vlcm61l7u1fs.cloudfront.net/media%2F54b%2F54b505c2-d4e1-4745-8ab3-572866550500%2FphpvfFCYU.png" />
</div>
'''

soup = BeautifulSoup(html_text, 'html.parser')

for img in soup.select('img'):
    # A protocol-relative URL starts with "//", so add the scheme in front of it
    if img['src'].startswith('//'):
        print('https:' + img['src'])
    else:
        print(img['src'])

Prints:
https://d2vlcm61l7u1fs.cloudfront.net/media%2F54b%2F54b505c2-d4e1-4745-8ab3-572866550500%2FphpvfFCYU.png
https://stackoverflow.com/questions/63061830
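The answer only prints the corrected URL, while the question is about the saved HTML file. As a minimal sketch, the same startswith() check can be wrapped in a small helper; the name ensure_https and the example.com URLs are assumptions made for illustration, not part of the original answer.

```python
def ensure_https(src):
    """Return src with 'https:' prepended if it is protocol-relative.

    Hypothetical helper: URLs that already carry a scheme, and
    ordinary relative paths, are returned unchanged.
    """
    if src.startswith('//'):
        return 'https:' + src
    return src


# Made-up sample URLs covering the three cases
urls = [
    '//example.com/a.png',         # protocol-relative, gets fixed
    'https://example.com/b.png',   # already has https, left alone
    'images/c.png',                # relative path, left alone
]
fixed = [ensure_https(u) for u in urls]
print(fixed)
```

Inside the loop over soup.select('img'), assigning img['src'] = ensure_https(img['src']) updates the tag itself, so writing str(soup) to a file afterwards should save HTML whose images load.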