How to edit a link stored in a list

Stack Overflow user
Asked on 2022-02-21 19:34:52
import re

import requests
from bs4 import BeautifulSoup


def getHTMLdocument(url):
    response = requests.get(url)
    return response.text


url_to_scrape = 'https://www.parliament.gov.sg/about-us/structure/the-cabinet'
links = []

html_document = getHTMLdocument(url_to_scrape)
soup = BeautifulSoup(html_document, 'lxml')

# Collect each unique "/details/" href once. The page is fetched a single
# time: the original `while True` loop never changed the URL, so it could
# not exit once links were found.
for link in soup.find_all('a', attrs={'href': re.compile("/details/")}):
    href = link.get('href')
    if href not in links:
        links.append(href)

print(links)

This is the code I have so far. It gives me the following list as output:

'/mps/current-list-of-mps/mp/details/lee-hsien-loong',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/heng-swee-keat',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/teo-chee-hean',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/tharman-shanmugaratnam',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/ng-eng-hen',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/vivian-balakrishnan',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/k-shanmugam',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/gan-kim-yong',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/s-iswaran',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/grace-fu-hai-yien',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/chan-chun-sing',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/lawrence-wong',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/masagos-zulkifli-bin-masagos-mohamad',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/ong-ye-kung',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/desmond-lee',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/josephine-teo',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/indranee-rajah',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/mohamad-maliki-bin-osman',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/edwin-tong-chun-fai',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/tan-see-leng'

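In the next part of my code I try to scrape data from each link. However, the first link in the list is not a valid URL (it is a relative path with no scheme or host), so I cannot fetch anything from it.

How can I edit it so that it has the same form as the other URLs in the list?

Many thanks.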


1 Answer

Stack Overflow user

Accepted answer

Answered on 2022-02-21 19:42:52

Before adding the string to the list, you can check whether it has the correct format, and correct it if needed, with this code:

def correct_url(url):
    # Prepend the site root when the href is relative (no scheme and host).
    if not url.startswith('https://www.parliament.gov.sg'):
        url = f'https://www.parliament.gov.sg{url}'
    return url
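As a quick check of this idea, the sketch below restates the function so that it runs on its own and applies it to the two shapes of href seen in the question's output. The variable names `relative` and `absolute` are only illustrative.

```python
SITE_ROOT = 'https://www.parliament.gov.sg'


def correct_url(url):
    # Relative hrefs start with '/', so the site root can be prepended as-is.
    if not url.startswith(SITE_ROOT):
        url = f'{SITE_ROOT}{url}'
    return url


# Sample values copied from the output shown in the question.
relative = '/mps/current-list-of-mps/mp/details/lee-hsien-loong'
absolute = 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/heng-swee-keat'

fixed_relative = correct_url(relative)
fixed_absolute = correct_url(absolute)

print(fixed_relative)
print(fixed_absolute)
```

The relative path gains the site root, while a link that is already absolute passes through unchanged.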

The for loop, using the new function:

for link in soup.find_all('a', attrs={'href': re.compile("/details/")}):
    # Normalize first, so the duplicate check compares absolute URLs.
    href = correct_url(link.get('href'))
    if href not in links:
        links.append(href)

print(links)
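A possible alternative, not part of the original answer, is to let the standard library's `urllib.parse.urljoin` resolve each href against the page URL, which avoids hard-coding the domain prefix. This is a minimal sketch: the helper name `absolutize` and the `sample` list are assumptions made for illustration, and the sample hrefs are taken from the question's output.

```python
from urllib.parse import urljoin

# The page the hrefs were scraped from (URL taken from the question).
BASE_URL = 'https://www.parliament.gov.sg/about-us/structure/the-cabinet'


def absolutize(hrefs, base_url=BASE_URL):
    """Return absolute, de-duplicated URLs in first-seen order (hypothetical helper)."""
    seen = []
    for href in hrefs:
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        full = urljoin(base_url, href)
        if full not in seen:
            seen.append(full)
    return seen


sample = [
    '/mps/current-list-of-mps/mp/details/lee-hsien-loong',
    'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/heng-swee-keat',
    '/mps/current-list-of-mps/mp/details/lee-hsien-loong',  # repeated on purpose
]

result = absolutize(sample)
print(result)
```

Because the duplicate check runs on the resolved URL, a relative href that appears more than once is stored only once.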
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/71212051
