
Q: How to edit a link stored in a list

Stack Overflow user

Asked on 2022-02-21 19:34:52

Answers: 1 · Views: 25 · Followers: 0 · Votes: -1

import re

import requests
from bs4 import BeautifulSoup


def getHTMLdocument(url):
    response = requests.get(url)
    return response.text


url_to_scrape = 'https://www.parliament.gov.sg/about-us/structure/the-cabinet'
links = []

while True:

    html_document = getHTMLdocument(url_to_scrape)
    soup = BeautifulSoup(html_document, 'lxml')

    if soup.find_all('a', attrs={'href': re.compile("/details/")}) == []:
        break

    for link in soup.find_all('a', attrs={'href': re.compile("/details/")}):
        if link.get('href') not in links:
            links.append(link.get('href'))
            print(links)

This is the code I currently have. It gives me the following list as output:

'/mps/current-list-of-mps/mp/details/lee-hsien-loong', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/heng-swee-keat', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/teo-chee-hean', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/tharman-shanmugaratnam', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/ng-eng-hen', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/vivian-balakrishnan', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/k-shanmugam', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/gan-kim-yong', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/s-iswaran', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/grace-fu-hai-yien', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/chan-chun-sing', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/lawrence-wong', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/masagos-zulkifli-bin-masagos-mohamad', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/ong-ye-kung', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/desmond-lee', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/josephine-teo', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/indranee-rajah', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/mohamad-maliki-bin-osman', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/edwin-tong-chun-fai', 'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/tan-see-leng'

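In the next part of my code I try to scrape data from each link. However, the first link in the list does not come out as a valid URL, so I cannot fetch any information from it.

How can I edit it so that it has the same form as the other URLs in the list?

Many thanks.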

list

url

web-scraping

beautifulsoup

python

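1 Answer

Stack Overflow user

Accepted answer

Posted on 2022-02-21 19:42:52

Before adding the string to the list, you can check whether it has the correct format and fix it if needed, using this code: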

def correct_url(url):
    # Prepend the site root when the href is a relative path
    if not url.startswith('https://www.parliament.gov.sg'):
        url = f'https://www.parliament.gov.sg{url}'
    return url

The for loop, updated to use the new function:

for link in soup.find_all('a', attrs={'href': re.compile("/details/")}):
    # Normalize first so the duplicate check compares absolute URLs
    url = correct_url(link.get('href'))
    if url not in links:
        links.append(url)
        print(links)
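
As an alternative to the string-prefix check above, the standard library's `urllib.parse.urljoin` can resolve relative hrefs against a base URL and leaves hrefs that are already absolute unchanged. The following is only a minimal sketch: `BASE_URL`, `normalize_links`, and the sample `hrefs` list are hypothetical names and data introduced for illustration, not part of the original question or answer.

```python
from urllib.parse import urljoin

# Assumption: the site root, taken from the URLs shown in the question.
BASE_URL = 'https://www.parliament.gov.sg'


def normalize_links(hrefs, base=BASE_URL):
    """Return absolute URLs in first-seen order, without duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # Relative paths are resolved against base; absolute URLs pass through.
        absolute = urljoin(base, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Illustrative input: one relative href, one absolute href, and an
# absolute duplicate of the first entry.
hrefs = [
    '/mps/current-list-of-mps/mp/details/lee-hsien-loong',
    'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/heng-swee-keat',
    'https://www.parliament.gov.sg/mps/current-list-of-mps/mp/details/lee-hsien-loong',
]
links = normalize_links(hrefs)
print(links)
```

In the scraping loop, the same idea would amount to calling `urljoin(BASE_URL, link.get('href'))` before the duplicate check, so that relative and absolute forms of the same link are treated as one entry.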

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/71212051
