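import re

import requests
from bs4 import BeautifulSoup

def getHTMLdocument(url):
    # Download the page and return its HTML as text.
    response = requests.get(url)
    return response.text

url_to_scrape = 'https://www.parliament.gov.sg/about-us/structure/the-cabinet'
links = []
while True:
    html_document = getHTMLdocument(url_to_scrape)
    soup = BeautifulSoup(html_document, 'lxml')
    # Stop if the page contains no profile links at all.
    if soup.find_all('a', attrs={'href': re.compile("/details/")}) == []:
        break
    # Collect each distinct href that points to a details page.
    for link in soup.find_all('a', attrs={'href': re.compile("/details/")}):
        if link.get('href') not in links:
            links.append(link.get('href'))
    break  # only one page is scraped, so stop after the first pass
print(links)

This is the code I have so far. It gives me the following list as output: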
'/mps/current-list-of-mps/mp/details/lee-hsien-loong',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/heng-swee-keat',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/teo-chee-hean',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/tharman-shanmugaratnam',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/ng-eng-hen',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/vivian-balakrishnan',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/k-shanmugam',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/gan-kim-yong',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/s-iswaran',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/grace-fu-hai-yien',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/chan-chun-sing',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/lawrence-wong',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/masagos-zulkifli-bin-masagos-mohamad',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/ong-ye-kung',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/desmond-lee',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/josephine-teo',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/indranee-rajah',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/mohamad-maliki-bin-osman',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/edwin-tong-chun-fai',
'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/tan-see-leng'
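
In the next part of my code I try to scrape data from each link. However, the first link in the list is not a valid URL, so I can't get any information from it.
How can I edit it so that it has the same form as the other URLs in the list?
Thanks very much.

Posted on 2022-02-21 19:42:52

Before adding a string to the list, you can check whether it has the correct format, and correct it if needed, with this code:
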
def correct_url(url):
    # Prefix site-relative paths with the host; leave absolute URLs unchanged.
    if not url.startswith('https://www.parliament.gov.sg'):
        url = f'https://www.parliament.gov.sg{url}'
    return url

The for loop using the new function:

for link in soup.find_all('a', attrs={'href': re.compile("/details/")}):
    url = correct_url(link.get('href'))
    # Compare the corrected URL, since that is what the list stores.
    if url not in links:
        links.append(url)
print(links)

https://stackoverflow.com/questions/71212051
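
As an alternative to prefixing the host by hand, the standard library's `urllib.parse.urljoin` resolves a relative href against a base URL and leaves an already absolute URL unchanged. The following is only a minimal sketch: it assumes the base is the page that was scraped, and the `normalize_links` helper and the `sample` list are hypothetical names introduced for illustration (the sample hrefs are copied from the output in the question).

```python
from urllib.parse import urljoin

# Assumption: hrefs are resolved against the page that was scraped.
BASE_URL = 'https://www.parliament.gov.sg/about-us/structure/the-cabinet'


def normalize_links(hrefs, base_url=BASE_URL):
    """Return absolute URLs in first-seen order, without duplicates."""
    links = []
    for href in hrefs:
        # Relative hrefs get the scheme and host of base_url;
        # absolute hrefs pass through unchanged.
        absolute = urljoin(base_url, href)
        if absolute not in links:
            links.append(absolute)
    return links


# Hypothetical sample: one relative href (listed twice) and one absolute href.
sample = [
    '/mps/current-list-of-mps/mp/details/lee-hsien-loong',
    'https://www.parliament.gov.sg/mps/list-of-current-mps/mp/details/heng-swee-keat',
    '/mps/current-list-of-mps/mp/details/lee-hsien-loong',
]
print(normalize_links(sample))
```

Because the duplicate check runs on the resolved URL, the same page is not added twice if it appears once as a relative path and once as an absolute URL.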