Does anyone know of a library for fixing "broken" URLs? When I try to open a URL such as

http://www.domain.com/../page.html
http://www.domain.com//page.html
http://www.domain.com/page.html#stuff

urllib2.urlopen chokes and gives me an HTTPError traceback. Does anyone know of a library that can fix these kinds of problems?
Posted on 2009-09-17 03:10:32
How about something like this?
import re
import urlparse

urls = '''
http://www.domain.com/../page.html
http://www.domain.com//page.html
http://www.domain.com/page.html#stuff
'''.split()

def main():
    for u in urls:
        pieces = list(urlparse.urlparse(u))
        # strip any leading '.' and '/' characters from the path
        pieces[2] = re.sub(r'^[./]*', '/', pieces[2])
        # drop the '#fragment'
        pieces[-1] = ''
        print urlparse.urlunparse(pieces)

main()

As you'd hope, this emits:
http://www.domain.com/page.html
http://www.domain.com/page.html
http://www.domain.com/page.html

If I understand you correctly, this looks roughly like what you need.
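On Python 3, the urlparse module has moved to urllib.parse and print is a function. A rough modern equivalent is sketched below; note that the slash-collapsing regex and the posixpath.normpath call are my own assumptions (they resolve '..' segments anywhere in the path, not only at the start, which the original regex does not attempt):

```python
import re
import posixpath
from urllib.parse import urlparse, urlunparse

def clean_url(url):
    """Sketch: collapse '//' runs, resolve '..' segments, drop the fragment."""
    pieces = list(urlparse(url))
    if pieces[2]:
        # collapse runs of slashes, then let posixpath resolve '.'/'..'
        path = posixpath.normpath(re.sub(r'/+', '/', pieces[2]))
        # normpath may strip the leading slash on relative-looking paths
        pieces[2] = path if path.startswith('/') else '/' + path
    pieces[5] = ''  # drop the '#fragment'
    return urlunparse(pieces)

for u in ['http://www.domain.com/../page.html',
          'http://www.domain.com//page.html',
          'http://www.domain.com/page.html#stuff']:
    print(clean_url(u))
```

As with the answer above, all three inputs come out as http://www.domain.com/page.html.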
https://stackoverflow.com/questions/1436382