I have a string containing several concatenated URLs:
string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"
I want to extract all of them and get a result like this:
['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=','https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D','http%253A%252F%252Fwww.link-three.mu%252F']
I am trying:
import re
urls = [x for x in re.split('(http[s]?)', string) if x]
print(urls)
The result is:
['https', '://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=', 'https', '://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D', 'http', '%253A%252F%252Fwww.link-three.mu%252F']
Given that a URL can start with either "http" or "https", how can I get each complete URL?
Any ideas?
Answered 2018-02-08 04:43:20
Without using re, you can handle this as follows:
['http' + x for x in filter(lambda x: x, string.split('http'))]
The result will be:
['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=',
'https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D',
'http%253A%252F%252Fwww.link-three.mu%252F']
Answered 2018-02-08 04:31:10
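You can keep your current result and join each pair of consecutive items; that works:
urls = [urls[i]+urls[i+1] for i in range(0,len(urls),2)]
But it is better to use findall with a lookahead for https? or the end of the string: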
import re
string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"
print(re.findall("https?.*?(?=https?|$)",string))
Result:
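['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=',
'https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D',
'http%253A%252F%252Fwww.link-three.mu%252F']
As noted in the comments, you cannot add : to the delimiter (the last URL here is percent-encoded and starts with http%253A, not http:), so you cannot reliably tell where one URL ends and the next begins if any URL contains http elsewhere in its address.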
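To illustrate that caveat, here is a rough sketch of one possible workaround. It is not from the original answers, and it rests on an assumption: every nested URL's scheme is followed either by a literal : or by a percent-encoded colon (%3A, %253A, and so on). The names SCHEME, split_urls and sample are made up for this example, and the example.com / example.org addresses are hypothetical.

```python
import re

# Assumption: a real URL start is "http" or "https" followed by ':' or by a
# percent-encoded colon, possibly encoded several times (%3A, %253A, %25253A).
SCHEME = r"https?(?::|%(?:25)*3A)"


def split_urls(text):
    """Split concatenated URLs at each scheme marker described by SCHEME."""
    # Lazily consume characters until the next scheme marker or end of string.
    pattern = r"{0}.*?(?={0}|$)".format(SCHEME)
    return re.findall(pattern, text, flags=re.IGNORECASE)


# Hypothetical input whose first URL has "http" inside its path.
sample = "https://example.com/httpdocs/a?u=http%253A%252F%252Fexample.org%252F"

# The plain "https?" delimiter cuts the first URL at "httpdocs":
print(re.findall(r"https?.*?(?=https?|$)", sample))
# ['https://example.com/', 'httpdocs/a?u=', 'http%253A%252F%252Fexample.org%252F']

# The stricter delimiter keeps it whole:
print(split_urls(sample))
# ['https://example.com/httpdocs/a?u=', 'http%253A%252F%252Fexample.org%252F']
```

On the string from the question this should give the same three URLs as the findall above. It can still split in the wrong place if a URL legitimately contains text such as http: or http%3A that is not the start of a nested URL, so treat it as a heuristic.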
https://stackoverflow.com/questions/48672653