Currently, I have strings like these:
urls = [
'<a href=https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw>Download Weather File</a>',
'<a href=https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw>Download Weather File</a>'
]

A regex search over them looks like this:

import re

for url in urls:
    match = re.search(r'href=[\'"]?([^\'" >]+)', url)
    if match:
        url = match.group(1)

url then returns:
https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw
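https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw

I would like to filter out any URL that contains the word SWEC, so that the second URL string does not match. I think this might involve (?!SWEC), but even if that is right, I don't know how to incorporate it into my current regex search.
I would appreciate any advice.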
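Posted on 2021-07-24 13:52:17
Here is a solution using "The Greatest Regex Trick Ever":

import re

for url in urls:
    match = re.search(r'href=[\'"]?(?:[^\'" >]*SWEC[^\'" >]*|([^\'" >]+))', url)
    # group(1) is None when the SWEC branch matched, so test it for truthiness
    if match and match.group(1):
        url = match.group(1)

The trick is to match what you don't want first, then capture what you do want. The pattern still matches URLs containing SWEC, but in that case the capture group is empty (in Python, match.group(1) is None), so you need to adjust your code to handle that.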
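As a rough sketch of how the "match what you don't want, capture what you do" idea could be wrapped up for reuse, the snippet below collects only the captured URLs. The helper name extract_non_swec and the shortened example.com links are made up for illustration; only the pattern comes from the answer above.

```python
import re

# Pattern from the answer above: the first branch swallows URLs that contain
# SWEC without capturing; the second branch captures everything else.
PATTERN = re.compile(r'href=[\'"]?(?:[^\'" >]*SWEC[^\'" >]*|([^\'" >]+))')


def extract_non_swec(links):
    """Return the href targets that do not contain 'SWEC' (hypothetical helper)."""
    kept = []
    for link in links:
        match = PATTERN.search(link)
        # group(1) is None when the non-capturing SWEC branch matched
        if match and match.group(1):
            kept.append(match.group(1))
    return kept


# Made-up, shortened links standing in for the real ones
links = [
    '<a href=https://example.com/IND_Kota.424520_ISHRAE.epw>Download</a>',
    '<a href=https://example.com/ESP_Alicante.083600_SWEC.epw>Download</a>',
]
print(extract_non_swec(links))
# ['https://example.com/IND_Kota.424520_ISHRAE.epw']
```

Testing match.group(1) for truthiness covers both None and an empty string, so no separate length check is needed.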
Posted on 2021-07-24 01:51:17
You may not need a regex here. For example...
Try:
# list of urls
urls = [
'<a href=https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw>Download Weather File</a>',
'<a href=https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw>Download Weather File</a>'
]
# check length of list (2)
print(len(urls))
# loop through the list
for i, url in enumerate(list(urls)):  # iterate over a copy so that removing items does not skip elements
#for url in list(urls): #if you remove the printing you can revert to this and delete the above enumerate line
    #check if the substring 'SWEC' is in the current element of the list
    if 'SWEC' in url:
        #if so delete that element
        urls.remove(url)
        #print a message to say it's been deleted
        print('Found. Removing item ' + str(i))
# recheck the length of the list (1)
print(len(urls))
Or even:
urls = [x for x in urls if 'SWEC' not in x]
Posted on 2021-07-24 14:01:01
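You can add .* to the negative lookahead, giving (?!.*SWEC), so that the regex asserts that the word SWEC does not appear anywhere after the current position on the same line (. matches any character except a newline). The negative lookahead does not have to go inside the regex capture group, but it helps reduce the number of steps needed to find a valid match.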
import re
urls = [
'<a href=https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw>Download Weather File</a>',
'<a href=https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw>Download Weather File</a>'
]
for url in urls:
    match = re.search(r'href=[\'"]?((?!.*SWEC)[^\'" >]+)', url)
    if match:
        url = match.group(1)
        print(url)
# https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw
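One thing to be aware of: because .* runs to the end of the line, (?!.*SWEC) also rejects a link whose URL is fine but whose link text happens to mention SWEC. If that matters, a possible variant (my assumption, not part of the answer above) is to restrict the lookahead to the URL's own characters. The sketch below compares the two on a made-up link.

```python
import re

# Lookahead from the answer: SWEC anywhere in the rest of the line blocks the match
BROAD = re.compile(r'href=[\'"]?((?!.*SWEC)[^\'" >]+)')
# Assumed variant: the lookahead only scans characters that can belong to the URL
SCOPED = re.compile(r'href=[\'"]?((?![^\'" >]*SWEC)[^\'" >]+)')

# Made-up link: SWEC appears in the link text, not in the URL
link = '<a href=https://example.com/IND_Kota.epw>SWEC notes</a>'

broad = BROAD.search(link)
scoped = SCOPED.search(link)
print(broad.group(1) if broad else None)
# None
print(scoped.group(1) if scoped else None)
# https://example.com/IND_Kota.epw
```

For the two strings in the question both patterns behave the same, since SWEC only ever appears inside the URL there.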
https://stackoverflow.com/questions/68506298