Regex to extract URLs while excluding those that contain an unwanted word

urls = [ '<a href=https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw>Download Weather File</a>', '<a href=https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw>Download Weather File</a>' ]

https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw

3 Answers

User

Answer 1 · edited 2024-10-03 04:32:35

Here is a solution that uses "The Greatest Regex Trick Ever":

import re

for url in urls:
    match = re.search(r'href=[\'"]?(?:[^\'" >]*SWEC[^\'" >]*|([^\'" >]+))', url)
    # group(1) is None when the SWEC branch matched, so test it directly
    if match and match.group(1):
        url = match.group(1)
        print(url)

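The trick is to match what you don't want first, then capture what you do want. The pattern still matches URLs that contain SWEC, but for those the capture group is empty (in Python, group(1) is None), so your code has to handle that case.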
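The handling described above can be sketched as a small helper. This is only an illustration under a few assumptions: the function name `extract_wanted` and the compiled `PATTERN` constant are made up for this example, and the input is the `urls` list from the question.

```python
import re

# Same alternation as in the answer: the first branch consumes URLs that
# contain SWEC without capturing; only the second branch fills group 1.
PATTERN = re.compile(r'href=[\'"]?(?:[^\'" >]*SWEC[^\'" >]*|([^\'" >]+))')

def extract_wanted(anchors):
    """Return the captured URLs, skipping matches whose group 1 is None."""
    wanted = []
    for anchor in anchors:
        match = PATTERN.search(anchor)
        # A match with group(1) == None means the unwanted branch was taken.
        if match and match.group(1):
            wanted.append(match.group(1))
    return wanted

urls = [
    '<a href=https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw>Download Weather File</a>',
    '<a href=https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw>Download Weather File</a>'
]

print(extract_wanted(urls))
```

Collecting results into a new list, rather than reassigning the loop variable, keeps the wanted URLs available after the loop ends.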

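User

Answer 2 · edited 2024-10-03 04:32:35

You can add .* to the negative lookahead, giving (?!.*SWEC), so that the regex asserts that the word SWEC does not appear anywhere after the current position on the same line (. matches any character except a newline). This negative lookahead does not have to go inside your capture group, but it helps reduce the number of steps needed to find a valid match.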

import re

urls = [
    '<a href=https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw>Download Weather File</a>',
    '<a href=https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw>Download Weather File</a>'
]

for url in urls:
    match = re.search(r'href=[\'"]?((?!.*SWEC)[^\'" >]+)', url)
    if match:
        url = match.group(1)
        print(url)

# https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw

Regex101 example
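One thing to keep in mind with (?!.*SWEC) is that .* scans the rest of the line, so SWEC in the link text would also reject the match. If that matters, the lookahead can be limited to the URL itself by reusing the same character class. The sketch below is a variation, not part of the answer above; the names `BROAD` and `SCOPED` and the example.com anchor are made up for illustration.

```python
import re

# Lookahead as in the answer: rejects if SWEC appears anywhere later on the line.
BROAD = re.compile(r'href=[\'"]?((?!.*SWEC)[^\'" >]+)')

# Variation: the lookahead only scans characters that can belong to the URL.
SCOPED = re.compile(r'href=[\'"]?((?![^\'" >]*SWEC)[^\'" >]+)')

# Hypothetical anchor: SWEC is in the link text, not in the URL.
anchor = '<a href=https://example.com/file.epw>SWEC notes</a>'

broad_match = BROAD.search(anchor)
scoped_match = SCOPED.search(anchor)

print(broad_match)            # no match: the link text contains SWEC
print(scoped_match.group(1))  # the URL is still extracted
```

For the two anchors in the question both patterns behave the same, since SWEC only ever appears inside the URL.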

User

Answer 3 · edited 2024-10-03 04:32:35

You may not need a regex here at all. For example, try this:

# list of urls
urls = [
    '<a href=https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw>Download Weather File</a>',
    '<a href=https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw>Download Weather File</a>'
]

# check length of list (2)
print(len(urls))

# loop through a copy of the list; removing items from a list while iterating over it can skip elements
for i, url in enumerate(list(urls)):
#for url in list(urls): #if you remove the printing you can revert to this and delete the above enumerate line
    #check if the substring 'SWEC' is in the current element of the list
    if 'SWEC' in url:
        #if so delete that element
        urls.remove(url)
        #print a message to say it's been deleted
        print('Found.  Removing item ' + str(i))

# recheck the length of the list (1)
print(len(urls))

Or even:

urls = [x for x in urls if 'SWEC' not in x]
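The filter above keeps the whole anchor string, not just the URL. If the bare URL is what you need, the same substring check can be combined with plain string methods. This is a minimal sketch that assumes href is the last attribute in the tag, as in the question's data; the helper name `href_value` is made up for this example.

```python
def href_value(anchor):
    """Return the text between 'href=' and the next '>', without quotes."""
    # partition() splits on the first occurrence and never raises.
    _, _, rest = anchor.partition('href=')
    return rest.partition('>')[0].strip('\'"')

urls = [
    '<a href=https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw>Download Weather File</a>',
    '<a href=https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw>Download Weather File</a>'
]

# Filter out anchors containing SWEC, then pull the URL out of the rest.
links = [href_value(a) for a in urls if 'SWEC' not in a]
print(links)
```

For arbitrary HTML an actual parser would be more robust than string splitting; this only fits input shaped like the list in the question.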
