Regex to extract URLs, excluding those that contain an unwanted word

Posted 2024-10-03 04:32:35


I currently have strings like the following:

urls = [
    '<a href=https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw>Download Weather File</a>',
    '<a href=https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw>Download Weather File</a>'
]

and a regex search like this:

import re

for url in urls:
    match = re.search(r'href=[\'"]?([^\'" >]+)', url)
    if match:
        url = match.group(1)

url returns:

https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw
https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw

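I want to filter out URLs that contain the word SWEC, so that the second URL string does not match. I think this might involve (?!SWEC), but even if that is right, I am not sure how to incorporate it into my current regex search.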

I would appreciate any suggestions.


3 Answers

Here is a solution that uses "The Greatest Regex Trick Ever":

for url in urls:
    match = re.search(r'href=[\'"]?(?:[^\'" >]*SWEC[^\'" >]*|([^\'" >]+))', url)
    # group(1) is None when the SWEC alternative matched, so test it directly
    if match and match.group(1):
        url = match.group(1)

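The trick is to match what you do not want first, then capture what you do want. The pattern still matches URLs that contain SWEC, but for those the capture group is empty (None in Python), so your code has to handle that case.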
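As a minimal, self-contained sketch of the trick described above: the first alternative consumes URLs that contain SWEC without capturing anything, so `group(1)` is `None` for them. The helper name `extract_wanted` and the shortened example.com URLs are placeholders for illustration, not part of the original answer.

```python
import re

# Left alternative: matches a URL containing SWEC but captures nothing.
# Right alternative: captures any other URL in group 1.
PATTERN = re.compile(r'href=[\'"]?(?:[^\'" >]*SWEC[^\'" >]*|([^\'" >]+))')

def extract_wanted(html_snippets):
    """Return href values that do not contain 'SWEC' (hypothetical helper)."""
    wanted = []
    for snippet in html_snippets:
        match = PATTERN.search(snippet)
        # group(1) is None when the SWEC alternative matched
        if match and match.group(1):
            wanted.append(match.group(1))
    return wanted

# Placeholder inputs, shortened from the question's data
urls = [
    '<a href=https://example.com/IND_Kota.424520_ISHRAE.epw>Download</a>',
    '<a href=https://example.com/ESP_Alicante.083600_SWEC.epw>Download</a>',
]
print(extract_wanted(urls))
```

Only the first URL should survive, because the second one is swallowed by the non-capturing alternative.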

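You can add .* to the negative lookahead, giving (?!.*SWEC), so that the regex asserts that the text ahead is not any run of characters (other than newlines) followed by the word SWEC. This negative lookahead does not need to go inside your capture group, but it helps reduce the number of steps needed to find a valid match.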

import re

urls = [
    '<a href=https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw>Download Weather File</a>',
    '<a href=https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw>Download Weather File</a>'
]

for url in urls:
    match = re.search(r'href=[\'"]?((?!.*SWEC)[^\'" >]+)', url)
    if match:
        url = match.group(1)
        print(url)

# https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw

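One caveat, offered as a hedged sketch rather than as part of the original answer: (?!.*SWEC) scans the rest of the line, so a URL is also rejected when SWEC appears only in the link text after it. Assuming that is not wanted, the lookahead can be limited to the URL's own characters by reusing the same character class. The sample strings and the names `BOUNDED` and `UNBOUNDED` are made up for illustration.

```python
import re

# Lookahead restricted to URL characters: only SWEC inside the href value rejects it.
BOUNDED = re.compile(r'href=[\'"]?((?![^\'" >]*SWEC)[^\'" >]+)')
# Form used in the answer above: SWEC anywhere later on the line rejects it.
UNBOUNDED = re.compile(r'href=[\'"]?((?!.*SWEC)[^\'" >]+)')

# Hypothetical inputs
samples = [
    '<a href=https://example.com/a.epw>SWEC data</a>',      # SWEC only in the link text
    '<a href=https://example.com/b_SWEC.epw>Download</a>',  # SWEC in the URL itself
]

for s in samples:
    b = BOUNDED.search(s)
    u = UNBOUNDED.search(s)
    print(b.group(1) if b else None, '|', u.group(1) if u else None)
```

With these inputs the bounded pattern keeps the first URL while the unbounded one rejects it; both reject the second.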

You may not need a regular expression here. For example, try this:

# list of urls
urls = [
    '<a href=https://energyplus.net/weather-download/asia_wmo_region_2/IND//IND_Kota.424520_ISHRAE/IND_Kota.424520_ISHRAE.epw>Download Weather File</a>',
    '<a href=https://energyplus.net/weather-download/europe_wmo_region_6/ESP//ESP_Alicante.083600_SWEC/ESP_Alicante.083600_SWEC.epw>Download Weather File</a>'
]

# check length of list (2)
print(len(urls))

# loop over a copy of the list: removing items from a list while
# iterating over that same list can skip elements
for i, url in enumerate(list(urls)):
# for url in list(urls):  # if you remove the printing you can use this instead of the enumerate line
    # check if the substring 'SWEC' is in the current element of the list
    if 'SWEC' in url:
        # if so, delete that element from the original list
        urls.remove(url)
        # print a message to say it's been deleted (i is the index in the original list)
        print('Found.  Removing item ' + str(i))

# recheck the length of the list (1)
print(len(urls))

Or even:

urls = [x for x in urls if 'SWEC' not in x]
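Since the question also needs the bare URL pulled out of each anchor tag, the substring filter can be combined with the question's original extraction regex. This is only a sketch, assuming `urls` holds `<a href=...>` strings as in the question; the names `HREF`, `extracted`, and `kept` are illustrative and the example.com URLs are shortened placeholders.

```python
import re

# Placeholder inputs, shortened from the question's data
urls = [
    '<a href=https://example.com/IND_Kota.424520_ISHRAE.epw>Download Weather File</a>',
    '<a href=https://example.com/ESP_Alicante.083600_SWEC.epw>Download Weather File</a>',
]

# Same extraction pattern as in the question
HREF = re.compile(r'href=[\'"]?([^\'" >]+)')

# Extract every href value first, then drop the ones containing 'SWEC'.
extracted = [m.group(1) for m in map(HREF.search, urls) if m]
kept = [u for u in extracted if 'SWEC' not in u]
print(kept)
```

Keeping extraction and filtering as separate steps leaves the regex simple and makes the exclusion rule easy to change later.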
