Regular expression to extract all URLs from a string

import re

strings = "http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/"
links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strings)
print(links)  # the result is always the same as strings

3 answers

User

Answer 1 · edited 2024-09-28 01:32:59

Your problem is that http:// is accepted as a valid part of a URL. That happens because of this token:

[$-_@.&+]

Or, more specifically:

$-_

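Inside a character class this is a range: it matches every character from $ to _ (ASCII 0x24 through 0x5F), which covers far more characters than you probably intended, including /, :, the digits, and the uppercase letters.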
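To make the size of that range concrete, here is a minimal sketch that lists every printable ASCII character the class from the question accepts. The names `char_class` and `accepted` are only illustrative and are not part of the original code.

```python
import re

# The class from the question: "$-_" is read as a range, not as three literal characters.
char_class = re.compile(r"[$-_@.&+]")

# Collect every printable ASCII character that the class accepts.
accepted = [chr(c) for c in range(0x20, 0x7F) if char_class.fullmatch(chr(c))]

print("".join(accepted))
# "/" and ":" fall inside the range, so "http://" can be consumed in the middle of a match.
print("/" in accepted, ":" in accepted)
```

Because both ":" and "/" are accepted, nothing in the pattern stops a match from running straight through the start of the next URL.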

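You could change it to [$\-_@.&+], but that causes a new problem: the / character no longer matches. So add it back explicitly with [$\-_@.&+/]. That, however, causes yet another problem, because http://example.com/path/topage.htmlhttp is now treated as a valid match.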
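The intermediate state can be checked with a short sketch. It assumes the sample string from the question, split across lines only for readability, and uses the hypothetical name `intermediate` for the pattern with the escaped hyphen and the added slash but no lookahead yet.

```python
import re

# Sample input from the question (adjacent literals are concatenated by Python).
strings = ("http://example.com/path/topage.html"
           "http://twitter.com/p/xyanhs"
           "http://httpget.org/get.zip"
           "www.google.com/privacy.html"
           "https://goodurl.net/")

# Hyphen escaped and "/" added, but still no lookahead.
intermediate = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$\-_@.&+/]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

matches = re.findall(intermediate, strings)
print(matches[0])  # the first match runs on into the "http" of the next URL
```

The match only stops at the ":" of the following scheme, so the letters "http" of the next URL are swallowed and that URL can no longer be matched on its own.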

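The final step is to add a negative lookahead to make sure you are not matching http:// or https://, which happens to be exactly the first part of your regex!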

http[s]?://(?:(?!http[s]?://)[a-zA-Z]|[0-9]|[$\-_@.&+/]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
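As a rough check, the pattern above can be run against the sample string from the question. This is only a sketch; the variable name `pattern` is an assumption made for illustration.

```python
import re

# Sample input from the question (adjacent literals are concatenated by Python).
strings = ("http://example.com/path/topage.html"
           "http://twitter.com/p/xyanhs"
           "http://httpget.org/get.zip"
           "www.google.com/privacy.html"
           "https://goodurl.net/")

# The negative lookahead refuses to consume a letter that starts a new "http://" or "https://".
pattern = r"http[s]?://(?:(?!http[s]?://)[a-zA-Z]|[0-9]|[$\-_@.&+/]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

links = re.findall(pattern, strings)
for link in links:
    print(link)
```

Note that the pattern only recognizes the http and https schemes, so www.google.com/privacy.html stays attached to the URL in front of it rather than being returned separately.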


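User

Answer 2 · edited 2024-09-28 01:32:59

The problem is that your regex pattern is too inclusive: it swallows all of the URLs in a single match. You can prevent that with a lookahead, written (?=...).

Try this: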

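# findall returns a tuple of groups per match; the first group is the full URL
urls = [m[0] for m in re.findall(r"((www\.|http://|https://)(www\.)*.*?(?=(www\.|http://|https://|$)))", strings)]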
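Because re.findall returns a tuple of groups whenever the pattern contains capturing groups, a variant with non-capturing groups may be easier to work with. The sketch below is one possible rewrite of this answer's pattern, not the answerer's own code, and it assumes the sample string from the question.

```python
import re

# Sample input from the question (adjacent literals are concatenated by Python).
strings = ("http://example.com/path/topage.html"
           "http://twitter.com/p/xyanhs"
           "http://httpget.org/get.zip"
           "www.google.com/privacy.html"
           "https://goodurl.net/")

# Same idea with (?:...) groups: start at a prefix, then lazily consume text
# until the next prefix (or the end of the string) is about to begin.
pattern = r"(?:www\.|http://|https://)(?:www\.)*.*?(?=www\.|http://|https://|$)"

urls = re.findall(pattern, strings)
for url in urls:
    print(url)
```

Since www. is treated as a prefix here, www.google.com/privacy.html comes out as its own entry. The trade-off is that any www. appearing later inside a URL would also end the current match.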

User

Answer 3 · edited 2024-09-28 01:32:59

A simple, uncomplicated answer:

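import re

# strings is the input string from the question
url_list = []

# Split on "http://" first, then split each piece on "https://"
for x in re.split("http://", strings):
    url_list.append(re.split("https://", x))

# Flatten the list of lists and drop empty pieces
url_list = [item for sublist in url_list for item in sublist if item]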

If you want the strings http:// and https:// added back onto the URLs, change the code accordingly. I hope this gets the idea across.
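One possible way to keep the schemes, sketched here under the assumption that the input is the sample string from the question, is to split on a capturing group so that re.split returns the delimiters as well. The names `parts` and `url_list` are illustrative.

```python
import re

# Sample input from the question (adjacent literals are concatenated by Python).
strings = ("http://example.com/path/topage.html"
           "http://twitter.com/p/xyanhs"
           "http://httpget.org/get.zip"
           "www.google.com/privacy.html"
           "https://goodurl.net/")

# A capturing group makes re.split keep the delimiters in the result.
parts = re.split(r"(https?://)", strings)

# parts[0] is the text before the first scheme; after that, schemes and bodies alternate.
url_list = [scheme + body for scheme, body in zip(parts[1::2], parts[2::2])]

for url in url_list:
    print(url)
```

As with any approach that splits only on the scheme, www.google.com/privacy.html remains attached to the URL before it.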
