Regex to find all URLs in a string that contain one substring but not another

hi, this is your link (but this one is bad formatted and useless): https://www.test.comhttps://app.test.com/a/b/c/5e20bed422e7880012ba8acc/next?param=1?locale=2 but there is a good link too: https://app.test.com/a/b/c/5e20bed422e7880012ba8acc/next?param=1?locale=2 and there are also other irrelevant links: http://www.google.com http://test.test.com

2 answers

User

#1 · edited 2024-10-08 18:30:04

How about something like this:

(https?:\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,6})(?=https?)(\S+)

Use the i flag so that the search is case-insensitive.

Test it here: https://regex101.com/r/J62XZq/2

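Explanation

https?:\/\/ matches http:// or https://.
[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,6} matches a valid-looking domain name. I have not checked whether it is truly bulletproof, but it does not seem bad. There may be an official regex for validating domain names that could be used instead. The (?:) group is non-capturing, since we do not need its contents.
(https?:\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,6}) puts both parts together in one capturing group, so that we have the original URL.
(?=https?) is a positive lookahead, so the preceding domain must be immediately followed by http or https. If ftp or other protocols are possible, this may need adjusting.
(\S+) matches one or more non-whitespace characters and captures them in a group for later processing. This second group has to be post-processed to strip a second query string such as ?param=x&option, which may belong to the surrounding URL.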
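As a rough illustration of what the two capture groups contain, here is a minimal Python sketch. It assumes Python's re module, with re.IGNORECASE standing in for the i flag; the names PATTERN, sample and matches are invented for the example, and the sample text is the one from the question.

```python
import re

# Pattern from the answer; re.IGNORECASE plays the role of the i flag.
PATTERN = re.compile(
    r"(https?:\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,6})(?=https?)(\S+)",
    re.IGNORECASE,
)

# Sample text from the question (hypothetical variable name).
sample = (
    "hi, this is your link (but this one is bad formatted and useless): "
    "https://www.test.comhttps://app.test.com/a/b/c/5e20bed422e7880012ba8acc/next?param=1?locale=2 "
    "but there is a good link too: "
    "https://app.test.com/a/b/c/5e20bed422e7880012ba8acc/next?param=1?locale=2 "
    "and there are also other irrelevant links: http://www.google.com http://test.test.com"
)

matches = [(m.group(1), m.group(2)) for m in PATTERN.finditer(sample)]
for outer, inner in matches:
    print("outer URL:", outer)  # group 1: the domain glued onto the next URL
    print("inner URL:", inner)  # group 2: everything up to the next whitespace
```

Only the malformed, concatenated link matches here. The well-formed link and the two unrelated links are skipped because nothing starting with http follows their domain.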

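Edit

Since the discussion turned out to be about matching only the correct URL, my answer above is not very good. It is not always easy to understand what is being asked.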

https://regex101.com/r/J62XZq/7

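Here we look for a URL whose domain is not followed by http: or https:.

The trick is to add \b at the start, to avoid matching a URL embedded inside another URL, and to use a negative lookahead after the domain: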

\bhttps?:\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,6}(?!https?:)\/\S+\/next\?(\S+)

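The negative lookahead is done with (?!https?:) (I left out the double slash because I think this is enough).

The last part, with /next, may not be necessary. It depends on whether you want to match specifically the URLs that contain it.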
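To see the revised pattern in action, here is a minimal Python sketch under the same assumptions as before: Python's re module with re.IGNORECASE for the i flag, and the sample text from the question. The names PATTERN, sample and found are invented for the example.

```python
import re

# Revised pattern from the edit, compiled case-insensitively (the i flag).
PATTERN = re.compile(
    r"\bhttps?:\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,6}(?!https?:)\/\S+\/next\?(\S+)",
    re.IGNORECASE,
)

# Sample text from the question (hypothetical variable name).
sample = (
    "hi, this is your link (but this one is bad formatted and useless): "
    "https://www.test.comhttps://app.test.com/a/b/c/5e20bed422e7880012ba8acc/next?param=1?locale=2 "
    "but there is a good link too: "
    "https://app.test.com/a/b/c/5e20bed422e7880012ba8acc/next?param=1?locale=2 "
    "and there are also other irrelevant links: http://www.google.com http://test.test.com"
)

# group(0) is the whole URL; group(1) is the query string after /next?
found = [(m.group(0), m.group(1)) for m in PATTERN.finditer(sample)]
for url, query in found:
    print(url)
    print(query)
```

Note that re.findall would return only the captured query string here, because the pattern contains a capturing group; finditer with group(0) gives the full URL.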

User

#2 · edited 2024-10-08 18:30:04

Use:

\bhttps?://(?=[\w.]*/)(?:(?!https?://).)*

It finds the correct URL and rejects the other URLs in your example.
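The pattern has three parts, which a commented version may make easier to follow. This is a sketch using Python's re.VERBOSE flag; the comments are one reading of the pattern, not the answerer's own wording, and the names VERBOSE_URL, COMPACT_URL and lines are invented for the example.

```python
import re

VERBOSE_URL = re.compile(r"""
    \b                 # word boundary: a URL glued onto a previous word cannot start here
    https?://          # scheme
    (?=[\w.]*/)        # lookahead: the host (word chars and dots) must be followed by a slash
    (?:                # then consume one character at a time
        (?!https?://)  # as long as a new scheme does not start at this position
        .              # any character except a newline
    )*
""", re.VERBOSE)

# The compact form from the answer, for comparison.
COMPACT_URL = re.compile(r"\bhttps?://(?=[\w.]*/)(?:(?!https?://).)*")

# Shortened, hypothetical test lines in the shape of the question's links.
lines = (
    "https://www.test.comhttps://app.test.com/a/b\n"
    "https://app.test.com/a/b\n"
    "http://www.google.com\n"
)

print(VERBOSE_URL.findall(lines))
print(COMPACT_URL.findall(lines))
```

The concatenated link fails the lookahead at its first scheme (the host runs into https: before any slash) and fails \b at its second one; the bare http://www.google.com has no slash after the host at all.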

Demo & explanation

import re

body_text = '''
hi, this is your link (but this one is bad formatted and useless):

https://www.test.comhttps://app.test.com/a/b/c/5e20bed422e7880012ba8acc/next?param=1?locale=2

but there is a good link too:

https://app.test.com/a/b/c/5e20bed422e7880012ba8acc/next?param=1?locale=2

and there are also other irrelevant links:

http://www.google.com
http://test.test.com
'''
# findall returns every non-overlapping match as a list of strings
urls = re.findall(r"\bhttps?://(?=[\w.]*/)(?:(?!https?://).)*", body_text)
print(urls)

Output:

['https://app.test.com/a/b/c/5e20bed422e7880012ba8acc/next?param=1?locale=2']
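One caveat worth checking: in the demo above every URL sits on its own line, and . does not match a newline, so the match stops at the end of the line. When everything is on a single line, as in the question's original text, the same pattern appears to keep matching through the spaces until the next http://. The sketch below compares it with a variant that uses \S instead of .; that variant is a suggested adjustment, not part of the original answer, and the names and shortened sample text are invented for the example.

```python
import re

ORIGINAL = r"\bhttps?://(?=[\w.]*/)(?:(?!https?://).)*"
# Hypothetical variant: \S instead of . so the match ends at the first whitespace.
VARIANT = r"\bhttps?://(?=[\w.]*/)(?:(?!https?://)\S)*"

good = "https://app.test.com/a/b/c/5e20bed422e7880012ba8acc/next?param=1?locale=2"
one_line = (
    "bad: https://www.test.comhttps://app.test.com/a/b/c/5e20bed422e7880012ba8acc/next?param=1?locale=2 "
    "good: " + good + " "
    "other: http://www.google.com http://test.test.com"
)

with_dot = re.findall(ORIGINAL, one_line)
with_nonspace = re.findall(VARIANT, one_line)

print(with_dot)       # the good URL plus the text trailing it on the same line
print(with_nonspace)  # only the good URL
```

If the input is known to have one URL per line, the original pattern is enough; the \S variant only matters when URLs are embedded in running text.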
