Building a regular expression that extracts only the domain

'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', 'http://www.interactivedynamicvideo.com/', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', 'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/', 'HTTPS://github.com/keppel/pinn', 'Http://phys.org/news/2015-09-scale-solar-youve.html', 'https://iot.seeed.cc', 'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html', 'http://beta.crowdfireapp.com/?beta=agnipath', 'https://www.valid.ly?param', 'http://css-cursor.techstream.org'

3 Answers

User

Reply #1 · edited 2024-10-04 11:23:38

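According to regexr.com, this should do what you need, and it is simpler: (?<=\/\/)([^/?']*). After all, the domain is really just everything from // up to the next /, the next ?, or the end of the string.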
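To make the lookbehind concrete, here is a minimal Python sketch that applies the same pattern to a few of the URLs from the question. The name `domain_re` is only illustrative, and the backslashes before the slashes are dropped because Python does not need them.

```python
import re

# Lookbehind pattern from the answer: start right after "//" and take
# everything up to the next "/", "?", single quote, or end of string.
domain_re = re.compile(r"(?<=//)([^/?']*)")

urls = [
    'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
    'HTTPS://github.com/keppel/pinn',
    'https://www.valid.ly?param',
    'http://css-cursor.techstream.org',
]

domains = [domain_re.search(url).group(1) for url in urls]
print(domains)
# ['www.amazon.com', 'github.com', 'www.valid.ly', 'css-cursor.techstream.org']
```

Because the pattern never looks at the scheme, it works for HTTPS:// and Http:// without a case-insensitive flag.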

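User

Reply #2 · edited 2024-10-04 11:23:38

Is a regex a hard requirement, for example because you need to combine it with an existing regex? If not, the standard library has a simple tool for this: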

from urllib.parse import urlparse

urls = [
    'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
    'http://www.interactivedynamicvideo.com/',
    'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
    'http://evonomics.com/advertising-cannot-maintain-internet-heres-solution/',
    'HTTPS://github.com/keppel/pinn',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'https://iot.seeed.cc',
    'http://www.bfilipek.com/2016/04/custom-deleters-for-c-smart-pointers.html',
    'http://beta.crowdfireapp.com/?beta=agnipath',
    'https://www.valid.ly?param',
    'http://css-cursor.techstream.org',
]

domains = [urlparse(url).netloc for url in urls]
print(domains)

I would expect a regex to be faster, and a quick timing agrees:

>>> import re
>>> netloc = re.compile(r'^https?://([^/?^]+)', flags=re.I)
>>> %timeit [netloc.match(url).group(1) for url in urls]
5.66 µs ± 97.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit [urlparse(url).netloc for url in urls]
23.3 µs ± 3.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
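Speed aside, it is worth checking that the two approaches return the same thing. The sketch below reuses the pattern from the timing session on a few of the question's URLs, then tries one hypothetical URL (`http://example.com#section`, not part of the question) where the results diverge because the regex does not stop at `#`.

```python
import re
from urllib.parse import urlparse

# Same pattern as in the timing session above.
netloc = re.compile(r'^https?://([^/?^]+)', flags=re.I)

urls = [
    'HTTPS://github.com/keppel/pinn',
    'Http://phys.org/news/2015-09-scale-solar-youve.html',
    'https://www.valid.ly?param',
    'http://css-cursor.techstream.org',
]

regex_domains = [netloc.match(url).group(1) for url in urls]
parsed_domains = [urlparse(url).netloc for url in urls]
print(regex_domains == parsed_domains)  # True

# Hypothetical URL (not in the question): a fragment directly after the host.
tricky = 'http://example.com#section'
regex_tricky = netloc.match(tricky).group(1)
parsed_tricky = urlparse(tricky).netloc
print(regex_tricky)   # example.com#section
print(parsed_tricky)  # example.com
```

So the regex is a reasonable choice when the input is known to look like the sample data, while urlparse is the safer default for arbitrary URLs.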

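User

Reply #3 · edited 2024-10-04 11:23:38

For the example data, you can use an alternation of ly, org, com and cc, and escape the dot so that it is matched literally.

To match css-cursor.techstream.org, you can use a repeated group that matches either - or .

Note that [^\/\/] is the same as [^/], and matches any character except /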

\w+(?:[.-]\w+)*\.(?:ly|org|com|cc)\b

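\w+ matches 1 or more word characters
(?:[.-]\w+)* optionally repeats matching . or - followed by 1 or more word characters
\. matches a literal dot (note the escape)
(?:ly|org|com|cc) non-capturing group, matches any one of the alternatives
\b word boundary, prevents a partial match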
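The pieces above can be tried out with a short Python sketch. The name `tld_re` is illustrative, and `http://example.net/page` is a hypothetical URL added only to show the limitation of a hard-coded list of top-level domains.

```python
import re

# Pattern from the answer: labels joined by "." or "-", ending in one of
# the listed top-level domains, followed by a word boundary.
tld_re = re.compile(r'\w+(?:[.-]\w+)*\.(?:ly|org|com|cc)\b')

urls = [
    'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
    'https://iot.seeed.cc',
    'https://www.valid.ly?param',
    'http://css-cursor.techstream.org',
]

matches = [tld_re.search(url).group(0) for url in urls]
print(matches)
# ['www.amazon.com', 'iot.seeed.cc', 'www.valid.ly', 'css-cursor.techstream.org']

# Hypothetical URL (not in the question): ".net" is not in the alternation,
# so nothing matches and search() returns None.
missing = tld_re.search('http://example.net/page')
print(missing)  # None
```

The scheme is skipped automatically because `https` is followed by `:` rather than by a dot and a listed top-level domain.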


If you also want to match the protocol, you can put the part you want to keep in a capturing group

\bhttps?://(\w+(?:[.-]\w+)*\.(?:ly|org|com|cc))\b
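One thing to watch with this version: the sample data contains HTTPS:// and Http://, and the pattern as written is case-sensitive. The sketch below compares it with and without `re.I`; borrowing that flag from reply #2 is an assumption on my part, not something this answer states.

```python
import re

pattern = r'\bhttps?://(\w+(?:[.-]\w+)*\.(?:ly|org|com|cc))\b'

strict = re.compile(pattern)                # pattern exactly as given
relaxed = re.compile(pattern, flags=re.I)   # assumption: ignore case

url = 'HTTPS://github.com/keppel/pinn'

strict_match = strict.search(url)
relaxed_match = relaxed.search(url)

print(strict_match)            # None, because "HTTPS" is upper case
print(relaxed_match.group(1))  # github.com
```

Group 1 holds only the domain, so the protocol is required for a match but is not part of the extracted value.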

