Indices of all commas except those inside parentheses

Published 2024-09-28 17:17:50


How can I exclude commas that sit inside short parentheses (fewer than 20 characters)?

Get index of this comma, but (not this , comma). Get other commas like, or ,or, 1,1 2 ,2. (not this ,) BUT (get index of this comma, if more than 20 characters are inside the parentheses)

Expected output (indices of all matching commas) for this example: [23, 71, 76, 79, 82, 87, 132]



3 Answers

Use the PyPI regex module:

,(?![^()]*\))|(?<=\((?=[^()]{20})[^()]*),


Python code

import regex
text = r"Get index of this comma, but (not this , comma). Get other commas like, or ,or, 1,1 2 ,2. (not this ,) BUT (get index of this comma, if more than 20 characters are inside the parentheses)"
reg_expression = r',(?![^()]*\))|(?<=\((?=[^()]{20})[^()]*),'
print(regex.sub(reg_expression, r'<COMMA>\g<0></COMMA>', text))  # raw string so \g<0> is not treated as a string escape
# Get index of this comma<COMMA>,</COMMA> but (not this , comma). Get other commas like<COMMA>,</COMMA> or <COMMA>,</COMMA>or<COMMA>,</COMMA> 1<COMMA>,</COMMA>1 2 <COMMA>,</COMMA>2. (not this ,) BUT (get index of this comma<COMMA>,</COMMA> if more than 20 characters are inside the parentheses)
indices = [x.start() for x in regex.finditer(reg_expression, text)]
print(indices)
# [23, 70, 75, 78, 81, 86, 131]

Explanation of the expression

--------------------------------------------------------------------------------
  ,                        ','
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    [^()]*                   any character except: '(', ')' (0 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \)                       ')'
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  |                        OR
--------------------------------------------------------------------------------
  (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
    \(                       '('
--------------------------------------------------------------------------------
    (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
      [^()]{20}                any character except: '(', ')' (20
                               times)
--------------------------------------------------------------------------------
    )                        end of look-ahead
--------------------------------------------------------------------------------
    [^()]*                   any character except: '(', ')' (0 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  ,                        ','
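
The second alternative uses a look-behind of variable width ([^()]* inside (?<=...)), which is why this answer relies on the PyPI regex module rather than the standard library. A minimal sketch of that limitation, using only the built-in re module (the helper name compiles_with_stdlib_re is made up for illustration):

```python
import re

# The pattern from the answer above, unchanged.
PATTERN = r',(?![^()]*\))|(?<=\((?=[^()]{20})[^()]*),'

def compiles_with_stdlib_re(pattern):
    """Return True if the built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        # re rejects look-behinds whose width is not fixed.
        return False
    return True

print(compiles_with_stdlib_re(PATTERN))            # the full pattern
print(compiles_with_stdlib_re(r',(?![^()]*\))'))   # first alternative only
```

The first alternative on its own (a plain negative look-ahead) compiles fine with re; only the look-behind branch needs the third-party module.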

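You can also use the PyPI regex module with the (*SKIP)(*FAIL) verbs to match unwanted text and then exclude it from the match results.

In this case, match 1 to 20 characters between parentheses, which is where commas should not be matched.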

\([^()]{1,20}\)(*SKIP)(*FAIL)|,

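Explanation

  • \( matches a literal (
  • [^()]{1,20} matches any character except ( and ), 1 to 20 times
  • \) matches a literal )
  • (*SKIP)(*FAIL) discards what has been matched so far, so it is excluded from the results
  • | or
  • , matches a comma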


Example code

import regex

s = """Get index of this comma, but (not this , comma). Get other commas like , or ,or, 1,1 2 ,2.
(not this ,) BUT (get index of this comma, if more than 20 characters are inside the parentheses)"""
pattern = r"\([^()]{1,20}\)(*SKIP)(*FAIL)|,"
indices = [m.start(0) for m in regex.finditer(pattern, s)]
print(indices)

Output

[23, 71, 76, 79, 82, 87, 132]
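
As a cross-check that does not depend on any regex engine, the same rule can be written as a plain left-to-right scan. This is only a sketch, not part of the answer above: the function name comma_indices_scan and the max_ignored parameter are hypothetical, and it assumes parentheses are balanced and never nested.

```python
SAMPLE = (
    "Get index of this comma, but (not this , comma). "
    "Get other commas like , or ,or, 1,1 2 ,2.\n"
    "(not this ,) BUT (get index of this comma, "
    "if more than 20 characters are inside the parentheses)"
)

def comma_indices_scan(text, max_ignored=20):
    """Collect comma indices, skipping parentheses with at most max_ignored chars inside.

    Assumption: parentheses are balanced and not nested.
    """
    indices = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '(':
            close = text.find(')', i + 1)
            # Jump past the whole group when it is closed and short enough.
            if close != -1 and close - i - 1 <= max_ignored:
                i = close + 1
                continue
        elif ch == ',':
            indices.append(i)
        i += 1
    return indices

print(comma_indices_scan(SAMPLE))
# [23, 71, 76, 79, 82, 87, 132]
```

The scan agrees with the regex output on the sample text; with nested parentheses the two approaches may differ, since the pattern's [^()] never crosses an inner parenthesis.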

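Regex pattern: (,)|(\([^()]{0,20}\))

The intuition behind this pattern:

  • (,) finds all commas. These are stored in capture group 1.

  • (\([^()]{0,20}\)) finds every pair of parentheses with at most 20 characters between them. These are stored in capture group 2.

We can then keep only the group 1 matches. Because a short parenthesized span (at most 20 characters inside) is consumed as a single group 2 match, the commas inside it never show up as group 1 matches, so only those commas are excluded.

To find the indices of these matches, combine re.finditer() with Match.start() and Match.group() to get the start index of every match in group 1: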

import re

string = """Get index of this comma, but (not this , comma). Get other commas like , or ,or, 1,1 2 ,2.
(not this ,) BUT (get index of this comma, if more than 20 characters are inside the parentheses)"""

# Raw string avoids invalid-escape warnings for \( and \).
indices = [m.start(1) for m in re.finditer(r'(,)|(\([^()]{0,20}\))', string) if m.group(1)]

print(indices)
# > [23, 71, 76, 79, 82, 87, 132]
print([string[index] for index in indices])
# > [',', ',', ',', ',', ',', ',', ',']

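m.start(1) returns the start index of the group 1 match. Because re.finditer() yields a match whenever either alternative matches, the condition if m.group(1) keeps only the matches in which group 1 took part (for matches produced by group 2, m.group(1) is None).

Edit: this ignores parentheses containing 20 or fewer characters, which is inconsistent with the first statement of the question but consistent with what the example explains. If you want fewer than 20 instead, just use {0,19}.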
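
If the cut-off needs to change often, the bound can be passed in as a parameter instead of being edited in the pattern by hand. A minimal sketch under that assumption, using only the built-in re module; the function name comma_indices and the max_ignored parameter are hypothetical, and the second group is left uncaptured because only group 1 is read:

```python
import re

def comma_indices(text, max_ignored=20):
    """Indices of commas, ignoring parentheses that hold at most max_ignored characters."""
    # Same idea as (,)|(\([^()]{0,20}\)), with the upper bound filled in.
    pattern = re.compile(r'(,)|\([^()]{0,%d}\)' % max_ignored)
    return [m.start(1) for m in pattern.finditer(text) if m.group(1)]

# Exactly 20 characters inside the parentheses, comma at index 10.
boundary = "(" + "a" * 9 + "," + "b" * 10 + ")"

print(comma_indices(boundary, max_ignored=20))  # [] (20 characters still count as short)
print(comma_indices(boundary, max_ignored=19))  # [10] (only fewer than 20 are ignored)
```

The boundary string shows the only case where {0,20} and {0,19} differ: parentheses with exactly 20 characters inside.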
