Concatenating terms using the substitution method with regular expressions

Published 2024-06-02 15:18:33


Summary of the problem: I have written a generic regex to capture two groups from a sentence. I then need to concatenate the 3rd term of the 2nd group onto the 1st group. The regex uses the word "and" as the separator between the two groups. For example:

Input = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'

Output = 'Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.'

The regex I have tried:

import re
string_ = "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin." 
regex_pattern = re.compile(r"\b([A-Za-z]*-\d+\s*|[A-Za-z]+\s*)\s+(and\s*[A-Za-z]*-\d+\s*[A-Za-z]*|and\s*[A-Za-z]+\s*[A-Za-z]+)?")
print(regex_pattern.findall(string_))
print(regex_pattern.sub(lambda x: x.group(1) + x.group(2)[2], string_))

The regex captures the groups, but the line with the substitution method raises TypeError: 'NoneType' object is not subscriptable. Any suggestions or help with this problem would be much appreciated.
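The error can be reproduced in miniature. The sketch below uses a simplified, hypothetical pattern (not the one from the question) whose second group is optional. When an optional group takes no part in a match, `group(2)` returns `None`, and indexing `None` raises this `TypeError`. The helper name `safe_repl` is an assumption made for illustration.

```python
import re

# Simplified stand-in for the question's pattern: group 2 is optional.
pattern = re.compile(r"(\w+)\s+(and\s+\w+)?")

match = pattern.search("the cells")
# Group 2 did not participate in this match, so it is None;
# match.group(2)[2] would raise the TypeError from the question.
optional_group = match.group(2)

def safe_repl(m):
    # Fall back to an empty string when the optional group is missing.
    return m.group(1) + " " + (m.group(2) or "")

fixed = pattern.sub(safe_repl, "the cells")
print(optional_group, repr(fixed))
```

Note also that indexing a matched string with `[2]` yields its third character, not its third word, so guarding against `None` alone would not produce the desired output.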


2 answers

Solution using split

Although this is not a regex solution, it does work:

from string import punctuation

x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
x = x.split()
for idx, word in enumerate(x):
    if word == "and":
        # strip punctuation or we will get skin. instead of skin
        x[idx] = x[idx + 2].strip(punctuation) + " and"
print(' '.join(x))

The output is:

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

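This solution avoids inserting directly into the list, since that would cause indexing problems while iterating. Instead, we replace the first "and" in the list with "synthesis and" and the second "and" with "skin and", then rejoin the split string.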
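The same idea can be sketched as a reusable function that builds a new list instead of modifying the one being iterated, and that skips any "and" with fewer than two words after it. The function name `distribute_shared_word` is hypothetical, and this is only one possible arrangement of the approach above.

```python
from string import punctuation

def distribute_shared_word(sentence):
    """Copy the second word after each 'and' in front of that 'and'."""
    words = sentence.split()
    result = []
    for idx, word in enumerate(words):
        # Only act when two more words follow "and".
        if word == "and" and idx + 2 < len(words):
            # Strip punctuation so "skin." is inserted as "skin".
            result.append(words[idx + 2].strip(punctuation))
        result.append(word)
    return " ".join(result)

print(distribute_shared_word("human face and animals skin."))
```

Because the sentence is split on whitespace and rejoined with single spaces, runs of spaces or line breaks in the input are collapsed.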

Regex solution

If you insist on a regex solution, I suggest using re.findall with a pattern that contains a single "and", since this is more general for the problem:

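from string import punctuation
import re

# x must be the original sentence string, not the list produced by split() above
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
# Raw string so that \s reaches the regex engine unchanged
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")
result = ''.join([f"{match[0]} {match[2].strip(punctuation)} and {match[1]} {match[2]}" for match in pattern.findall(x)])
print(result)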

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

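We use strip(punctuation) again because skin. is captured: we don't want to lose the punctuation at the end of the sentence, but we do want to drop it inside the sentence.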

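Here is our pattern:

(.*?)\sand\s(.*?)\s([^\s]+)
  1. (.*?)\s: captures everything before "and"; the whitespace right before "and" is matched but not captured
  2. \s(.*?)\s: captures the word immediately after "and"
  3. ([^\s]+): captures everything up to the next whitespace (i.e. the second word after "and"). This ensures that we capture the punctuation as well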
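To make the three groups concrete, the short sketch below runs the pattern on a fragment of the sentence and unpacks the single tuple that `re.findall` returns. The variable names are chosen for illustration only.

```python
import re

pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")

# With three capture groups, findall returns a list of 3-tuples.
matches = pattern.findall("human face and animals skin.")
before, first_word, second_word = matches[0]

print(matches)  # [('human face', 'animals', 'skin.')]
```

The third group keeps the trailing period, which is why the answer strips punctuation before reusing that word in the middle of the sentence.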

You don't need to import punctuation; a single regex does the job:

import re
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
pattern = re.compile(r"(.*?)\s+and\s+(\S+)\s+(\S+)\b([_\W]*)", re.DOTALL)
result = ''.join([f"{a} {c} and {b} {c}{d}" for a,b,c,d in pattern.findall(x)])
print(result)

Result: Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.


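Use re.DOTALL to allow the dot to match line breaks.
Use the \b word boundary at the end to split off the punctuation, and capture it in a separate group with ([_\W]*).
Use \s+ to drop any amount of whitespace from the result.
[^\s] is the same as \S, so shorten it.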
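Since the question asks about the substitution method, here is a minimal sketch of the same pattern used with `re.sub` and backreferences instead of `findall` plus `join`. This is a swapped-in variant, not part of the answer above; the function name `distribute` is hypothetical. One practical difference is that `re.sub` leaves any text after the last match in place, whereas joining the `findall` results drops it.

```python
import re

pattern = re.compile(r"(.*?)\s+and\s+(\S+)\s+(\S+)\b([_\W]*)", re.DOTALL)

def distribute(text):
    # \1 = text before "and", \2 and \3 = the two words after it,
    # \4 = punctuation or whitespace trailing the shared word.
    return pattern.sub(r"\1 \3 and \2 \3\4", text)

text = "cats and dogs food is cheap today"
rewritten = distribute(text)
print(rewritten)  # cats food and dogs food is cheap today
```

On the sentence from the question this produces the same output as the `findall` version.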


Explanation

  (                        group and capture to \1:
    .*?                      any character (0 or more times (matching
                             the least amount possible))
  )                        end of \1
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
  and                      'and'
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
  (                        group and capture to \2:
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
  )                        end of \2
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
  (                        group and capture to \3:
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
  )                        end of \3
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
  (                        group and capture to \4:
    [_\W]*                   any character of: '_', non-word
                             characters (all but a-z, A-Z, 0-9, _) (0
                             or more times (matching the most amount
                             possible))
  )                        end of \4
