Concatenating terms using a replace method with regular expressions

2 answers

User

Answer 1 · edited 2024-06-02 15:18:33

Split solution

Although this is not a regex solution, it works:

from string import punctuation

x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
x = x.split()
for idx, word in enumerate(x):
    if word == "and":
        # strip punctuation or we will get skin. instead of skin
        x[idx] = x[idx + 2].strip(punctuation) + " and"
print(' '.join(x))

Output:

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

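This solution avoids inserting directly into the list, because inserting while iterating would shift the indices. Instead, we replace the first "and" in the list with "synthesis and" and the second "and" with "skin and", then rejoin the split string.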
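The in-place replacement described above can be sketched as a small function. This is only an illustration: the name distribute_shared_noun is hypothetical, and the bounds check is an addition that the original snippet does not have (without it, an "and" with fewer than two words after it would raise an IndexError).

```python
from string import punctuation


def distribute_shared_noun(sentence: str) -> str:
    """Copy the word two positions after each "and" in front of that "and"."""
    words = sentence.split()
    for idx, word in enumerate(words):
        # Only rewrite when a word really exists two positions after "and".
        if word == "and" and idx + 2 < len(words):
            shared = words[idx + 2].strip(punctuation)
            # Overwrite the element in place: the list length never changes,
            # so the indices produced by enumerate() stay valid.
            words[idx] = f"{shared} and"
    return " ".join(words)


print(distribute_shared_noun("human face and animals skin."))
print(distribute_shared_noun("salt and pepper"))
```

The first call prints "human face skin and animals skin."; the second leaves "salt and pepper" unchanged because no word follows "pepper".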

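Regex solution

If you insist on a regex solution, I suggest using re.findall with a pattern that contains a single "and", since that generalizes better to the problem: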

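from string import punctuation
import re

# x must be the original sentence string here, not the list left over from split()
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
# raw string, so that \s reaches the regex engine unchanged
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")
result = ''.join([f"{match[0]} {match[2].strip(punctuation)} and {match[1]} {match[2]}" for match in pattern.findall(x)])
print(result)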

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

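We use strip(punctuation) again because "skin." is captured: we want to keep the punctuation at the end of the sentence, but drop it from the copy inserted inside the sentence.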
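As a quick illustration of why strip(punctuation) is safe for words such as RbC-27, note that str.strip only removes characters from the two ends of the string. The sample words below are chosen for illustration only:

```python
from string import punctuation

# str.strip(chars) removes the given characters from both ends only.
print("skin.".strip(punctuation))    # skin
print("RbC-27".strip(punctuation))   # RbC-27 (the inner hyphen is kept)
print("(skin),".strip(punctuation))  # skin
```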

Here is our pattern:

(.*?)\sand\s(.*?)\s([^\s]+)

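(.*?)\s: lazily captures everything before "and"; the whitespace right before "and" is matched but stays outside the group
\s(.*?)\s: captures the word immediately after "and"
([^\s]+): captures everything up to the next whitespace (that is, the second word after "and"). This ensures that we also capture the punctuation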
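To see what each group holds, it can help to print the tuples that findall returns. The sample string below is made up for illustration; it also shows a limitation worth keeping in mind: text after the last match is not part of any tuple, so joining the findall results would drop it.

```python
import re

pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")

# Hypothetical input with extra text after the last "and" construction.
sample = "human face and animals skin. More text follows"
matches = pattern.findall(sample)
print(matches)  # [('human face', 'animals', 'skin.')]
```

Here " More text follows" appears in no tuple, because no further "and" follows it. For the sentence in the question this does not matter, since the last match ends at the end of the string.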

User

Answer 2 · edited 2024-06-02 15:18:33

You do not need to import punctuation; a single regex does the job:

import re
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
pattern = re.compile(r"(.*?)\s+and\s+(\S+)\s+(\S+)\b([_\W]*)", re.DOTALL)
result = ''.join([f"{a} {c} and {b} {c}{d}" for a,b,c,d in pattern.findall(x)])
print(result)

Result: Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

See the Python proof.

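Use re.DOTALL so that the dot also matches newlines.
Use the \b word boundary at the end to separate trailing punctuation from the word, and capture that punctuation in its own group with ([_\W]*).
Use \s+ to match any run of whitespace, so that extra whitespace does not end up in the result.
[^\s] is the same as \S, so use the shorter form.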

See the regex proof.
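Since the question asks about a replace method, the same idea can also be written with re.sub and backreferences. This is a sketch of an alternative, not the approach either answer uses: it matches only the "and" construction itself, so all text outside the match (including trailing punctuation) is left untouched.

```python
import re

x = ('Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused '
     'by WbC-2 of acnes in human face and animals skin.')

# \1 = the word after "and", \2 = the shared word. The final \b makes \2 end
# at a word boundary, so a trailing "." stays outside the match.
pattern = re.compile(r"\s+and\s+(\S+)\s+(\S+)\b")
result = pattern.sub(r" \2 and \1 \2", x)
print(result)
```

For the sentence from the question this produces the same output as the two answers, without joining findall tuples.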

Explanation

                                        
  (                        group and capture to \1:
    .*?                      any character (0 or more times (matching
                             the least amount possible))
  )                        end of \1
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
  and                      'and'
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
  (                        group and capture to \2:
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
  )                        end of \2
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
  (                        group and capture to \3:
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
  )                        end of \3
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
  (                        group and capture to \4:
    [_\W]*                   any character of: '_', non-word
                             characters (all but a-z, A-Z, 0-9, _) (0
                             or more times (matching the most amount
                             possible))
  )                        end of \4
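
The breakdown above can also be kept next to the code by compiling the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows inline comments. This is a sketch of the same pattern from the answer; the short test string is made up for illustration.

```python
import re

pattern = re.compile(
    r"""
    (.*?)      # group 1: everything before "and", as little as possible
    \s+and\s+  # the word "and" with whitespace on both sides
    (\S+)      # group 2: the word right after "and"
    \s+
    (\S+)\b    # group 3: the shared word, ending at a word boundary
    ([_\W]*)   # group 4: trailing punctuation or whitespace, possibly empty
    """,
    re.DOTALL | re.VERBOSE,
)

groups = pattern.findall("human face and animals skin.")
print(groups)  # [('human face', 'animals', 'skin', '.')]
```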
