Summary of problem: I have written a generic regex to capture two groups from a sentence. Next, I need to append the third term of the second group to the first group. I have used the word "and" in the regex as the separator between the two groups of the sentence. For example:
Input = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
Output = 'Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.'
The regex I have tried:
import re
string_ = "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin."
regex_pattern = re.compile(r"\b([A-Za-z]*-\d+\s*|[A-Za-z]+\s*)\s+(and\s*[A-Za-z]*-\d+\s*[A-Za-z]*|and\s*[A-Za-z]+\s*[A-Za-z]+)?")
print(regex_pattern.findall(string_))
print(regex_pattern.sub(lambda x: x.group(1) + x.group(2)[2], string_))
The regex captures the groups, but the line that calls the sub method raises TypeError: 'NoneType' object is not subscriptable. Any suggestions or help with solving this problem would be greatly appreciated.
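For context, the TypeError can be reproduced with a much smaller pattern. In Python's re module, an optional group that takes no part in a match is reported as None, so indexing it fails. The sketch below uses a simplified pattern and a hypothetical helper, `safe_group`, neither of which comes from the question. Note also that indexing a string with [2] returns its third character, not its third word.

```python
import re

# Simplified stand-in for the question's pattern: the second group is optional.
pattern = re.compile(r"(\w+)(\s+and\s+\w+)?")

match = pattern.match("cells of")  # no "and" follows "cells"
print(match.group(1))  # cells
print(match.group(2))  # None: the optional group did not take part in the match


def safe_group(m, index):
    """Hypothetical helper: return the group's text, or "" if it did not match."""
    value = m.group(index)
    return value if value is not None else ""


print(repr(safe_group(match, 2)))  # ''
```

Inside a `sub` callback, a guard of this kind (or a pattern without the trailing `?`) keeps the replacement function from subscripting None.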
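Split solution
While this isn't a regex solution, it does work, and it produces the expected output from the question.
This solution avoids inserting directly into the list, because doing so while iterating causes index problems. Instead, we replace the first "and" in the list with "synthesis and" and the second "and" with "skin and", then rejoin the split string.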
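A minimal sketch of the split-based approach described above, assuming the sentence is split on whitespace and that the shared word is always the second token after each "and". The function name `distribute_shared_word` is hypothetical, and this is a reconstruction under those assumptions rather than the answer's own code.

```python
from string import punctuation


def distribute_shared_word(sentence):
    """Replace each "and" with "<shared word> and", without inserting into the list."""
    words = sentence.split()
    result = []
    for i, word in enumerate(words):
        if word == "and" and i + 2 < len(words):
            # The shared word is two tokens after "and"; drop trailing punctuation
            # such as the period in "skin." before copying it forward.
            shared = words[i + 2].strip(punctuation)
            result.append(f"{shared} and")
        else:
            result.append(word)
    return " ".join(result)


string_ = (
    "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused "
    "by WbC-2 of acnes in human face and animals skin."
)
print(distribute_shared_word(string_))
```

Building a new list instead of inserting into `words` keeps the indices stable during the loop, which is the point the answer makes.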
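Regex solution
If you insist on a regex solution, I suggest using re.findall with a pattern that contains a single "and", since that generalizes better to the problem. We again use strip(punctuation), because "skin." is captured together with its period: we don't want to lose the punctuation at the end of the sentence, but we do want to drop it inside the sentence. This is our pattern:
(.*?)\s : captures everything before "and", up to the whitespace that precedes it
\s(.*?)\s : captures the word immediately after "and"
([^\s]+) : captures everything up to the next whitespace character (that is, the second word after "and"). This ensures we also capture the punctuation.
You don't need to import punctuation; a single regex can do the job. Result: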
Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.
See Python proof.
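Use re.DOTALL to allow the dot to match newlines. Use a \b word boundary at the end to strip off the punctuation, and capture it into a separate group with ([_\W]*). Use \s+ to drop any number of whitespace characters from the result. [^\s] is the same as \S, so shorten it. See regex proof.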
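The two regex approaches described above can be sketched as follows. Both functions are reconstructions under assumptions, not the answers' original code: the pattern in `with_findall` is assembled from the three fragments listed earlier, joined by a literal "and"; the pattern and replacement in `with_sub` are inferred from the notes on re.DOTALL, \b, ([_\W]*), \s+ and \S. Both function names are hypothetical.

```python
import re
from string import punctuation


def with_findall(sentence):
    # Each match yields (text before "and", first word after it, second word after it).
    pattern = r"(.*?)\sand\s(.*?)\s([^\s]+)"
    pieces = []
    for before, first, shared in re.findall(pattern, sentence):
        # Strip punctuation only from the copy that is inserted mid-sentence.
        pieces.append(f"{before} {shared.strip(punctuation)} and {first} {shared}")
    # Caveat: any text after the last match is not returned by findall and is dropped.
    return "".join(pieces)


def with_sub(sentence):
    # \b splits trailing punctuation off the shared word; ([_\W]*) keeps it as group 4.
    pattern = r"(.*?)\s+and\s+(.*?)\s+(\S+)\b([_\W]*)"
    return re.sub(pattern, r"\1 \3 and \2 \3\4", sentence, flags=re.DOTALL)


string_ = (
    "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused "
    "by WbC-2 of acnes in human face and animals skin."
)
print(with_findall(string_))
print(with_sub(string_))
```

The `re.sub` version leaves unmatched text in place, so it is the safer choice when a sentence continues after the last "and" phrase.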