Python regular expression unexpectedly replaces Chinese characters

import re

sourcepath = 'sourcefile.txt'
destpath = 'result.txt'
pattern = '[A-z]*[\u4300-\u9fff]+(\s)[A-z]*[\u4300-\u9fff]+,'

source = open(sourcepath, 'r').read()
dest = open(destpath, 'w')
result = re.sub(pattern, ',', source)
dest.write(result)
dest.close()

3 Answers

User

Answer 1 · edited 2024-05-20 14:10:35

If you have an odd number of Chinese "words", your pattern has to account for overlapping matches. Use lookaheads:

re.sub(r'(?i)[A-Z]*[\u4300-\u9fff]+(?=\s+[A-Z]*[\u4300-\u9fff]+)', r'\g<0>,', source)
                                   ^^^                         ^
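To illustrate the overlap problem, here is a minimal sketch. The three-word sample string is made up for illustration, and the consuming pattern is a simplified stand-in, not the exact pattern from the question.

```python
import re

# Hypothetical sample: three space-separated Chinese "words" (an odd count).
text = "山牆 山墙 超聲"

# Consuming pattern: both words and the whitespace belong to each match,
# so the second word cannot also start the next match.
consuming = re.sub(r'([\u4300-\u9fff]+)\s+([\u4300-\u9fff]+)', r'\1,\2', text)

# Lookahead pattern: only the first word is consumed; the following word is
# merely checked, so it stays available as the start of the next match.
lookahead = re.sub(r'[\u4300-\u9fff]+(?=\s+[\u4300-\u9fff]+)', r'\g<0>,', text)

print(consuming)  # 山牆,山墙 超聲
print(lookahead)  # 山牆, 山墙, 超聲
```

With the consuming pattern the last space is left untouched, because the middle word was already used up by the first match.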

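Alternatively, emulate an atomic group: capture inside a positive lookahead, consume the captured text with a backreference, and add a negative lookahead that checks whether a comma is already present. The Python 3 demo below compiles this pattern.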

See the regex demo (and demo 2). Ignore the \x{} notation there: I used the PHP option, so it is only for demonstration.

See the IDEONE Python 3 demo:

import re
p = re.compile(r'[A-Z]*(?=([\u4300-\u9fff]+))\1(?!,)', re.IGNORECASE | re.U)
test_str = "山牆 山墙,shan1 qiang2,gable\nB型超聲 B型超声, B xing2 chao1 sheng1,type-B ultrasound"
result = p.sub(r"\g<0>,", test_str)
print(result)
# => 山牆, 山墙,shan1 qiang2,gable
# => B型超聲, B型超声, B xing2 chao1 sheng1,type-B ultrasound
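Why the lookahead-plus-backreference trick matters can be seen by comparing it with a plain greedy pattern. This is only a sketch on a shortened sample line of my own choosing; it relies on the fact that Python's re module does not backtrack into a lookahead once it has succeeded.

```python
import re

# Shortened sample: the second word is already followed by a comma.
text = "山牆 山墙,gable"

# Plain greedy run + (?!,): when the full word is followed by a comma, the
# engine backtracks one character and matches a shorter run instead.
naive = re.sub(r'[\u4300-\u9fff]+(?!,)', r'\g<0>,', text)

# Atomic-group emulation: the lookahead captures the whole run, \1 consumes
# exactly that run, and the engine cannot shorten it afterwards.
atomic = re.sub(r'(?=([\u4300-\u9fff]+))\1(?!,)', r'\g<0>,', text)

print(naive)   # 山牆, 山,墙,gable
print(atomic)  # 山牆, 山墙,gable
```

The naive version splits the second word with a stray comma; the emulated atomic group leaves it alone.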

User

Answer 2 · edited 2024-05-20 14:10:35

I thought that by putting the \s character in parentheses, that it would make a capture group, and only that space would be replaced.

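That is not how capture groups work. Everything that matched is still replaced, but with capture groups you can refer to parts of the match in the replacement.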
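A small sketch of that behaviour, using a made-up ASCII string so the effect is easy to see:

```python
import re

text = "left right"

# The replacement substitutes the entire match, not just the captured group.
whole = re.sub(r'left(\s)right', ',', text)

# To change only the whitespace, capture what should survive and refer
# back to it in the replacement string.
only_space = re.sub(r'(left)\s(right)', r'\1,\2', text)

print(whole)       # ,
print(only_space)  # left,right
```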

I would change two lines of your script:

pattern = r'(?i)([a-z]*[\u4300-\u9fff]+)\s([a-z]*[\u4300-\u9fff]+)'

and the re.sub call, so that the replacement string refers back to the two captured groups.
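A minimal sketch of how the two changed lines might fit together. The replacement string r'\1,\2' is my assumption (the answer's exact replacement is not shown here), and an in-memory sample string stands in for sourcefile.txt.

```python
import re

# Pattern from the answer, written as a raw string so that \s and \uXXXX
# reach the regex engine unchanged.
pattern = r'(?i)([a-z]*[\u4300-\u9fff]+)\s([a-z]*[\u4300-\u9fff]+)'

# Assumption: sample text standing in for the contents of sourcefile.txt.
source = ("山牆 山墙,shan1 qiang2,gable\n"
          "B型超聲 B型超声, B xing2 chao1 sheng1,type-B ultrasound")

# Assumption: keep both captured words and swap the single whitespace
# character between them for a comma.
result = re.sub(pattern, r'\1,\2', source)
print(result)
# 山牆,山墙,shan1 qiang2,gable
# B型超聲,B型超声, B xing2 chao1 sheng1,type-B ultrasound
```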

User

Answer 3 · edited 2024-05-20 14:10:35

Tested with the sample code on Python 3.5:

result = re.sub(r"([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)", r"\1,\2", subject, flags=re.IGNORECASE)

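Regex explanation

([\u4e00-\u9fff]+)    group 1: one or more characters in the range U+4E00 to U+9FFF
\s+                   one or more whitespace characters
(?:[a-z]+)?           optional, non-capturing: one or more letters a-z (upper or lower case, because of re.IGNORECASE)
([\u4e00-\u9fff]+)    group 2: one or more characters in the range U+4E00 to U+9FFF

The replacement \1,\2 writes group 1, a comma, and group 2. The whitespace and any letters matched by the non-capturing group are not written back.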
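A quick check of this pattern, as a sketch only. The sample strings are borrowed from the first answer's demo, which is my choice and not part of this answer.

```python
import re

pattern = re.compile(
    r"([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)",
    re.IGNORECASE,
)

# Two plain Chinese words: the whitespace between them becomes a comma.
plain = pattern.sub(r"\1,\2", "山牆 山墙,shan1 qiang2,gable")

# A Latin prefix on the second word is matched by the non-capturing group,
# so it is in neither \1 nor \2 and disappears from the output.
prefixed = pattern.sub(r"\1,\2", "B型超聲 B型超声")

print(plain)     # 山牆,山墙,shan1 qiang2,gable
print(prefixed)  # B型超聲,型超声
```

If the prefix on the second word has to be kept, it would need to sit inside a capturing group, as in the pattern from the second answer.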
