Cleaning a text file with Python regex

3 answers

User

#1 · edited 2024-09-28 01:33:11

I find this regex cheat sheet very handy for situations like this.

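# -*- coding: utf-8 -*-
import re
import string

u = u"En.!?+ 123 g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
# Match anything that is not an ASCII word character, whitespace, digit, or punctuation.
# re.ASCII keeps \w, \s and \d ASCII-only on Python 3.
p = re.compile(r"[^\w\s\d{}]".format(re.escape(string.punctuation)), re.ASCII)
for m in p.finditer(u):
    print(m.group())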

>>> 茅
>>> 茅
>>> 猫
>>> 猫
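
To remove these characters instead of just listing them, the same character class can be passed to `sub`. This is a minimal sketch, assuming Python 3, where `re.ASCII` is needed so that `\w` does not also match CJK characters. The function name `clean` is only illustrative.

```python
import re
import string

# Same character class as in the answer; re.ASCII keeps \w, \s and \d ASCII-only.
PATTERN = re.compile(r"[^\w\s\d{}]".format(re.escape(string.punctuation)), re.ASCII)

def clean(text):
    """Drop every character outside ASCII word chars, whitespace, digits, punctuation."""
    return PATTERN.sub("", text)

u = "En.!?+ 123 g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
print(clean(u))
```

Note that the stray characters are simply dropped, so `g茅n茅ral` becomes `gnral`, not `general`.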

I'm also a big fan of the ^{} module.

User

#2 · edited 2024-09-28 01:33:11

You can use the string module.

>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
'0123456789'
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>>

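It looks like the characters you want to replace are Chinese. If all your strings are unicode, you can replace them with the simple range [\u4e00-\u9fa5]. This does not cover every Chinese character, but it is enough here.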
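
The original snippet for this step is not preserved. A minimal sketch of the idea, assuming Python 3 and the sample text from the question. The name `strip_cjk` is illustrative, and the range is the one quoted above, not a complete list of Chinese code points.

```python
import re

# The range quoted in the answer: part of the CJK Unified Ideographs block.
CJK = re.compile("[\u4e00-\u9fa5]")

def strip_cjk(text):
    """Remove every character that falls inside the U+4E00..U+9FA5 range."""
    return CJK.sub("", text)

text = "En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
print(strip_cjk(text))
```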

User

#3 · edited 2024-09-28 01:33:11

You can do this without regex.

To keep only ASCII characters:

# -*- coding: utf-8 -*-
import unicodedata

unistr = u"En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
unistr = unicodedata.normalize('NFD', unistr) # to preserve `e` in `é`
ascii_bytes = unistr.encode('ascii', 'ignore')
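
A small sketch of what the NFD step buys you, assuming Python 3. The helper name `to_ascii` is illustrative. Genuinely accented Latin letters decompose into a base letter plus a combining mark, so the base letter survives the ASCII encode. The CJK characters in the sample have no such decomposition and are dropped entirely.

```python
import unicodedata

def to_ascii(text):
    """Decompose accents, then discard everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("très général"))    # accented letters keep their base letter
print(to_ascii("tr猫s g茅n茅ral"))  # CJK characters are removed outright
```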

Everything except ASCII letters, digits, and punctuation can be removed without regex as well.
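
The original snippet for this step is not preserved. One way to do it, sketched under the assumption of Python 3, is a plain membership filter built from the `string` constants. The name `keep_allowed` is illustrative, and keeping the space character is my assumption so that words stay separated; leave it out for a strict filter.

```python
import string

# Characters to keep. The trailing " " is an assumption, not part of the answer.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " ")

def keep_allowed(text):
    """Keep only characters that appear in ALLOWED."""
    return "".join(ch for ch in text if ch in ALLOWED)

unistr = "En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
print(keep_allowed(unistr))
```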
