Python regex to print words (letters only) with spaces and exclude non-ASCII characters

Posted 2024-10-01 11:38:20


My Python function is defined as follows:

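import re

def name_extractor(dirty_name):
    # The original printed an undefined `Name`; the raw input is assumed here.
    print(dirty_name)
    # Replace every non-word character with a space.
    clean_name = re.sub(r'\W', " ", dirty_name)
    print(clean_name)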

Sample dirty names include:

(10) Johny Doe
Eric E. Shelby
(1) Chris Melton - ŗ≤ēŗ≤Ņŗ≤įŗ≤Ņŗ≤ēŗ≥ć ŗ≤ēŗ≥Äŗ≤įŗ≥ćŗ≤§ŗ≤Ņ
Jonas Alexander Bay
Christopher Rockstar - An awesome guy
Jones Collier

I only want to print:

Johny Doe
Eric E. Shelby
Chris Melton
Jonas Alexander Bay
Christopher Rockstar
Jones Collier

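How can I adjust the regex so that it prints only the names as they are and excludes everything after the "-" (whether random characters or plain ASCII characters)?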


2 Answers

You don't need a regex. Split each line on ' - ', then filter out the unwanted characters and strip the extra whitespace:

>>> l = '''(10) Johny Doe
... Eric E. Shelby
... (1) Chris Melton - ŗ≤ēŗ≤Ņŗ≤įŗ≤Ņŗ≤ēŗ≥ć ŗ≤ēŗ≥Äŗ≤įŗ≥ćŗ≤§ŗ≤Ņ
... Jonas Alexander Bay
... Christopher Rockstar - An awesome guy
... Jones Collier'''.splitlines()
>>> for line in l:
...     print(''.join(c for c in line.split(' - ')[0] if c.isalpha() or c in ' .').strip())
...
Johny Doe
Eric E. Shelby
Chris Melton
Jonas Alexander Bay
Christopher Rockstar
Jones Collier
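
The same split-and-filter idea can be wrapped in a reusable function. The following is only a sketch, not the answerer's code: the names `clean_name` and `ALLOWED` are assumptions made for illustration. It also restricts the kept letters to ASCII, since `str.isalpha()` on its own accepts non-ASCII letters as well.

```python
import string

# Hypothetical allow-list: ASCII letters plus space and dot (an assumption).
ALLOWED = set(string.ascii_letters + " .")

def clean_name(line):
    """Keep the text before ' - ', then drop every character not in ALLOWED."""
    head = line.split(" - ")[0]
    return "".join(c for c in head if c in ALLOWED).strip()

samples = [
    "(10) Johny Doe",
    "Eric E. Shelby",
    "Christopher Rockstar - An awesome guy",
]
for sample in samples:
    print(clean_name(sample))
```

Using an explicit allow-list keeps the rule easy to adjust, for example by adding "'" if names such as O'Brien should survive.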

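To exclude all non-ASCII characters and everything that follows the hyphen -, it is enough to replace them with the empty string "".
A short solution using a specific regex pattern: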

import re

dirty_name = '''
(10) Johny Doe
Eric E. Shelby
(1) Chris Melton - ŗ≤ēŗ≤Ņŗ≤įŗ≤Ņŗ≤ēŗ≥ć ŗ≤ēŗ≥Äŗ≤įŗ≥ćŗ≤§ŗ≤Ņ
Jonas Alexander Bay
Christopher Rockstar - An awesome guy
Jones Collier'''

clean_name = '\n'.join(l.lstrip() for l in re.sub(r'[^\x00-\x7f]|[\d()]| - .+\b(?=\n)', "", dirty_name).split('\n'))
print(clean_name)

Output:

Johny Doe
Eric E. Shelby
Chris Melton
Jonas Alexander Bay
Christopher Rockstar
Jones Collier
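
One thing to be aware of: the lookahead `(?=\n)` in the pattern above only succeeds when a newline follows, so a " - ..." suffix on the very last line (with no trailing newline) would be left in place. A possible variant, offered as a sketch rather than as the answerer's method, anchors on `$` with `re.MULTILINE` instead. The names `PATTERN` and `clean_names` are assumptions.

```python
import re

# Matches: a non-ASCII character, a digit or parenthesis,
# or " - " followed by the rest of the line.
PATTERN = re.compile(r"[^\x00-\x7f]|[\d()]| - .*$", re.MULTILINE)

def clean_names(text):
    """Hypothetical helper: clean every line of a multi-line string."""
    cleaned = PATTERN.sub("", text)
    # Strip the whitespace left behind by removed prefixes such as "(10)".
    return "\n".join(line.strip() for line in cleaned.split("\n"))

text = "(10) Johny Doe\nEric E. Shelby\nJones Collier - last line, no newline"
print(clean_names(text))
```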

Edit: removed the leading spaces, because @TigerhawkT3 is too sensitive about whitespace (in his own religion).

P.S. \x00-\x7f is the ASCII character range.
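
To see that range in isolation, here is a minimal sketch; the function name `strip_non_ascii` is an assumption. It removes every character outside \x00-\x7f and compares the result with `str.isascii()`, which is available from Python 3.7.

```python
import re

def strip_non_ascii(text):
    """Hypothetical helper: drop every character outside the ASCII range."""
    return re.sub(r"[^\x00-\x7f]", "", text)

sample = "caf\u00e9 na\u00efve"      # contains two non-ASCII letters
stripped = strip_non_ascii(sample)

print(sample.isascii())    # False: the input has non-ASCII characters
print(stripped.isascii())  # True: only ASCII characters remain
print(stripped)            # caf nave
```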
