Matching regex groups (with alternation) and special characters using Python's re module

# -*- coding: utf-8 -*-
import re

titles = [
    'Spaced (News)',
    'Angry Birds [Game]',
    'Cheats - for all games',  # dash
    'Cheats – for all games',  # ndash
    'Cheats — for all games',  # mdash
    'Cheats ― for all games'   # horizontal bar
]
regex = re.compile(r'^(?P<name>.+)\s+(([-–—―]\s+(?P<addition_a>.+))|([\(\[](?P<addition_b>.+)[\)\]]))$')
for title in titles:
    data = {}
    match = regex.match(title.strip())
    if match:
        data['name'] = match.group('name')
        try:
            data['addition'] = match.group('addition_a')
        except IndexError:
            pass
        try:
            data['addition'] = match.group('addition_b')
        except IndexError:
            pass
    print data

3 answers

User

#1 · edited 2024-10-05 14:25:41

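In a UTF-8 source file, non-ASCII characters such as the en dash occupy more than one byte. Python 2 byte strings (plain '...' literals) store those bytes individually, and the re module then treats each byte as a separate character, which is what causes the problem here. You can do one of the following:

You can make sure that every string being parsed is unicode. If you control the strings this is simple: for your example, just add the u prefix to each string literal, like so: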

u'Spaced (News)',
u'Angry Birds [Game]',
u'Cheats - for all games', # dash
u'Cheats – for all games', # ndash
u'Cheats — for all games', # mdash
u'Cheats ― for all games'  # horizontal bar

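and add it to the regular expression as well, like so:

ur'^(?P<name>.+)\s+(([-–—―]\s+(?P<addition_a>.+))|([\(\[](?P<addition_b>.+)[\)\]]))$'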

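Otherwise, or if you do not control the strings, you can make a small change that, while not strictly correct, will work. The change is to accept several characters from the set [-–—―] rather than a single one, by writing [-–—―]+: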

r'^(?P<name>.+)\s+(([-–—―]+\s+(?P<addition_a>.+))|([\(\[](?P<addition_b>.+)[\)\]]))$'
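Why the `+` helps, and why it is only approximately right, can be sketched in Python 3 with an explicit bytes pattern. This is an illustration under assumptions, not the answerer's code: `BYTE_RE` and `parse_bytes` are hypothetical names, and the character class spells out the UTF-8 bytes that the three dashes contribute (`\xe2\x80\x93`, `\xe2\x80\x94`, `\xe2\x80\x95`).

```python
import re

# In a byte-string pattern the class [-–—―] really means:
# '-' plus the individual UTF-8 bytes e2, 80, 93, 94, 95.
BYTE_RE = re.compile(
    rb'^(?P<name>.+)\s+[-\xe2\x80\x93\x94\x95]+\s+(?P<addition>.+)$'
)

def parse_bytes(raw):
    """Hypothetical helper: return the named groups, or None if no match."""
    match = BYTE_RE.match(raw)
    return match.groupdict() if match else None

# An en dash is three bytes, so the class must repeat to consume all of them.
ok = parse_bytes('Cheats – for all games'.encode('utf-8'))
print(ok)     # {'name': b'Cheats', 'addition': b'for all games'}

# The looseness: any run of those bytes is accepted, even one that is not a dash.
loose = parse_bytes(b'Cheats \x80\x80 for all games')
print(loose)  # {'name': b'Cheats', 'addition': b'for all games'}
```

The second call shows the trade-off: the pattern matches byte runs that do not decode to any of the intended dashes.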

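Either of these two options will get you the result you want. (addition prints as None for the dash titles because the loop in the question assigns match.group('addition_b') last, overwriting the value taken from addition_a.)

The first produces unicode results: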

>>> 
{'addition': u'News', 'name': u'Spaced'}
{'addition': u'Game', 'name': u'Angry Birds'}
{'addition': None, 'name': u'Cheats'}
{'addition': None, 'name': u'Cheats'}
{'addition': None, 'name': u'Cheats'}
{'addition': None, 'name': u'Cheats'}

The second produces regular (byte) strings:

>>> 
{'addition': 'News', 'name': 'Spaced'}
{'addition': 'Game', 'name': 'Angry Birds'}
{'addition': None, 'name': 'Cheats'}
{'addition': None, 'name': 'Cheats'}
{'addition': None, 'name': 'Cheats'}
{'addition': None, 'name': 'Cheats'}
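For comparison, here is a minimal sketch of the same idea in Python 3, where every `str` literal is already Unicode and no `u` prefix is needed. It is not the answerer's code: `TITLE_RE` and `parse_title` are hypothetical names, and the two groups are merged with `or` so that a matched `addition_a` is not overwritten.

```python
import re

# Python 3: each dash below is one character, so a single-character class suffices.
TITLE_RE = re.compile(
    r'^(?P<name>.+)\s+'
    r'(?:[-–—―]\s+(?P<addition_a>.+)'        # "name - addition"
    r'|[(\[](?P<addition_b>.+)[)\]])$'       # "name (addition)" or "name [addition]"
)

def parse_title(title):
    """Hypothetical helper: split a title into name and addition, or return None."""
    match = TITLE_RE.match(title.strip())
    if match is None:
        return None
    return {
        'name': match.group('name'),
        # Only one branch of the alternation can match; the other group is None.
        'addition': match.group('addition_a') or match.group('addition_b'),
    }

titles = [
    'Spaced (News)',
    'Angry Birds [Game]',
    'Cheats - for all games',
    'Cheats – for all games',
    'Cheats — for all games',
    'Cheats ― for all games',
]
for title in titles:
    print(parse_title(title))
```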

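User

#2 · edited 2024-10-05 14:25:41

A slightly "sledgehammer" approach is to change the whole regex to "some word characters and whitespace until there are no more, then the rest". This also avoids the optional addition_a and addition_b named groups and the try/except logic.

Example: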

for title in titles:
    # Each run of word characters and whitespace becomes one part; the first two are kept.
    data = dict(zip(['name', 'addition'], (m.strip() for m in re.findall(r'([\w\s]+)', title))))
    print data

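Output: for each title this prints a dict with the keys name and addition, for example name 'Spaced' and addition 'News' for the first title.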
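A Python 3 sketch of this approach makes its limits easier to see. `split_title` is a hypothetical helper, and the title 'Spider-Man (Film)' is an invented example rather than part of the question's data: because every non-word character acts as a separator, a hyphen inside the name splits it as well.

```python
import re

def split_title(title):
    """Hypothetical helper: pair the first two word/space runs with the two keys."""
    parts = (part.strip() for part in re.findall(r'[\w\s]+', title))
    return dict(zip(['name', 'addition'], parts))

print(split_title('Cheats — for all games'))
# {'name': 'Cheats', 'addition': 'for all games'}

# Caveat: the hyphen in the name is treated as a separator too.
print(split_title('Spider-Man (Film)'))
# {'name': 'Spider', 'addition': 'Man'}
```

So the shortcut works for the six titles in the question, but it is worth checking against real data before relying on it.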

User

#3 · edited 2024-10-05 14:25:41

Use unicode literals. Otherwise, [-–—―] matches -, \xe2, \x80, \x93, \xe2, \x80, \x94, \xe2, \x80, \x95 rather than -, –, —, ―.

# -*- coding: utf-8 -*-
import re
titles = [
    u'Spaced (News)',
    u'Angry Birds [Game]',
    u'Cheats - for all games', # dash
    u'Cheats – for all games', # ndash
    u'Cheats — for all games', # mdash
    u'Cheats ― for all games'  # horizontal bar
]
regex = re.compile(ur'^(?P<name>.+)\s+(([-–—―]\s+(?P<addition_a>.+))|([\(\[](?P<addition_b>.+)[\)\]]))$')
for title in titles:
    match = regex.match(title.strip())
    if match:
        data = {}
        data['name'] = match.group('name')
        data['addition'] = match.group('addition_a') or match.group('addition_b')
        print data

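Output: every title now matches, with addition set to u'News', u'Game', and u'for all games' for each of the four dash variants.

The following session shows the difference between a byte-string pattern and a unicode pattern: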

>>> r'[–]'
'[\xe2\x80\x93]'
>>> re.findall(r'[–]', '–')
['\xe2', '\x80', '\x93']
>>> re.findall(ur'[–]', u'–')
[u'\u2013']
>>> print re.findall(ur'[–]', u'–')[0]
–
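The same contrast can be reproduced in Python 3, where the split between bytes and text is explicit. This is a small sketch, not part of the original answer; the variable names are chosen for illustration.

```python
import re

dash = '\u2013'                  # EN DASH as a one-character str
encoded = dash.encode('utf-8')   # the same character as UTF-8 bytes
print(encoded)                   # b'\xe2\x80\x93'

# Bytes pattern: the class holds three separate bytes, so each byte is its own match.
byte_hits = re.findall(b'[' + encoded + b']', encoded)
print(byte_hits)                 # [b'\xe2', b'\x80', b'\x93']

# Text pattern: the class holds one character, which matches once.
text_hits = re.findall('[' + dash + ']', dash)
print(text_hits)                 # ['–']
```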
