Regex in Python to split a string at a numeric character and the value that follows

Posted 2024-10-01 02:29:02


I have the following list of values:

['Champiñón 200 g',
'Zapallo italiano Unid.',
'Bolsa de zanahoria 1 kg',
'Papa malla 2 Kg',
'Palta Hass granel',
'Limón malla 1 kg',
'Tomate granel',
'Brócoli 1 un.',
'Tomate  unid']

How can I use re.split() to split this list so that it takes this form:

['Champiñón', '200 g',
'Zapallo italiano', 'Unid.',
'Bolsa de zanahoria', '1 kg',
'Papa malla', '2 Kg',
'Palta Hass granel',
'Limón malla', '1 kg',
'Tomate granel',
'Brócoli', '1 un.',
'Tomate', 'unid']

2 Answers

You can do it like this:

import re

data = ['Champiñón 200 g',
'Zapallo italiano Unid.',
'Bolsa de zanahoria 1 kg',
'Papa malla 2 Kg',
'Palta Hass granel',
'Limón malla 1 kg',
'Tomate granel',
'Brócoli 1 un.',
'Tomate  unid']


splitted = []

for line in data:
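    # On a match, re.split() returns [value, unit, inner group, '']; otherwise [line].
    # The appended '' guarantees there is always something to unpack into `unit`.
    # ' +' (instead of a single space) keeps trailing spaces out of `value`.
    value, unit, *_ = *re.split(r' +((\d|unid).*)', line, flags=re.IGNORECASE), ''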

    splitted.append(value)

    if unit:
        splitted.append(unit)

print(splitted)
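
To see why the unpacking needs `*_` and the extra `''`, it may help to look at what `re.split()` returns when the pattern contains capturing groups: the captured text is included in the result list. A small sketch with a pattern shaped like the one above and two strings from the question (the variable names `pattern`, `with_unit`, and `without_unit` are only for illustration):

```python
import re

pattern = r' +((\d|unid).*)'

# A match: the result holds the text before the match, then one entry
# per capturing group, then the (empty) text after the match.
with_unit = re.split(pattern, 'Champiñón 200 g', flags=re.IGNORECASE)
print(with_unit)  # ['Champiñón', '200 g', '2', '']

# No match: the string comes back unchanged as a single-element list,
# so without an appended '' there would be nothing to unpack into `unit`.
without_unit = re.split(pattern, 'Palta Hass granel', flags=re.IGNORECASE)
print(without_unit)  # ['Palta Hass granel']
```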

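When parsing, split() usually works best if you want to throw away the text you are splitting on. Here you want to keep it, so a capturing approach is probably a better fit.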

import re

orig_vals = [
    'Champiñón 200 g',
    'Zapallo italiano Unid.',
    'Bolsa de zanahoria 1 kg',
    'Papa malla 2 Kg',
    'Palta Hass granel',
    'Limón malla 1 kg',
    'Tomate granel',
    'Brócoli 1 un.',
    'Tomate  unid',
]

# We will capture the two parts of interest and
# only throw away a space in the middle. This regex is
# not super robust, but it does work correctly for the
# example data you have supplied.
rgx = re.compile(r'(.+) ((\d|unid).*)', re.IGNORECASE)

new_vals = []
for ov in orig_vals:
    m = rgx.search(ov)
    new_vals.extend([m.group(1).rstrip(), m.group(2)] if m else [ov])
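
The `.rstrip()` in the loop matters when more than one space precedes the unit, because the greedy `(.+)` keeps every space except the last one. A short illustration, assuming the same pattern as above (`m` and `no_match` are illustrative names):

```python
import re

rgx = re.compile(r'(.+) ((\d|unid).*)', re.IGNORECASE)

# Two spaces before the unit: group 1 keeps one of them.
m = rgx.search('Tomate  unid')
print(repr(m.group(1)))  # 'Tomate '
print(repr(m.group(2)))  # 'unid'

# No digit and no "unid" after a space: search() returns None,
# which is the case the `if m else [ov]` branch handles.
no_match = rgx.search('Tomate granel')
print(no_match)  # None
```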

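If you really want to use split, you can write a slightly more complex regex that uses a lookahead. The lookahead checks for the unit without consuming it, so the text we want to keep is not thrown away as part of the separator.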

rgx2 = re.compile(r'(.+?) +(?=\d|unid)', re.IGNORECASE)

new_vals2 = [
    part
    for ov in orig_vals
    for part in rgx2.split(ov)
    if part
]
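
As a quick sanity check, the lookahead-based split can be compared against the output requested in the question. The sketch below repeats the data so it runs on its own; `expected` is transcribed from the question, and the check only covers these nine sample strings, not product names in general.

```python
import re

orig_vals = [
    'Champiñón 200 g',
    'Zapallo italiano Unid.',
    'Bolsa de zanahoria 1 kg',
    'Papa malla 2 Kg',
    'Palta Hass granel',
    'Limón malla 1 kg',
    'Tomate granel',
    'Brócoli 1 un.',
    'Tomate  unid',
]

# The form requested in the question, written as a flat list.
expected = [
    'Champiñón', '200 g',
    'Zapallo italiano', 'Unid.',
    'Bolsa de zanahoria', '1 kg',
    'Papa malla', '2 Kg',
    'Palta Hass granel',
    'Limón malla', '1 kg',
    'Tomate granel',
    'Brócoli', '1 un.',
    'Tomate', 'unid',
]

rgx2 = re.compile(r'(.+?) +(?=\d|unid)', re.IGNORECASE)

# split() yields an empty leading string whenever the match starts at
# index 0, so empty parts are filtered out.
new_vals2 = [part for ov in orig_vals for part in rgx2.split(ov) if part]

print(new_vals2 == expected)  # True
```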
