Extract keywords from a pandas DataFrame column without also matching nested keywords

Posted 2024-09-22 20:39:25


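I have a pandas DataFrame with two free-form text columns in which people describe the model or trim level of their vehicle (e.g. LE, 1LT, RS, SS). In these columns, some people enter only the model (e.g. LE), while others add extra text (e.g. 2dr Convertible SS w/2SS). In addition, the model levels form a hierarchy: SS < 1SS < 2SS.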

I want to extract these model or trim levels and store them in a new column of my DataFrame (e.g. 1LS = 1LS, ZL-1 = ZL1, and so on).

import pandas as pd

# the model can be stored in either the 'SubModel' or the 'Trim' column
data = [{'SubModel': 'SS-EDITION', 'Trim': 'SS-EDITION(MANUAL 6 SPEED)  Coupe 2-Door'},
        {'SubModel': 'ZL1', 'Trim': 'ZL1 Coupe 2-Door'},
        {'SubModel': 'N/A', 'Trim': 'SS TRANSFORMER'},
        {'SubModel': '1LT RS AUTO BLUETOOTH REAR CAM', 'Trim': 'N/A'},
        {'SubModel': 'N/A', 'Trim': 'LS'},
        {'SubModel': 'Camaro SS', 'Trim': 'Camaro SS'},
        {'SubModel': 'Dusk Edition', 'Trim': 'N/A'},
        {'SubModel': 'Camaro SS W/ RS Pkg', 'Trim': 'Camaro SS W/ RS Pkg'},
        {'SubModel': '2dr Coupe SS w/2SS', 'Trim': '2dr Coupe SS w/2SS'},
        {'SubModel': '2dr Convertible LT w/1LT', 'Trim': '2dr Convertible LT w/1LT'},
        {'SubModel': 'N/A', 'Trim': '2LT'},
        {'SubModel': "LT RS 6-SPD SUNROOF REAR CAM 20'S", 'Trim': '1LT Coupe 2-Door'},
        {'SubModel': '2dr Convertible SS w/2SS', 'Trim': '2dr Convertible SS w/2SS'},
        {'SubModel': '2dr Convertible LT w/2LT', 'Trim': '2dr Convertible LT w/2LT'},
        {'SubModel': 'N/A', 'Trim': '2LT'},
        {'SubModel': 'N/A', 'Trim': 'RARE ZL1 - LOW MILES'},
        {'SubModel': "2SS AUTO LEATHER NAV HUD 20'S", 'Trim': 'SS Coupe 2-Door'},
        {'SubModel': 'SS', 'Trim': 'SS Coupe 2-Door'},
        {'SubModel': 'N/A', 'Trim': 'Car'},
        {'SubModel': 'N/A', 'Trim': '2LT'}]

# load data into dataframe
df = pd.DataFrame(data)

# create a dict of all models, including alternative spellings
models = {'LE' : 'LE',
          '1LE' : '1LE',
          '2LE' : '2LE',
          'LT' : 'LT',
          '1LT' : '1LT',
          '2LT' : '2LT',
          'LS' : 'LS',
          '1LS' : '1LS',
          '2LS' : '2LS',
          'SS' : 'SS',
          '1SS' : '1SS',
          '2SS' : '2SS',
          'ZL1' : 'ZL1',
          'ZL/1' : 'ZL1',
          'ZL-1' : 'ZL1',
          'COPO' : 'COPO',
          'copo' : 'copo'}

# look for each key in the models dict, and if found, return the value for that key for the column 'TRIM'
def trim_level(row):

    for key in models.keys():
        if key in (row['Trim'] or row['SubModel']):
            return models[key]


df['TRIM'] = df.apply(trim_level, axis=1)

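My current solution has a problem: 2SS is classified as SS, and 2LT is classified as LT. I also don't know how to handle descriptions that contain two different models, e.g. SS w/2SS.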
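One possible way to approach this, offered as a sketch rather than a definitive answer: `key in text` is a plain substring test, so 'SS' is found inside '2SS', and because the dict is scanned in insertion order the shorter key wins. Separately, `row['Trim'] or row['SubModel']` evaluates to `row['Trim']` whenever that string is non-empty, so a value such as 'N/A' in Trim hides SubModel entirely. The version below matches whole tokens with a regular expression, looks at both columns, and keeps the highest level it finds. The names `MODELS`, `PATTERN` and `rank`, and the rule that a leading digit outranks the bare name, are assumptions made for illustration; how to rank trims from different families (say LT against SS) is not specified in the question, so ties simply keep the first match.

```python
import re

# Every spelling we expect (upper-case) mapped to its canonical name.
MODELS = {
    'LE': 'LE', '1LE': '1LE', '2LE': '2LE',
    'LT': 'LT', '1LT': '1LT', '2LT': '2LT',
    'LS': 'LS', '1LS': '1LS', '2LS': '2LS',
    'SS': 'SS', '1SS': '1SS', '2SS': '2SS',
    'ZL1': 'ZL1', 'ZL/1': 'ZL1', 'ZL-1': 'ZL1',
    'COPO': 'COPO',
}

# Longest spellings first, so '2SS' is tried before 'SS' at the same position.
# The lookarounds reject a key that sits inside a longer alphanumeric token
# (e.g. 'LE' inside 'LEATHER', or 'SS' inside '2SS').
_alternatives = '|'.join(re.escape(k) for k in sorted(MODELS, key=len, reverse=True))
PATTERN = re.compile(rf'(?<![A-Z0-9])(?:{_alternatives})(?![A-Z0-9])', re.IGNORECASE)


def rank(model):
    # Assumed hierarchy: SS < 1SS < 2SS, i.e. the leading digit is the level.
    return int(model[0]) if model[0].isdigit() else 0


def trim_level(row):
    """Return the highest-ranked model found in 'Trim' or 'SubModel', else None."""
    found = []
    for col in ('Trim', 'SubModel'):
        text = row[col]
        if isinstance(text, str):  # skips NaN / None cells
            found += [MODELS[m.upper()] for m in PATTERN.findall(text)]
    # max() keeps the first element among equal ranks.
    return max(found, key=rank) if found else None


# Works on plain dicts as well as DataFrame rows.
print(trim_level({'SubModel': '2dr Coupe SS w/2SS', 'Trim': '2dr Coupe SS w/2SS'}))  # 2SS
print(trim_level({'SubModel': 'Dusk Edition', 'Trim': 'N/A'}))                       # None
```

With the DataFrame from the question, this would be applied as `df['TRIM'] = df.apply(trim_level, axis=1)`. Rows where neither column contains a known model come back as `None`.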

