使用多个元素和多个分隔符解析字符串的熊猫

2024-09-30 08:24:16 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个27列的Excel文件。我有一个名为“Users”的列,它有多个元素。在Excel中,每个元素在同一单元格的新行上分隔,每个子元素用分号(;)分隔在圆括号(())之间。但是,括号也可能存在于子元素中。下面使用示例数据显示了表格在Excel中的显示方式

Data Table

下面是它如何使用Pandas作为数据帧导入Python

df = pd.DataFrame({'CN ON': ['WB-01','ZD-DD','DE-02','WZ-D8','HJ-78'],
                   'Type': ['First','Second','First','Second','Third'],
                   'Status': ['Completed','Started','Started','Final','Pending'],
                   'User': ['Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)','Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)','Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)',  'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)', 'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nAdmin Assistant (WRIST PAD; wristpad@domain.com; 999 999 9999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)']
                   })

逻辑

现在,我想应用以下逻辑并创建一个名为ownership的新列。 如果Status = 'Completed'那么Ownership = 'Completed' 如果Status = 'Started',则Ownership =每个Admin的名称。 如果Status = 'Final',那么Ownership =每个Supervisor的名称。 如果Status = 'Pending',则Ownership =每个Admin Assistant的名称

可能的用户角色是'Admin', 'Admin Assistant', 'Supervisor','Alternative Supervisor'。括号内的第一个子元素是该角色中的人员的姓名。第二个子元素是电子邮件地址。第三个子元素是非标准化电话号码。它可以有破折号,括号,空格,或者全部加在一起。子元素中的分隔符是分号;。我相信元素之间的分隔符在Python中是\n,因为当我导入数据帧(使用上面的脚本)时,它就是这样显示的

Status     User Role
Started    Admin
Pending    Admin Assistant
Final      Supervisor
Completed  Completed

预期结果Desired Results

所需结果的Python脚本

df_results = pd.DataFrame({'CN ON': ['WB-01','ZD-DD','DE-02','WZ-D8','HJ-78'],
                           'Type': ['First','Second','First','Second','Third'],
                           'Status': ['Completed','Started','Started','Final','Pending'],
                           'User': ['Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)','Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)','Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)',  'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)', 'Admin (PAPER CLIP; paper.clip@domain.com; 999999999)\nAdmin (THUMB TACK; thumbtack@domain.com; 999-999-999)\nAdmin Assistant (MOUSE PAD; mousepad@domain.com; (999) 999999)\nAdmin Assistant (WRIST PAD; wristpad@domain.com; 999 999 9999)\nSupervisor (WHITE BOARD; whiteboard@domain.com; 999-999999)\nAlternative Supervisor (CHALK BOARD; chalkboard@domain.com; (999) 999-999)'],
                           'Ownership': ['Completed','PAPER CLIP','PAPER CLIP, THUMB TACK','WHITE BOARD','MOUSE PAD, WRIST PAD']
                           })

我尝试使用.split函数,但是我不知道如何使用多个元素和多个分隔符进行拆分,特别是当可能存在多个括号时。然后,我不知道如何仅从该字段中提取某些元素,因为根据另一个字段的条件,可能会有多个实例

任何指导或帮助将不胜感激!如果我需要澄清什么,请告诉我


Tags: boardcom元素clipadmindomainpaperassistant
1条回答
网友
1楼 · 发布于 2024-09-30 08:24:16

第一个任务是提取所有角色的名称,另一个任务很简单:

roles = (df['User'].str.split('\n', expand=True)
     .stack()
     .str.extract('^([\w\s]*)\s+\(([\w\s]*)[;|\)]')
     .reset_index()
     .groupby(['level_0', 0])[1]
     .agg(', '.join)
     .unstack(level=0)
)

# assign Completed ownership
roles['Completed'] = 'Completed'

ownership_mask = {
    'Started' : 'Admin',
    'Pending' : 'Admin Assistant',
    'Final'   : 'Supervisor',
    'Completed': 'Completed'
}

df['ownership'] = roles.lookup(df.index, df['Status'].map(ownership_mask))

输出(df['Status']):

0                 Completed
1                PAPER CLIP
2    PAPER CLIP, THUMB TACK
3               WHITE BOARD
4      MOUSE PAD, WRIST PAD
Name: ownership, dtype: object

注意我们可以使用np.select代替df['Status'].map

# no need to do `roles['Completed'] = 'Completed':
df['Ownership'] = np.select([df['Status'].eq('Started'),
                             df['Status'].eq('Pending'),
                             df['Status'].eq('Final')],
                            [roles['Admin'], roles['Admin Assistant'], roles['Supervisor'] ],
                            'Completed'
                           )

相关问题 更多 >

    热门问题