Merging and comparing DataFrames on multi-column conditions

Posted 2024-09-27 04:22:05


I'm not sure this is the best or even the correct way to do what I want.

I have the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([
        ['1-1-2020', '123', 'How can I Help?', 'Delivered'],
        ['1-1-2020', '123', 'How can I Help?', 'Opened'],
        ['1-2-2021', '100', 'New Offer', 'Delivered'],
        ['1-2-2021', '100', 'New Offer', 'Delivered'],
        ['1-4-2021', '144', 'Last chance, buy now!', 'Delivered'],
        ['1-4-2021', '144', 'Last chance, buy now!', 'Delivered'],
        ['2-4-2021', '144', 'Last chance, buy now!', 'Opened'],
    ]),
    columns=['Date', 'Customer_ID', 'Subject', 'Status'],
)


    Date    Customer_ID     Subject              Status
0   1-1-2020    123     How can I Help?         Delivered
1   1-1-2020    123     How can I Help?         Opened
2   1-2-2021    100     New Offer               Delivered
3   1-2-2021    100     New Offer               Delivered
4   1-4-2021    144     Last chance, buy now!   Delivered
5   1-4-2021    144     Last chance, buy now!   Delivered
6   2-4-2021    144     Last chance, buy now!   Opened

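In this df, customer 123 received one email and then opened it (second row). Customer 100 was sent the same email twice and opened neither. Customer 144 was sent the same email twice and opened one of them.

I'm trying to track the delivered and opened status of every email for every customer, along with the date of the last action.

So I created two DataFrames, one for delivered rows and one for opened rows, and left-merged the opened rows onto the delivered rows to track the opens: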

df_del = df.loc[(df['Status'] == 'Delivered')]
df_open = df.loc[(df['Status'] == 'Opened')]

d = df_del.rename(columns={'Date': 'Date Delivered'})
o = df_open.rename(columns={'Date': 'last action date', 'Status': 'Open Status'})

w = d.merge(o, on=['Customer_ID','Subject'], how='left')

w

This gives:

Date Delivered  Customer_ID       Subject            Status     last action date Open Status
0   1-1-2020    123         How can I Help?           Delivered     1-1-2020       Opened
1   1-2-2021    100         New Offer                 Delivered         NaN        NaN
2   1-2-2021    100         New Offer                 Delivered         NaN        NaN
3   1-4-2021    144         Last chance, buy now!     Delivered     2-4-2021       Opened
4   1-4-2021    144         Last chance, buy now!     Delivered     2-4-2021       Opened

What I expected is:

Date Delivered  Customer_ID       Subject            Status     last action date Open Status
0   1-1-2020    123         How can I Help?           Delivered     1-1-2020       Opened
1   1-2-2021    100         New Offer                 Delivered     1-2-2021       NaN
2   1-2-2021    100         New Offer                 Delivered     1-2-2021       NaN
3   1-4-2021    144         Last chance, buy now!     Delivered     2-4-2021       Opened
4   1-4-2021    144         Last chance, buy now!     Delivered     1-4-2021       NaN
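
The gap between the two tables comes from the merge keys: `Customer_ID` and `Subject` together do not identify a single email, so for customer 144 both delivered rows pair with the one opened row. A minimal sketch of that effect on a reduced copy of the data (the names `delivered`, `opened` and `merged` are illustrative and not part of the question):

```python
import pandas as pd

# Two deliveries of the same email to customer 144 ...
delivered = pd.DataFrame({
    'Date Delivered': ['1-4-2021', '1-4-2021'],
    'Customer_ID': ['144', '144'],
    'Subject': ['Last chance, buy now!'] * 2,
})
# ... but only one open.
opened = pd.DataFrame({
    'last action date': ['2-4-2021'],
    'Customer_ID': ['144'],
    'Subject': ['Last chance, buy now!'],
    'Open Status': ['Opened'],
})

# Both delivered rows share the same key, so each one matches the single opened row.
merged = delivered.merge(opened, on=['Customer_ID', 'Subject'], how='left')
print(merged['Open Status'].tolist())  # ['Opened', 'Opened']
```

Matching each open to exactly one delivery therefore needs an extra key that tells the repeated deliveries apart.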

3 answers

Another, slightly different option: build a pseudo message id with groupby(...).cumcount() and fill the NaN values with combine_first:

# Create a "message_id"
df['m_id'] = (
    df.groupby(['Customer_ID', 'Subject', 'Status']).cumcount()
)

# Create Mask For Delivered Status
m = df.Status.eq('Delivered')

# Merge Delivered and ~Delivered
df = (
    df[m].rename(columns={'Date': 'Date Delivered'})
        .merge(df[~m].rename(columns={'Date': 'last action date',
                                      'Status': 'Open Status'}),
               on=['Customer_ID', 'Subject', 'm_id'],
               how='left')
)

# Fill NaN in last action date column
df['last action date'] = (
    df['last action date'].combine_first(df['Date Delivered'])
)

df

  Date Delivered Customer_ID                Subject     Status  m_id last action date Open Status
0       1-1-2020         123        How can I Help?  Delivered     0         1-1-2020      Opened
1       1-2-2021         100              New Offer  Delivered     0         1-2-2021         NaN
2       1-2-2021         100              New Offer  Delivered     1         1-2-2021         NaN
3       1-4-2021         144  Last chance, buy now!  Delivered     0         2-4-2021      Opened
4       1-4-2021         144  Last chance, buy now!  Delivered     1         1-4-2021         NaN
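
To see what the pseudo message id does, here is a small sketch of `cumcount` on the two-delivery, one-open case for customer 144 (the frame `demo` is an illustrative subset, not part of the answer):

```python
import pandas as pd

demo = pd.DataFrame({
    'Customer_ID': ['144', '144', '144'],
    'Subject': ['Last chance, buy now!'] * 3,
    'Status': ['Delivered', 'Delivered', 'Opened'],
})

# Number the rows within each (customer, subject, status) group, starting at 0.
demo['m_id'] = demo.groupby(['Customer_ID', 'Subject', 'Status']).cumcount()
print(demo['m_id'].tolist())  # [0, 1, 0]
```

The single open gets id 0, so once `m_id` is part of the merge keys it pairs only with the first delivery and the second delivery is left with NaN.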

Just adding another way, using np.where and groupby:

df['last action date'] = df.groupby('Customer_ID').Date.transform('last')
df['op'] = (
    df.groupby(['Customer_ID', 'Subject'])['Status'].cumcount()
)
df['Open Status'] = np.where((df.groupby(['Customer_ID'])\
    .Status.transform('last') == 'Opened') & (df.op==0), 'Opened',np.nan)
df[df.Status=='Delivered'].drop(columns=['op'])

Output:

    Date    Customer_ID Subject             Status  last action date    Open Status
0   1-1-2020    123 How can I Help?         Delivered   1-1-2020    Opened
2   1-2-2021    100 New Offer               Delivered   1-2-2021    nan
3   1-2-2021    100 New Offer               Delivered   1-2-2021    nan
4   1-4-2021    144 Last chance, buy now!   Delivered   2-4-2021    Opened
5   1-4-2021    144 Last chance, buy now!   Delivered   2-4-2021    nan
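
Row 5 in this output shows 2-4-2021 rather than the 1-4-2021 of the expected table, because `transform('last')` broadcasts each customer's final date to every row of that customer. A minimal sketch of that behaviour (the frame `demo` and the name `last` are illustrative only):

```python
import pandas as pd

demo = pd.DataFrame({
    'Customer_ID': ['144', '144', '144'],
    'Date': ['1-4-2021', '1-4-2021', '2-4-2021'],
})

# transform returns one value per input row: the group's last Date, repeated.
last = demo.groupby('Customer_ID')['Date'].transform('last')
print(last.tolist())  # ['2-4-2021', '2-4-2021', '2-4-2021']
```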

Let's use a pseudo "order" column:

df_del = df.loc[(df['Status'] == 'Delivered')].copy()
df_open = df.loc[(df['Status'] == 'Opened')].copy()

df_del['order'] = df_del.groupby(['Customer_ID']).cumcount()
df_open['order'] = df_open.groupby(['Customer_ID']).cumcount()

d = df_del.rename(columns={'Date': 'Date Delivered'})
o = df_open.rename(columns={'Date': 'last action date', 'Status': 'Open Status'})

w = d.merge(o, on=['Customer_ID','Subject','order'], how='left')

w['last action date'] = w['last action date'].fillna(w['Date Delivered'])

w

Output:

  Date Delivered Customer_ID                Subject     Status  order last action date Open Status
0       1-1-2020         123        How can I Help?  Delivered      0         1-1-2020      Opened
1       1-2-2021         100              New Offer  Delivered      0         1-2-2021         NaN
2       1-2-2021         100              New Offer  Delivered      1         1-2-2021         NaN
3       1-4-2021         144  Last chance, buy now!  Delivered      0         2-4-2021      Opened
4       1-4-2021         144  Last chance, buy now!  Delivered      1         1-4-2021         NaN
