How do I create features in featuretools for rows that share the same id and have a time index?

Posted 2024-09-29 23:31:57

I have a DataFrame like this:

import pandas as pd

data = {'Customer': ['C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C3', 'C3', 'C3'],
        'NumOfItems': [3, 2, 4, 5, 5, 6, 10, 6, 14],
        'PurchaseTime': ["2014-01-01", "2014-01-02", "2014-01-03",
                         "2014-01-01", "2014-01-02", "2014-01-03",
                         "2014-01-01", "2014-01-02", "2014-01-03"]
       }
df = pd.DataFrame(data)
df

I want to create a feature such as the maximum so far for each customer:

'MaxPerID(NumOfItems)': [3, 3, 4, 5, 5, 6, 10, 10, 14]  # the output I want
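For reference, the desired column can be reproduced in plain pandas, which gives something to check any featuretools output against. This is a minimal sketch, assuming the running maximum should follow PurchaseTime order within each customer; the column name MaxPerID(NumOfItems) is simply the label used above.

```python
import pandas as pd

data = {
    "Customer": ["C1", "C1", "C1", "C2", "C2", "C2", "C3", "C3", "C3"],
    "NumOfItems": [3, 2, 4, 5, 5, 6, 10, 6, 14],
    "PurchaseTime": ["2014-01-01", "2014-01-02", "2014-01-03",
                     "2014-01-01", "2014-01-02", "2014-01-03",
                     "2014-01-01", "2014-01-02", "2014-01-03"],
}
df = pd.DataFrame(data)
df["PurchaseTime"] = pd.to_datetime(df["PurchaseTime"])

# Sort by customer and time so the running maximum respects purchase order,
# then take the cumulative maximum within each customer.
ordered = df.sort_values(["Customer", "PurchaseTime"])
# Assignment aligns on the original row index, so row order in df is kept.
df["MaxPerID(NumOfItems)"] = ordered.groupby("Customer")["NumOfItems"].cummax()

print(df["MaxPerID(NumOfItems)"].tolist())
# [3, 3, 4, 5, 5, 6, 10, 10, 14]
```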

So I set up an EntitySet and normalized it:

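import featuretools as ft

# Uses the pre-1.0 featuretools API (entity_from_dataframe / normalize_entity)
es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="customer",
                              dataframe=df,
                              index='index',
                              time_index="PurchaseTime",
                              make_index=True)

# Split out one parent entity with a row per unique Customer
es = es.normalize_entity(base_entity_id="customer",
                         new_entity_id="sessions",
                         index="Customer")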

But creating the feature matrix does not give the result I want:

feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity="customer",
                                  agg_primitives=["max"],
                                  max_depth=3)
feature_matrix.head(9)

sessions.MAX(customer.NumOfItems)  
index                                                                         
0                                      4                                    
3                                      6                                    
6                                     14                                    
1                                      4                                    
4                                      6                                    
7                                     14                                    
2                                      4                                    
5                                      6                                    
8                                     14                                    

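The returned feature is the maximum over all days for each customer, with the rows sorted by time. If I run the same code without time_index="PurchaseTime", the result is still the maximum for each specific customer, but in the original row order: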

       sessions.MAX(customer.NumOfItems)
index
0                                      4
1                                      4
2                                      4
3                                      6
4                                      6
5                                      6
6                                     14
7                                     14
8                                     14

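I want a combination of the two: the maximum so far for each specific customer. Is this possible? I tried setting es['customer']['Customer'].interesting_values = ['C1', 'C2', 'C3'], but that did not work. I also tried modifying the new normalized entity and writing my own primitive for this.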
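The gap between the two behaviours can be illustrated outside featuretools. An aggregation such as max yields one value per customer, which every row of that customer then shares, whereas a cumulative transform yields one value per row. The pandas sketch below is only an analogy for that distinction, not a description of featuretools internals; the column names per_customer_max and running_max are made up for illustration.

```python
import pandas as pd

# Rows are already in purchase order within each customer.
df = pd.DataFrame({
    "Customer": ["C1", "C1", "C1", "C2", "C2", "C2", "C3", "C3", "C3"],
    "NumOfItems": [3, 2, 4, 5, 5, 6, 10, 6, 14],
})

grouped = df.groupby("Customer")["NumOfItems"]

# Aggregation broadcast back to the rows: one value per customer.
df["per_customer_max"] = grouped.transform("max")
# Cumulative transform: one value per row, using only the rows seen so far.
df["running_max"] = grouped.cummax()

print(df)
```

Here per_customer_max matches the 4 / 6 / 14 values in the feature matrices above, while running_max matches the output I am after.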

I am new to featuretools, so any help is much appreciated.

This question is similar to mine, but its solution has no time_index and creates the new features on the normalized entity.


1 answer
User
#1 · Posted 2024-09-29 23:31:57

Thanks for the question. You can get the expected output by using a group-by transform primitive:

fm, fd = ft.dfs(
    entityset=es,
    target_entity="customer",
    groupby_trans_primitives=['cum_max'],
)

You should get the cumulative maximum number of items for each customer:

column = 'CUM_MAX(NumOfItems) by Customer'
actual = fm[[column]].sort_values(column)
expected = {'MaxPerID(NumOfItems)': [3, 3, 4, 5, 5, 6, 10, 10, 14]}
actual.assign(**expected)
       CUM_MAX(NumOfItems) by Customer  MaxPerID(NumOfItems)
index
0                                  3.0                     3
1                                  3.0                     3
2                                  4.0                     4
3                                  5.0                     5
4                                  5.0                     5
5                                  6.0                     6
6                                 10.0                    10
7                                 10.0                    10
8                                 14.0                    14
