Python/Pandas: best way to group by criteria?

df_nonull1 = df_nonull[(df_nonull['mn_earn_wne_p6'] < 20000)]
df_nonull2 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 20000) & (df_nonull['mn_earn_wne_p6'] < 30000)]
df_nonull3 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 30000) & (df_nonull['mn_earn_wne_p6'] < 40000)]
df_nonull4 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 40000)]

df_nonull1['inc_index'] = 1
df_nonull2['inc_index'] = 2
df_nonull3['inc_index'] = 3
df_nonull4['inc_index'] = 4

frames = [df_nonull1, df_nonull2, df_nonull3, df_nonull4]
results = pd.concat(frames)

2 answers

User

#1 · edited 2024-09-28 01:23:57

Edit. As Paul mentioned in the comments, there is a pd.cut function that is far more elegant than my original answer.

# equal-width bins
df['inc_index'] = pd.cut(df.A, bins=4, labels=[1, 2, 3, 4])

# custom bin edges
df['inc_index'] = pd.cut(df.A, bins=[0, 20000, 30000, 40000, 50000],
                         labels=[1, 2, 3, 4])
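
One caveat worth checking: by default pd.cut uses right-closed intervals, so a value of exactly 20000 lands in the first bin, while the question's filters are left-closed (>= 20000 goes to bucket 2) and open-ended at the top. A minimal sketch that should match those filters, assuming a small made-up frame (only the column name comes from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name is taken from the question.
df = pd.DataFrame({'mn_earn_wne_p6': [15000, 20000, 29999, 30000, 40000, 75000]})

# right=False makes every bin left-closed: [-inf, 20000), [20000, 30000), ...
# The infinite outer edges keep very small or very large values from becoming NaN.
df['inc_index'] = pd.cut(df['mn_earn_wne_p6'],
                         bins=[-np.inf, 20000, 30000, 40000, np.inf],
                         labels=[1, 2, 3, 4],
                         right=False)
print(df)
```

With finite outer edges such as 0 and 50000, anything outside them would come back as NaN, which is why the sketch uses infinite edges.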

Note that the labels argument is optional. pd.cut produces an ordered categorical Series, so you can sort by the resulting column regardless of the labels.
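
A minimal sketch of that sorting behaviour, assuming random integer columns A and B and the bin edges 0, 7, 13, 15, 20 (the edges are inferred from the intervals in this answer's output; the data and the seed are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, an assumption for repeatability
df = pd.DataFrame(rng.integers(1, 21, size=(10, 2)), columns=['A', 'B'])

# No labels given: each row gets an interval such as (0, 7] as its category.
df['inc_index'] = pd.cut(df['A'], bins=[0, 7, 13, 15, 20])

# Ordered categoricals sort by bin order, not by the text of the label.
print(df.sort_values('inc_index'))
```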

Output (modulo the random numbers):

    A   B inc_index
6   2  16    (0, 7]
7   5   5    (0, 7]
3  12   6   (7, 13]
4  10   8   (7, 13]
5   9  13   (7, 13]
1  15  10  (13, 15]
2  15   7  (13, 15]
8  15  13  (13, 15]
0  18  10  (15, 20]
9  16  12  (15, 20]

Original answer. This generalizes Alexander's answer to variable bucket widths. You can build the inc_index column with Series.apply. For example:

def bucket(v):
    # of course, the thresholds can be arbitrary
    if v < 20000:
        return 1
    if v < 30000:
        return 2
    if v < 40000:
        return 3
    return 4

df['inc_index'] = df.mn_earn_wne_p6.apply(bucket)

Or, if you really want to avoid a def:

df['inc_index'] = df.mn_earn_wne_p6.apply(
    lambda v: 1 if v < 20000 else 2 if v < 30000 else 3 if v < 40000 else 4)

Note that if you only want to split the range of mn_earn_wne_p6 into equal-width buckets, Alexander's approach is cleaner and faster:

df['inc_index'] = df.mn_earn_wne_p6 // bucket_width

Then, to get the result you want, you can sort by this column:

df.sort_values('inc_index')

You can also groupby('inc_index') to aggregate results within each bucket.
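
As a rough illustration of that aggregation, here is a sketch that reuses the bucket thresholds from this answer on made-up data. The 'debt' column is hypothetical and exists only to have something to aggregate:

```python
import pandas as pd

# Hypothetical data; 'debt' is an invented column used only for illustration.
df = pd.DataFrame({
    'mn_earn_wne_p6': [15000, 18000, 25000, 35000, 38000, 60000],
    'debt': [10, 20, 30, 40, 60, 50],
})

def bucket(v):
    # same thresholds as the bucket function earlier in this answer
    if v < 20000:
        return 1
    if v < 30000:
        return 2
    if v < 40000:
        return 3
    return 4

df['inc_index'] = df['mn_earn_wne_p6'].apply(bucket)

# One row per bucket, with the row count and the mean of the invented column.
summary = df.groupby('inc_index')['debt'].agg(['count', 'mean'])
print(summary)
```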

User

#2 · edited 2024-09-28 01:23:57

If all values are between 10k and 50k, you can assign the index with integer division (//):

df_nonull['inc_index'] = df_nonull.mn_earn_wne_p6 // 10000

You don't need to split the dataframe apart and concatenate the pieces; you just need a way to derive inc_index from the mn_earn_wne_p6 field.
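
For values outside the 10k to 50k range, integer division alone yields indices such as 0 or 9. One possible way to fold them into the question's four buckets is to clip the result; a sketch on made-up values, not part of the original answer:

```python
import pandas as pd

# Hypothetical sample values, including some outside 10k-50k.
df_nonull = pd.DataFrame({'mn_earn_wne_p6': [8000, 12000, 20000, 39999, 40000, 95000]})

# Plain integer division gives 0, 1, 2, 3, 4, 9 for these values.
raw = df_nonull['mn_earn_wne_p6'] // 10000

# Clipping maps everything below 20000 to 1 and everything from 40000 up to 4.
df_nonull['inc_index'] = raw.clip(lower=1, upper=4)
print(df_nonull)
```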
