Efficient way to find the maximum absolute value for many columns

import random
import pandas as pd

random.seed(2)
n_observations_per_user = 3
n_users = 3
n_dimensions = 2

ids = []
for i in range(n_users):
    ids += [i]*n_observations_per_user

data = {"id": ids}
for idim in range(n_dimensions):
    data[f"dim{idim}"] = [random.uniform(-10, 10) for i in range(n_observations_per_user*n_users)]

df = pd.DataFrame(data)
df

   id      dim0      dim1
0   0  9.120685  2.136035
1   0  8.956550  1.624080
2   0 -8.868973 -6.832343
3   1 -8.302560 -1.386607
4   1  6.709978 -2.129364
5   1  4.719400  4.460242
6   2  3.394608  9.896391
7   2 -3.837271  8.987909
8   2  2.118883  0.883541

abs_max_fun = lambda x: x[x.abs().idxmax()]
agg_dict_absmax = {"id": "first"}
for idim in range(n_dimensions):
    agg_dict_absmax[f"dim{idim}"] = abs_max_fun

df.groupby("id").agg(agg_dict_absmax)

    id      dim0      dim1
id
0    0  9.120685 -6.832343
1    1 -8.302560  4.460242
2    2 -3.837271  9.896391

# Create new, large df, with the following:
n_observations_per_user = 100
n_users = 1000
n_dimensions = 100

# Measure time for max-abs
import time

abs_max_fun = lambda x: x[x.abs().idxmax()]
agg_dict_absmax = {"id": "first"}
for idim in range(n_dimensions):
    agg_dict_absmax[f"dim{idim}"] = abs_max_fun

start = time.time()
df.groupby("id").agg(agg_dict_absmax)
end = time.time()
print(end - start)

import time

agg_dict_max = {"id": "first"}
for idim in range(n_dimensions):
    agg_dict_max[f"dim{idim}"] = "max"

start = time.time()
df.groupby("id").agg(agg_dict_max)
end = time.time()
print(end - start)

1 Answer

User

#1 · Posted on 2024-10-02 00:27:18

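Instead of running an (inefficient) absolute-maximum calculation on every group, you can use the optimized built-in operations to get each group's maximum and minimum during the groupby, and then work out which of the two has the larger absolute value.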

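import pandas as pd
import numpy as np

# Note: 1_000_000 x 1_000 float64 values need about 8 GB of memory;
# reduce n_rows or n_cols to try this on a smaller machine.
n_rows = 1_000_000
n_cols = 1_000
df = pd.DataFrame(np.random.random((n_rows, n_cols)) - 0.5)
df["group"] = np.random.randint(0, 400, (n_rows))

# Per-group maximum and minimum, both computed by the optimized built-ins.
df_max = df.groupby("group").max()
df_min = df.groupby("group").min()

# Keep the maximum where its magnitude is larger, otherwise keep the minimum.
df_absmax = pd.DataFrame(
    np.where(df_max > -df_min, df_max, df_min),
    index=df_max.index,
    columns=df_max.columns
)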

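The example above takes a little more than twice as long to run as df.groupby("group").max() on its own, which is what you would expect from two grouped reductions (max and min) plus one element-wise comparison.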
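As a sanity check, the max/min approach can be compared with the per-group lambda from the question on the small example frame. This is only a minimal sketch: the helper name groupby_absmax and the variable names df_small, fast and slow are made up for illustration, and the values are copied from the table printed in the question.

```python
import numpy as np
import pandas as pd


def groupby_absmax(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Hypothetical helper: signed value of largest magnitude per group and column."""
    grouped = df.groupby(key)
    df_max = grouped.max()
    df_min = grouped.min()
    # Keep the maximum where its magnitude is larger, otherwise keep the minimum.
    return pd.DataFrame(
        np.where(df_max > -df_min, df_max, df_min),
        index=df_max.index,
        columns=df_max.columns,
    )


# Small frame with the values shown in the question's printed output.
df_small = pd.DataFrame({
    "id": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "dim0": [9.120685, 8.956550, -8.868973, -8.302560, 6.709978,
             4.719400, 3.394608, -3.837271, 2.118883],
    "dim1": [2.136035, 1.624080, -6.832343, -1.386607, -2.129364,
             4.460242, 9.896391, 8.987909, 0.883541],
})

fast = groupby_absmax(df_small, "id")

# Reference result: the per-group lambda used in the question.
abs_max_fun = lambda x: x[x.abs().idxmax()]
slow = df_small.groupby("id").agg({"dim0": abs_max_fun, "dim1": abs_max_fun})

# Both approaches should pick the same signed values.
assert np.allclose(fast.to_numpy(), slow.to_numpy())
print(fast)
```

One difference to keep in mind: when a group's maximum and minimum have exactly the same magnitude, the comparison df_max > -df_min is False, so this sketch returns the negative value, whereas the lambda returns whichever of the two comes first in the group.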
