Pandas: rolling in-place reshape

Posted 2024-06-25 23:08:17


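I have a large DataFrame that I can't share here. It has 509,762 rows and 49 columns. It is time-series data in which every 22 consecutive rows share the same ID. I am trying to turn each group of 22 rows into a single row, and I have a time constraint.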

import numpy as np
import pandas as pd

play_ids = df_train['PlayId'].unique()
df_train_plays = pd.DataFrame(np.zeros((len(play_ids), 385)))

player_features = ['Team', 'X', 'Y', 'S', 'A', 'Dis', 'Orientation', 'Dir', 'NflId', 'DisplayName', 'JerseyNumber', 'PlayerHeight', 'PlayerWeight', 'PlayerBirthDate', 'PlayerCollegeName', 'Position']
# Positions of the non-player features that repeat after flattening play_df
play_cols_to_drop = [49, 50, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97]

for i, play in enumerate(play_ids):
    # Iterating over play_ids and making vectors from them
    play_df = df_train[df_train['PlayId'] == play].values.flatten()
    play_df = pd.DataFrame(play_df).T

    # Dropping the repeating non-player features
    for col in play_cols_to_drop:
        cols = list(range(col, 1078, 49))
        play_df.drop(columns=cols, inplace=True)

    # Writing the values of the new vector into df_train_plays
    # This is the part that takes most of the time... It has to be done in place
    df_train_plays.loc[i, :] = play_df.values[0]
    print(f'Reshaped Play {i}')

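This is the approach I tried. It iterates over the IDs, flattens each group of rows, drops the repeated columns, and writes the result into a new DataFrame. I have to repeat this operation 23,171 times, and it takes far too long. Writing into a new DataFrame is the slow part, so the operation has to be done in place.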
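For comparison, the same steps (flatten each group, drop the repeated positions) can probably be done without a Python loop by reshaping the underlying NumPy array once. This is only a sketch under assumptions: each play's rows are contiguous and consistently ordered, every play has the same number of rows, and the helper name `flatten_plays` and the toy frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def flatten_plays(df, rows_per_play, cols_to_drop):
    """Hypothetical helper: turn each block of rows_per_play rows into one row."""
    n_cols = df.shape[1]
    width = rows_per_play * n_cols
    # One reshape replaces the per-play filter + flatten.
    wide = df.to_numpy().reshape(-1, width)
    # Mask out every position that the loop version drops:
    # col, col + n_cols, col + 2 * n_cols, ... for each listed col.
    keep = np.ones(width, dtype=bool)
    for col in cols_to_drop:
        keep[col::n_cols] = False
    return pd.DataFrame(wide[:, keep])

# Toy data: 2 plays, 2 rows per play, 3 columns.
toy = pd.DataFrame({
    'PlayId': [1, 1, 2, 2],
    'X': [0.1, 0.2, 0.3, 0.4],
    'Y': [1.0, 2.0, 3.0, 4.0],
})
# Position 3 is the PlayId repeated in the second row of each play.
toy_wide = flatten_plays(toy, rows_per_play=2, cols_to_drop=[3])
```

On the real data the call would presumably be `flatten_plays(df_train, 22, play_cols_to_drop)`. With mixed column types, `to_numpy()` returns an object array, so the resulting columns may need to be converted back to numeric types afterwards.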

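I then switched to a different in-place approach and got this huge vector from the data. It has 23.9 million rows. This time I have to turn every 1034 rows into one column. How can I do that without a loop?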

df_train_plays = df_train.set_index(['PlayId', 'Team'])

pd.concat([df_train_plays[col] for col in df_train_plays])
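Splitting a long vector into fixed-size blocks without a loop is usually done with a reshape. The sketch below assumes the values are laid out play by play, so that each consecutive block belongs to one play. A column-by-column `pd.concat` like the one above orders the values by column instead, so the layout should be checked first. The names `long_vec` and `values_per_play` are made up for illustration, and the block size of 4 stands in for 1034.

```python
import numpy as np
import pandas as pd

# Toy vector: 3 plays with 4 values each (4 stands in for 1034).
values_per_play = 4
long_vec = pd.Series(np.arange(12))

# One row per play: shape (n_plays, values_per_play).
per_play_rows = pd.DataFrame(long_vec.to_numpy().reshape(-1, values_per_play))

# One column per play, as asked: shape (values_per_play, n_plays).
per_play_cols = per_play_rows.T
```

The reshape raises a `ValueError` if the vector length is not a multiple of `values_per_play`, which doubles as a sanity check on the block size.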
