Speeding up pandas aggregation


I'm trying to count duplicate rows in a pandas DataFrame. The data, read from a csv file, looks like this:

feature, IV, IT
early/J_result/N, True, False
early/J_result/N, True, False
early/J_result/N, True, False
excellent/J_result/N, True, True
hillsdown/N, True, False
hillsdown/N, True, False

The desired output for the sample input above is:

feature, IV, IT, count
early/J_result/N, True, False, 3
excellent/J_result/N, True, True, 1
hillsdown/N, True, False, 2

My current code is:

import pandas as pd

def sum_up_token_counts(hdf_file):
    df = pd.read_csv(hdf_file, sep=', ')
    # count how many times each feature occurs
    counts = df.groupby('feature').count().feature
    assert counts.sum() == df.shape[0]  # no missing rows
    # keep one row per feature and attach its count
    df = df.drop_duplicates()
    df.set_index('feature', inplace=True)
    df['count'] = counts
    return df

This works as expected, but it takes a long time. I profiled it, and it looks like almost all of the time is spent in the groupby and the count.

Total time: 4.43439 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    28                                           
    29         1        57567  57567.0      1.3      df = pd.read_csv(hdf_file, sep=', ')
    30         1      4368529 4368529.0     98.5      counts = df.groupby('feature').count().feature
    31         1          174    174.0      0.0      assert counts.sum() == df.shape[0]  # no missing rows
    32         1         6234   6234.0      0.1      df = df.drop_duplicates()
    33         1          501    501.0      0.0      df.set_index('feature', inplace=True)
    34         1         1377   1377.0      0.0      df['count'] = counts
    35         1            1      1.0      0.0      return df
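For comparison, here is a minimal rewrite of the function above that sidesteps the slow groupby().count() path by using value_counts on the key column. This is a sketch of my own, not from the original post; engine='python' is only passed because of the multi-character separator.

import pandas as pd

def sum_up_token_counts_fast(hdf_file):
    # hypothetical rewrite: value_counts avoids the generic
    # groupby().count() machinery that dominates the profile above
    df = pd.read_csv(hdf_file, sep=', ', engine='python')
    counts = df['feature'].value_counts()
    out = df.drop_duplicates().set_index('feature')
    out['count'] = counts  # aligns on the 'feature' index
    return out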

Any ideas how to speed this code up?


Tags: csv, false, true, pandas, df, count, result, feature
1 Answer

With master/0.14 (to be released very soon), groupby counting is greatly sped up; see here.

v0.13.1 vs. master:

Setup:

In [1]: n = 10000

In [2]: offsets = np.random.randint(n, size=n).astype('timedelta64[ns]')

In [3]: dates = np.datetime64('now') + offsets

In [4]: dates[np.random.rand(n) > 0.5] = np.datetime64('nat')

In [5]: offsets[np.random.rand(n) > 0.5] = np.timedelta64('nat')

In [6]: value2 = np.random.randn(n)

In [7]: value2[np.random.rand(n) > 0.5] = np.nan

In [8]: obj = pd.util.testing.choice(['a', 'b'], size=n).astype(object)

In [9]: obj[np.random.randn(n) > 0.5] = np.nan

In [10]: df = DataFrame({'key1': np.random.randint(0, 500, size=n),
   ....:                 'key2': np.random.randint(0, 100, size=n),
   ....:                 'dates': dates,
   ....:                 'value2' : value2,
   ....:                 'value3' : np.random.randn(n),
   ....:                 'obj': obj,
   ....:                 'offsets': offsets})
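The NaN/NaT injection above is deliberate: count() reports non-null values per column, so the injected columns come out lower than the key columns. A quick way to see this (my illustration, not part of the original answer):

# per-column non-null counts within each group; 'dates', 'value2',
# 'obj' and 'offsets' had missing values injected, so their counts
# will be smaller than those of the fully populated columns
df.groupby(['key1', 'key2']).count().head()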

v0.13.1:

(the v0.13.1 timing output was not preserved on the original page)

v0.14.0 (master):

In [11]: %timeit df.groupby(['key1', 'key2']).count()
100 loops, best of 3: 6.25 ms per loop
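If you are stuck on 0.13.x, a commonly suggested workaround (my note, not from the answer) is groupby().size(), which was already fast. Mind the semantics: size() counts all rows in each group, NaN included, while count() counts non-null values per column.

# one number per group (rows, NaN included), unlike the per-column
# non-null counts that count() returns
sizes = df.groupby(['key1', 'key2']).size()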
