Averaging a large dataset over bins in numpy

Posted 2024-10-04 07:38:12

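I have a large (~100GB) dataset xs made up of structured numpy arrays x. I want to bin each array by a property p1 and find the mean and standard deviation of a property p2 within each bin. My approach, shown below, works but is rather slow. Is there a faster way? I can't fit the whole dataset in memory, but I do have many cores, so a good way to parallelize this would also be welcome.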

import numpy as np

nbins=30
bin_edges=np.linspace(0,1,nbins) 

# Running per-bin totals: count, sum of p2, sum of p2 squared
N, p2_total, p2sq_total = np.zeros((3,nbins))

for x in xs_generator():
    p1s = x['p1']
    p2s = x['p2']

    which_bin=np.digitize(p1s,bins=bin_edges)

    for this_bin,bin_edge in enumerate(bin_edges):
        these_p1s    = p1s[which_bin==this_bin]
        these_p2s    = p2s[which_bin==this_bin]

        N[this_bin]          += np.size(these_p1s)
        p2_total[this_bin]   += np.sum(these_p2s)
        p2sq_total[this_bin] += np.sum(these_p2s**2)

means_p2 = p2_total/N
# Population std per bin: sqrt(E[p2^2] - E[p2]^2)
stds_p2  = np.sqrt(p2sq_total/N - means_p2**2)

1 Answer

User

#1 · Posted 2024-10-04 07:38:12

You should use np.histogram:

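# For one chunk: p1s = x['p1'], p2s = x['p2'].
# When looping over chunks, add these results to running totals instead of overwriting them.
N, binDump = np.histogram( p1s, bins=bin_edges )
p2_total, binDump = np.histogram( p1s, bins=bin_edges, weights=p2s )
p2sq_total, binDump = np.histogram( p1s, bins=bin_edges, weights=p2s**2 )

means_p2 = p2_total/N
# Population std per bin: sqrt(E[p2^2] - E[p2]^2)
stds_p2  = np.sqrt(p2sq_total/N - means_p2**2)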

This avoids the inner loop over bins by simply repurposing the histogram function :)
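
For a dataset that arrives in chunks, the three histograms can be accumulated chunk by chunk and turned into means and standard deviations only at the end. The following is a minimal sketch under some assumptions, not a definitive implementation: the names bin_sums, binned_mean_std and demo_chunks are hypothetical, demo_chunks is only a random stand-in for xs_generator(), and the standard deviation is the population one (ddof=0).

```python
import numpy as np

def bin_sums(chunk, bin_edges):
    """Per-bin count, sum of p2, and sum of p2**2 for one structured-array chunk."""
    p1s, p2s = chunk['p1'], chunk['p2']
    n, _ = np.histogram(p1s, bins=bin_edges)
    s1, _ = np.histogram(p1s, bins=bin_edges, weights=p2s)
    s2, _ = np.histogram(p1s, bins=bin_edges, weights=p2s**2)
    return np.array([n, s1, s2], dtype=float)

def binned_mean_std(chunks, bin_edges):
    """Accumulate partial sums over all chunks, then derive mean and std per bin."""
    totals = np.zeros((3, len(bin_edges) - 1))
    for chunk in chunks:
        totals += bin_sums(chunk, bin_edges)
    n, s1, s2 = totals
    # Empty bins divide by zero; silence the warning and let them come out as nan
    with np.errstate(divide='ignore', invalid='ignore'):
        means = s1 / n
        # Population variance E[p2^2] - E[p2]^2, clipped at 0 against round-off
        stds = np.sqrt(np.maximum(s2 / n - means**2, 0.0))
    return n, means, stds

def demo_chunks(n_chunks=4, size=10_000, seed=0):
    """Hypothetical stand-in for xs_generator(): yields random structured arrays."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        x = np.empty(size, dtype=[('p1', 'f8'), ('p2', 'f8')])
        x['p1'] = rng.random(size)
        x['p2'] = rng.normal(size=size)
        yield x

bin_edges = np.linspace(0, 1, 31)  # 31 edges give 30 bins
N, means_p2, stds_p2 = binned_mean_std(demo_chunks(), bin_edges)
```

Because the per-chunk sums simply add, bin_sums could also be mapped over chunks by separate worker processes (for example with multiprocessing.Pool.imap) and the returned arrays summed afterwards. Note that subtracting means_p2**2 from the mean of squares can lose precision when the spread of p2 is tiny compared to its mean.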
