How do I correctly implement apply_async for data processing?

import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import multiprocessing
from functools import partial

def fit_model(data, q):
    # data is a 1-D array holding precipitation values
    years = np.arange(1895, 2018, 1)
    res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
    pointEstimate = res.params[1]  # output slope of quantile q
    return pointEstimate

# precipAll is an array of shape (1405*621,123,12) (longitudes*latitudes,years,months)
# find all indices where there is data
nonNaN = np.where(~np.isnan(precipAll[:,0,0]))[0]  # 481631 indices
month = 4

# holder array for results
asyncResults = np.zeros((precipAll.shape[0])) * np.nan

def saveResult(result, pos):
    asyncResults[pos] = result

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=20)  # my server has 24 CPUs
    for i in nonNaN:
        # use partial so I can also pass the index i so the result is
        # stored in the expected position
        new_callback_function = partial(saveResult, pos=i)
        pool.apply_async(fit_model, args=(precipAll[i,:,month], 0.9), callback=new_callback_function)
    pool.close()
    pool.join()

1 answer

User

#1 · Posted 2024-09-29 06:35:08

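If you need to use the multiprocessing module, you will probably want to batch more rows into each task that you hand to the worker pool. For what you are doing, though, I'd suggest trying Ray because of its efficient handling of large numerical data.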
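If you stay with multiprocessing, the batching idea could look roughly like the sketch below. It is only an illustration under some assumptions: `fit_row` is a hypothetical stand-in (a plain least-squares slope) for the QuantReg fit in the question, and `make_batches`, `fit_batch`, and `run_batched` are names made up for this example. Each task carries the row indices with it, so the parent can store every result in the right position without a per-row callback.

```python
import multiprocessing


def fit_row(row):
    # Hypothetical stand-in for the per-row QuantReg fit in the question:
    # an ordinary least-squares slope of the values against 0, 1, 2, ...
    n = len(row)
    x_mean = (n - 1) / 2
    y_mean = sum(row) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(row))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def make_batches(indices, batch_size):
    # Split the row indices into consecutive chunks of at most batch_size.
    return [indices[k:k + batch_size] for k in range(0, len(indices), batch_size)]


def fit_batch(task):
    # One task fits a whole batch of rows and returns (index, slope) pairs,
    # so the parent can put each result back in its original position.
    batch_indices, batch_rows = task
    return [(i, fit_row(row)) for i, row in zip(batch_indices, batch_rows)]


def run_batched(rows, batch_size=2, processes=2):
    # Results start as NaN and are filled in as batches complete.
    results = [float("nan")] * len(rows)
    batches = make_batches(list(range(len(rows))), batch_size)
    tasks = [(batch, [rows[i] for i in batch]) for batch in batches]
    with multiprocessing.Pool(processes=processes) as pool:
        for pairs in pool.imap_unordered(fit_batch, tasks):
            for i, slope in pairs:
                results[i] = slope
    return results


if __name__ == "__main__":
    # Small synthetic input: row i is a straight line with slope i + 1.
    rows = [[float(j * (i + 1)) for j in range(10)] for i in range(6)]
    print(run_batched(rows))
```

With a few hundred thousand rows, a much larger `batch_size` (hundreds or thousands of rows per task) would presumably be needed before the per-task overhead stops dominating; the right value is something to measure rather than assume.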

import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import ray

@ray.remote
def fit_model(precip_all, i, month, q):
    data = precip_all[i,:,month]
    years = np.arange(1895, 2018, 1)
    res = QuantReg(exog=sm.add_constant(years), endog=data).fit(q=q)
    pointEstimate = res.params[1]
    return pointEstimate

if __name__ == '__main__':
    ray.init()

    # Create an array and place it in shared memory so that the workers can
    # access it (in a read-only fashion) without creating copies.
    precip_all = np.zeros((100, 123, 12))
    precip_all_id = ray.put(precip_all)

    result_ids = []
    for i in range(precip_all.shape[0]):
        result_ids.append(fit_model.remote(precip_all_id, i, 4, 0.9))

    results = np.array(ray.get(result_ids))

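Some notes

The example above should run out of the box, but note that I simplified the logic a bit. In particular, I removed the handling of NaNs.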
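The NaN handling that was dropped comes down to two steps: fit only the rows that contain data, and write each result back to its original position in a NaN-filled output. Below is a minimal dependency-free sketch of that pattern. `fit_row` is a hypothetical stand-in for the real fit, and `fit_rows_with_data` is a name invented for this example; with NumPy, the index selection would be the `np.where(~np.isnan(...))` line from the question.

```python
import math


def fit_row(row):
    # Hypothetical stand-in for the real per-row fit; here just the mean.
    return sum(row) / len(row)


def fit_rows_with_data(rows):
    # A row whose first value is NaN is treated as "no data", mirroring the
    # question's check on precipAll[:, 0, 0].
    has_data = [i for i, row in enumerate(rows) if not math.isnan(row[0])]

    # Pre-fill the output with NaN so that skipped rows stay NaN.
    results = [float("nan")] * len(rows)
    for i in has_data:
        results[i] = fit_row(rows[i])
    return results
```

In a parallel version, the loop over `has_data` is the part that would be handed to the workers, and the scatter back into `results` would stay in the parent process.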

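On my laptop with 4 physical cores, this takes about 4 seconds. If you use 20 cores instead and make the data 9000 times larger, I'd expect it to take about 7200 seconds (4 s × 9000 × 4/20, assuming the work scales linearly), which is quite a long time. One possible way to speed this up is to use more machines, or to process multiple rows in each call to fit_model in order to amortize some of the overhead.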

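The example above actually passes the entire precip_all matrix into each task. This is fine because each fit_model task only has read access to a copy of the matrix stored in shared memory, so it does not need to create its own local copy. The call to ray.put(precip_all) places the array in shared memory once, up front.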

There is more to read about the differences between Ray and Python multiprocessing. Note that I'm helping develop Ray.
