<p>Thanks for taking the time to answer me! I came up with another solution that I would like to share. <a href="http://yaoyao.codes/pandas/2018/01/23/pandas-split-a-dataframe-into-chunks" rel="nofollow noreferrer">YaoYao</a> inspired me to try the multiprocessing library for my problem, and the solution below runs very fast. The script first reads the .csv file, determines the names of the numeric columns, and computes their column sums. It then splits the dataset into equally sized chunks and maps the normalization function over each chunk. The resulting normalized DataFrames are concatenated back together to produce the full normalized dataset.</p>
<pre><code>import math
import numpy as np
import pandas as pd
from multiprocessing import Pool

def index_marks(nrows, chunk_size):
    # Boundary indices at which to cut the frame into chunk_size-row pieces
    return range(chunk_size, math.ceil(nrows / chunk_size) * chunk_size, chunk_size)

def split(dfm, chunk_size):
    indices = index_marks(dfm.shape[0], chunk_size)
    return np.split(dfm, indices)

def normalize(df, sums, numeric_cols):
    # Insert a normalized column (col + '_norm') right after each numeric column:
    # counts-per-million-style scaling plus a pseudocount of 1
    for col in numeric_cols:
        df.insert(df.columns.get_loc(col) + 1, col + '_norm',
                  ((df[col] / sums[col]) * 1000000) + 1)
    return df

def parallel_normalize(data):
    # Unpack the [chunk, column sums, numeric column names] triple sent to the worker
    df, sums, numeric_cols = data
    return normalize(df, sums, numeric_cols)

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
dataset_file = '/path/to/file.csv'
cores = 3

# Guard so worker processes do not re-run the top-level code when the
# multiprocessing start method is 'spawn' (Windows/macOS)
if __name__ == '__main__':
    dataset = pd.read_csv(dataset_file, index_col=['Guide1', 'Guide2'], sep=',')
    dataset = dataset.fillna(0.0)
    numeric_cols = [col for col in dataset.columns if dataset[col].dtype in numerics]
    sums = dataset[numeric_cols].sum(axis=0, skipna=True)
    chunks = split(dataset, int(round(dataset.shape[0] / cores)))
    pool = Pool(cores)
    dataset = pd.concat(pool.map(parallel_normalize,
                                 [[x, sums, numeric_cols] for x in chunks]))
    pool.close()
    pool.join()
</code></pre>
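<p>As a quick sanity check of the chunking logic, here is a minimal sketch (on a made-up 10-row frame; the column name is arbitrary) showing which boundary indices <code>index_marks</code> produces and how <code>split</code> turns them into chunks, including the smaller remainder chunk:</p>
<pre><code>import math
import numpy as np
import pandas as pd

def index_marks(nrows, chunk_size):
    return range(chunk_size, math.ceil(nrows / chunk_size) * chunk_size, chunk_size)

def split(dfm, chunk_size):
    return np.split(dfm, index_marks(dfm.shape[0], chunk_size))

# Hypothetical 10-row frame: chunk_size=4 cuts at indices 4 and 8,
# yielding chunks of 4, 4 and 2 rows
toy = pd.DataFrame({'a': range(10)})
print(list(index_marks(10, 4)))         # [4, 8]
print([len(c) for c in split(toy, 4)])  # [4, 4, 2]
</code></pre>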
<p>However, I don't understand why it is this fast. Since I split the data into 3 chunks, I expected a speedup of roughly 60%, but it runs virtually instantaneously. If anyone could comment here and offer some insight, I would be very glad.</p>
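<p>In case anyone wants to reproduce the comparison, here is a minimal timing sketch, assuming the helper functions from the script above are defined and <code>dataset</code> still holds the raw, not-yet-normalized frame (<code>time.perf_counter</code> is just one way to measure this):</p>
<pre><code>import time

# Assumes dataset, sums, numeric_cols, cores, normalize, split,
# parallel_normalize, pd and Pool from the script above are in scope

t0 = time.perf_counter()
serial = normalize(dataset.copy(), sums, numeric_cols)   # single-process run
t1 = time.perf_counter()

chunks = split(dataset, int(round(dataset.shape[0] / cores)))
with Pool(cores) as pool:                                # multi-process run
    parallel = pd.concat(pool.map(parallel_normalize,
                                  [[c, sums, numeric_cols] for c in chunks]))
t2 = time.perf_counter()

print(f'serial:   {t1 - t0:.3f} s')
print(f'parallel: {t2 - t1:.3f} s')
</code></pre>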