<p>Below is a performance comparison of the three most upvoted answers, run in a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:</p>
<pre><code>from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
</code></pre>
<h2><code>io.mmwrite</code>/<code>io.mmread</code></h2>
<pre><code>from scipy.sparse import io
%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s
%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>
Filesize: 3.0G.
</code></pre>
<p>(Note that the format has changed from CSR to COO.)</p>
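<p>If you need the CSR format back after <code>io.mmread</code>, a <code>.tocsr()</code> call restores it. A minimal sketch (using a small matrix for speed; the round-trip through COO stands in for what <code>mmread</code> returns):</p>
<pre><code>from scipy.sparse import random

m = random(100, 50, density=0.01, format='csr')
coo = m.tocoo()    # mmread returns a matrix in COO format
csr = coo.tocsr()  # convert back to CSR
print(csr.format)  # -> 'csr'
</code></pre>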
<h2><code>np.savez</code>/<code>np.load</code></h2>
<pre><code>import numpy as np
from scipy.sparse import csr_matrix
def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)
def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s
%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
Filesize: 1.1G.
</code></pre>
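<p>As a side note, SciPy 0.19+ ships <code>scipy.sparse.save_npz</code>/<code>load_npz</code>, which wrap the same <code>.npz</code> approach without the hand-written helpers (compression is on by default, which trades some speed for file size). A minimal sketch with a small matrix:</p>
<pre><code>import scipy.sparse as sp

m = sp.random(100, 50, density=0.01, format='csr')
sp.save_npz('test_scipy.npz', m)   # compressed=True by default
m2 = sp.load_npz('test_scipy.npz')
print(m2.format, m2.shape)         # -> csr (100, 50)
</code></pre>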
<h2><code>cPickle</code></h2>
<pre><code>import cPickle as pickle
def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)

def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)
    return matrix
%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s
%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s
matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>
Filesize: 1.1G.
</code></pre>
<p><strong>Note</strong>: cPickle does not work with very large objects (see <a href="https://stackoverflow.com/a/38246020/304209">this answer</a>).
In my experience, it didn't work for a 2.7M x 500K matrix with 270M non-zero values.
The <code>np.savez</code> solution worked well.</p>
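<p>On Python 3, plain <code>pickle</code> already uses the C implementation, and protocol 4 (available since Python 3.4) lifts the 4 GB object size limit that bites on very large matrices. A sketch of the same round-trip in Python 3 (small matrix for illustration):</p>
<pre><code>import pickle
from scipy.sparse import random

m = random(100, 50, density=0.01, format='csr')
with open('test_pickle3.pkl', 'wb') as f:
    pickle.dump(m, f, protocol=4)  # protocol 4 supports objects > 4 GB
with open('test_pickle3.pkl', 'rb') as f:
    m2 = pickle.load(f)
print(m2.format, m2.shape)
</code></pre>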
<h2>Conclusion</h2>
<p>(based on this simple test with a CSR matrix)
<code>cPickle</code> is the fastest method, but it doesn't work with very large matrices; <code>np.savez</code> is only slightly slower, while <code>io.mmwrite</code> is much slower, produces a bigger file, and restores the matrix to the wrong format. So <code>np.savez</code> is the winner here.</p>