Save/load a scipy sparse csr_matrix in a portable data format

Published 2024-09-26 22:53:09


How do you save/load a scipy sparse csr_matrix in a portable format? The scipy sparse matrix is created on Python 3 (Windows 64-bit) and needs to run on Python 2 (Linux 64-bit). Initially, I used pickle (with protocol=2 and fix_imports=True), but it failed going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit), with the error:

TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).

Next, I tried numpy.save/numpy.load as well as scipy.io.mmwrite()/scipy.io.mmread(), and none of these methods worked either.


3 Answers

Here is a performance comparison of the three top-voted answers, run in a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:

from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

io.mmwrite/io.mmread

from scipy import io

%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s

%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>    

Filesize: 3.0G.

(Note that the format has changed from csr to coo.)
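Since mmread returns a COO matrix even when a CSR matrix was written, a call to .tocsr() is needed after loading to get back the original format. A minimal round-trip sketch (the filename is arbitrary):

```python
from scipy import sparse, io

# mmwrite stores the matrix in Matrix Market format; mmread always
# returns a COO matrix, so convert back to CSR explicitly.
m = sparse.csr_matrix([[0, 0, 1], [2, 0, 0]])
io.mmwrite('test_roundtrip.mtx', m)
restored = io.mmread('test_roundtrip.mtx').tocsr()
print(restored.format)  # 'csr'
```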

np.savez/np.load

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])


%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s    

%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

cPickle

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)

def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)
    return matrix

%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s    

%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

Note: cPickle does not work with very large objects (see this answer). In my experience, it did not work for a 2.7M x 500K matrix with 270M non-zero values. The np.savez solution worked well.
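On Python 3, the size limitation can be avoided by pickling with protocol 4 (available since Python 3.4), which supports objects larger than 4 GB. A sketch, assuming a Python 3-only environment (the filename is arbitrary):

```python
import pickle
from scipy import sparse

def save_pickle_p4(matrix, filename):
    # Protocol 4 supports very large objects, unlike protocol 2,
    # which is the highest protocol available to Python 2's cPickle.
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, protocol=4)

m = sparse.random(100, 50, density=0.1, format='csr')
save_pickle_p4(m, 'test_p4.pkl')
with open('test_p4.pkl', 'rb') as f:
    m2 = pickle.load(f)
print(m2.format)  # 'csr'
```

Note that protocol 4 pickles cannot be read by Python 2, so this does not solve the original cross-version portability problem, only the size limit.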

Conclusion

(Based on this simple test with a CSR matrix:) cPickle is the fastest method, but it does not work with very large matrices; np.savez is only slightly slower, while io.mmwrite is much slower, produces a larger file, and restores the matrix in the wrong format. So np.savez is the winner here.

Even though you wrote that scipy.io.mmwrite and scipy.io.mmread did not work for you, I just want to add how they do work. This question is the number-one Google hit, so before switching to these simple and obvious scipy functions I myself started with np.savez and pickle.dump. They worked for me and should not be overlooked by those who haven't tried them yet.

from scipy import sparse, io

m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
m              # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

io.mmwrite("test.mtx", m)
del m

newm = io.mmread("test.mtx")
newm           # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
newm.tocsr()   # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)

Edit: SciPy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz:

from scipy import sparse

sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument may also be a file-like object (i.e., the result of open) instead of a filename.
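A minimal sketch of the file-like-object variant, using an in-memory BytesIO buffer in place of a file on disk:

```python
import io

from scipy import sparse

# save_npz/load_npz accept any open binary file-like object,
# so the matrix can be serialized to an in-memory buffer.
m = sparse.csr_matrix([[0, 1], [2, 0]])
buf = io.BytesIO()
sparse.save_npz(buf, m)
buf.seek(0)
m_back = sparse.load_npz(buf)
print(m_back.format)  # 'csr'
```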


Answer from the Scipy user group:

A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with:

new_csr = csr_matrix((data, indices, indptr), shape=(M, N))

For example:

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
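A round-trip usage sketch of these helpers (the filename is arbitrary; recall that np.savez appends .npz to the name it is given, so the loader must be passed the full .npz filename):

```python
import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # np.savez appends '.npz' to filename automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

m = csr_matrix(np.eye(3))
save_sparse_csr('test_helper', m)
m2 = load_sparse_csr('test_helper.npz')
print((m2.toarray() == np.eye(3)).all())  # True
```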
