<p>Here's a NumPy-based approach that builds a sparse matrix with <a href="http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.coo_matrix.html" rel="nofollow"><code>coo_matrix</code></a>, with memory efficiency in focus -</p>
<pre><code>import numpy as np
from scipy.sparse import coo_matrix

# Construct row IDs: one entry per (ID, value) pair, holding its sub-list index
lens = np.array([len(item) for item in dataset])
shifts_arr = np.zeros(lens.sum(),dtype=int)
shifts_arr[lens[:-1].cumsum()] = 1
row = shifts_arr.cumsum()
# Extract values from dataset into a 2D NumPy array of (ID, value) rows
arr = np.concatenate(dataset)
# Map each ID to a column index via its position among the sorted unique IDs
col = np.unique(arr[:,0],return_inverse=True)[1]
# Determine the output shape
out_shp = (row.max()+1,col.max()+1)
# Finally create a sparse matrix with the row,col indices and col-2 of arr
sp_out = coo_matrix((arr[:,1],(row,col)), shape=out_shp)
</code></pre>
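<p>To make the row-ID construction easier to follow, here is a small worked sketch of the intermediate arrays. The lengths <code>[2, 3, 2]</code> are taken from the sample dataset; <code>row_alt</code> and its use of <code>np.repeat</code> are an illustrative alternative, not part of the code above.</p>

```python
import numpy as np

# Lengths of the sub-lists in the sample dataset: 2, 3 and 2 pairs
lens = np.array([2, 3, 2])

# Mark the flat position where each new sub-list starts (all but the first)
shifts_arr = np.zeros(lens.sum(), dtype=int)
shifts_arr[lens[:-1].cumsum()] = 1   # sub-lists start at flat positions 2 and 5

# A running sum turns the start markers into row IDs
row = shifts_arr.cumsum()

# Alternative (assumption, not from the original code):
# repeat each row index as many times as its sub-list is long
row_alt = np.repeat(np.arange(len(lens)), lens)

print(shifts_arr)  # [0 0 1 0 0 1 0]
print(row)         # [0 0 1 1 1 2 2]
print(row_alt)     # [0 0 1 1 1 2 2]
```

<p>Both give the same row IDs here. The marker-based version assumes every sub-list is non-empty; if empty sub-lists can occur, the <code>np.repeat</code> form is probably the safer choice.</p>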
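<p>Please note that if the <code>IDs</code> are supposed to be the column numbers in the output array, you could replace the use of <code>np.unique</code>, which gives us such unique IDs, with something like this -</p>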
^{pr2}$
<p>This should give us a good performance boost, since it skips the sorting that <code>np.unique</code> performs!</p>
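<p>As a minimal sketch of computing the column indices directly from the IDs: the code below assumes the IDs are 1-based integer column numbers (as they appear to be in the sample data), so ID <code>k</code> lands in column <code>k-1</code>. The names <code>col_direct</code>, <code>col_unique</code> and <code>sp_direct</code> are hypothetical and used only for this illustration.</p>

```python
import numpy as np
from scipy.sparse import coo_matrix

dataset = [[(1, 0.13), (2, 2.05)],
           [(2, 0.23), (4, 7.35), (5, 5.60)],
           [(2, 0.61), (3, 4.45)]]

# Row IDs, built as in the code above
lens = np.array([len(item) for item in dataset])
shifts_arr = np.zeros(lens.sum(), dtype=int)
shifts_arr[lens[:-1].cumsum()] = 1
row = shifts_arr.cumsum()

# Flatten the dataset into a 2D array of (ID, value) rows
arr = np.concatenate(dataset)

# Assumption: IDs are 1-based integer column numbers, so ID k maps to column k-1
col_direct = arr[:, 0].astype(int) - 1

# For comparison: the np.unique-based column indices
col_unique = np.unique(arr[:, 0], return_inverse=True)[1]

out_shp = (row.max() + 1, col_direct.max() + 1)
sp_direct = coo_matrix((arr[:, 1], (row, col_direct)), shape=out_shp)

print(np.array_equal(col_direct, col_unique))  # True
```

<p>The two agree on this data because every ID from 1 to 5 occurs at least once. If some ID were missing, <code>np.unique</code> would pack the remaining IDs into consecutive columns, while the direct version would leave an all-zero column at that position.</p>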
<p>Sample run -</p>
<pre><code>In [264]: dataset = [[(1, 0.13), (2, 2.05)],
     ...:            [(2, 0.23), (4, 7.35), (5, 5.60)],
     ...:            [(2, 0.61), (3, 4.45)]]

In [265]: sp_out.todense() # Using .todense() to show output
Out[265]:
matrix([[ 0.13,  2.05,  0.  ,  0.  ,  0.  ],
        [ 0.  ,  0.23,  0.  ,  7.35,  5.6 ],
        [ 0.  ,  0.61,  4.45,  0.  ,  0.  ]])
</code></pre>