<p>Another option is to use the constructor
<code>csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])</code> from <a href="https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html" rel="noreferrer">scipy.sparse.csr_matrix</a>, where <code>data</code>, <code>row_ind</code> and <code>col_ind</code> satisfy
the relationship <code>a[row_ind[k], col_ind[k]] = data[k]</code>.</p>
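<p>To make the constructor's triplet format concrete, here is a minimal sketch with three nonzero entries. The values and the 2x3 shape are made up purely for illustration.</p>

```python
import numpy as np
from scipy.sparse import csr_matrix

# Three nonzero entries: a[0, 0] = 1, a[0, 2] = 2, a[1, 1] = 3
data = np.array([1, 2, 3])
row_ind = np.array([0, 0, 1])
col_ind = np.array([0, 2, 1])

a = csr_matrix((data, (row_ind, col_ind)), shape=(2, 3))
dense = a.toarray()  # expand to a dense array only for inspection
print(dense)
```

<p>Every position that is not listed in <code>(row_ind, col_ind)</code> is an implicit zero and takes no storage.</p>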
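<p>The trick is to generate <code>row_ind</code> and <code>col_ind</code> by iterating over the documents and building a list of (doc-id, word-id) tuples. <code>data</code> is then simply a vector of ones of the same length.</p>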
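<p>The pair-building step can be sketched in isolation as follows. The toy <code>documents_as_ids</code> list is an assumption: it stands for documents whose words have already been mapped to integer ids.</p>

```python
import numpy as np

# Toy input (assumed): each document is a list of word ids.
documents_as_ids = [[0, 1], [1, 2], [0, 1, 2, 3]]

# One (doc-id, word-id) pair per word occurrence.
pairs = [(doc_id, word_id)
         for doc_id, doc in enumerate(documents_as_ids)
         for word_id in doc]

# Split the pairs into the row and column index sequences.
row_ind, col_ind = zip(*pairs)

# Every occurrence contributes a count of one.
data = np.ones(len(pairs), dtype='uint32')
```

<p>If a word occurs more than once in a document, the same pair appears more than once, and <code>csr_matrix</code> sums the duplicates, so that cell holds the occurrence count.</p>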
<p>Multiplying the transpose of the document-word matrix by the matrix itself yields the word co-occurrence matrix.</p>
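<p>The reason this works: entry (i, j) of the product is the sum, over all documents, of the count of word i times the count of word j, so it counts the documents in which both words appear when each count is 0 or 1. A small sketch with a made-up 2-document, 3-word matrix <code>X</code>:</p>

```python
import numpy as np
from scipy.sparse import csr_matrix

# Rows are documents, columns are words (toy data, assumed for illustration).
# doc 0 contains words 0 and 1; doc 1 contains words 1 and 2.
X = csr_matrix(np.array([[1, 1, 0],
                         [0, 1, 1]]))

# (words x docs) @ (docs x words) gives a words x words matrix.
cooc = (X.T @ X).toarray()
print(cooc)
```

<p>The diagonal holds each word's own count (here the number of documents containing it), which is why the answer's function zeroes it out with <code>setdiag(0)</code>.</p>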
<p>Additionally, this is efficient in terms of both run time and memory usage, so it should also handle large corpora.</p>
<pre><code>import itertools

import numpy as np
from scipy.sparse import csr_matrix


def create_co_occurences_matrix(allowed_words, documents):
    print(f"allowed_words:\n{allowed_words}")
    print(f"documents:\n{documents}")
    # Map each allowed word to an integer id (its row/column in the result).
    word_to_id = dict(zip(allowed_words, range(len(allowed_words))))
    # Convert each document to a sorted array of word ids, dropping words that are not allowed.
    documents_as_ids = [np.sort([word_to_id[w] for w in doc if w in word_to_id]).astype('uint32') for doc in documents]
    # One (doc-id, word-id) pair per word occurrence.
    row_ind, col_ind = zip(*itertools.chain(*[[(i, w) for w in doc] for i, doc in enumerate(documents_as_ids)]))
    data = np.ones(len(row_ind), dtype='uint32')  # use unsigned int for better memory utilization
    # Size the columns by the vocabulary so that every id in word_to_id is a valid index,
    # even for allowed words that never occur in the documents.
    n_words = len(word_to_id)
    docs_words_matrix = csr_matrix((data, (row_ind, col_ind)), shape=(len(documents_as_ids), n_words))  # efficient arithmetic operations with CSR * CSR
    # Multiplying the transpose of docs_words_matrix by the matrix itself gives the co-occurrence matrix.
    words_cooc_matrix = docs_words_matrix.T @ docs_words_matrix
    words_cooc_matrix.setdiag(0)  # a word's co-occurrence with itself is not of interest
    print(f"words_cooc_matrix:\n{words_cooc_matrix.todense()}")
    return words_cooc_matrix, word_to_id
</code></pre>
<p>Example run:</p>
<pre><code>allowed_words = ['A', 'B', 'C', 'D']
documents = [['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]
words_cooc_matrix, word_to_id = create_co_occurences_matrix(allowed_words, documents)
</code></pre>
<p>Output:</p>
<pre><code>allowed_words:
['A', 'B', 'C', 'D']
documents:
[['A', 'B'], ['C', 'B', 'K'], ['A', 'B', 'C', 'D', 'Z']]
words_cooc_matrix:
[[0 2 1 1]
 [2 0 2 1]
 [1 2 0 1]
 [1 1 1 0]]
</code></pre>