I need to compute cosine similarity on a scipy.sparse.csr.csr_matrix. With the most straightforward sklearn implementation I run into memory errors for larger matrix shapes, and even for smaller shapes performance is poor, with CPU load never exceeding 25%. So I would like to make this faster and more robust, so that it also works for larger datasets.
I found a very good resource about the speed issue. I removed all versions except the fastest one from the original post and added my simple sklearn implementation as "method 2". I confirmed that on my machine (32 GB RAM, Win 10, Python 3.6.4), "method 1" takes only about 4% of the time that "method 2" needs on the dataset constructed in the code. Here is the code adapted from zbinsd:
# Code adapted from zbinsd @ https://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat?rq=1
# Imports
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import squareform, pdist
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
# Create an adjacency matrix
np.random.seed(42)
A = np.random.randint(0, 2, (10000, 100)).astype(float).T
# Make it sparse
rows, cols = np.where(A)
data = np.ones(len(rows))
Asp = sp.csr_matrix((data, (rows, cols)), shape = (rows.max()+1, cols.max()+1))
print("Input data shape:", Asp.shape)
# Define a function to calculate the cosine similarities a few different ways
def calc_sim(A, method=1):
    if method == 1:
        similarity = np.dot(A, A.T)
        # squared magnitude of preference vectors (number of occurrences)
        square_mag = np.diag(similarity)
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag
        # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0
        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag)
        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity * inv_mag
        return cosine.T * inv_mag
    if method == 2:
        return cosine_similarity(A)
# Assert that all results are consistent with the first model ("truth")
for m in range(1, 3):
    if m in [2]:  # The sparse case
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    else:
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))
# Time them:
print("Method 1")
%timeit calc_sim(A, method=1)
print("Method 2")
%timeit calc_sim(A, method=2)
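For reference, the idea behind method 1 (normalize the rows, then one matrix product) can also be written for sparse input without densifying anything. This is just a sketch on random stand-in data, using sklearn's `normalize`, which L2-normalizes the rows and keeps the matrix in CSR format:

```python
import scipy.sparse as sp
from sklearn.preprocessing import normalize

def calc_sim_sparse(X):
    # Row-wise L2 normalization; zero rows stay zero and the matrix stays sparse
    Xn = normalize(X)
    # Cosine similarity is then just the Gram matrix of the normalized rows
    return (Xn @ Xn.T).toarray()

# Small demo on random sparse data
X = sp.random(100, 1000, density=0.1, format='csr', random_state=42)
S = calc_sim_sparse(X)
print(S.shape)  # (100, 100)
```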
I also found a good resource about the memory issue, but it turned out I had already followed icm's advice there and am only using unique entries, so I don't see how to improve further along those lines.
Moving on to my original data, which comes from sklearn's CountVectorizer:
TFvectorizer = CountVectorizer(lowercase=False, tokenizer=log_tokenizer, ngram_range=(1,1))
TF = TFvectorizer.fit_transform(unique_msgs)
all_msgs_vect = TFvectorizer.transform(all_msgs)
In [1]: type(all_msgs_vect)
Out[1]: scipy.sparse.csr.csr_matrix
In [2]: all_msgs_vect.shape
Out[2]: (5000, 529)
# Method 1
In [3]: start = datetime.now()
...: print(datetime.now())
...: msg_CosSim = cosine_similarity(all_msgs_vect)
...: print('Method 1 took', datetime.now() - start)
2019-09-09 10:44:33.039660
Method 1 took 0:00:00.117537
# Method 2
In [4]: start = datetime.now()
...: similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
...: square_mag = np.diag(similarity)
...: inv_square_mag = 1 / square_mag
...: inv_square_mag[np.isinf(inv_square_mag)] = 0
...: inv_mag = np.sqrt(inv_square_mag)
...: cosine = similarity * inv_mag
...: msg_CosSim2 = cosine.T * inv_mag
...: print('Method 2 took', datetime.now() - start)
Method 2 took 0:00:08.399767
__main__:4: RuntimeWarning: divide by zero encountered in true_divide
Do you know why his proposed method behaves differently with my data than in zbinsd's example, and is actually slower here? And do you have any idea how I could put the idle 75% of CPU resources to use? With the full dataset, both methods fail:
In [2]: all_msgs_vect.shape
Out[2]: (1063867, 3128)
In [3]: start = datetime.now()
...: msg_CosSim = cosine_similarity(all_msgs_vect)
...: print('Method 1 took', datetime.now() - start)
...:
2019-09-09 11:13:53.808270
Traceback (most recent call last):
File "<ipython-input-3-11dcc36bb82a>", line 3, in <module>
msg_CosSim = cosine_similarity(all_msgs_vect)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\sklearn\metrics\pairwise.py", line 925, in cosine_similarity
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\sklearn\utils\extmath.py", line 135, in safe_sparse_dot
ret = a * b
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\base.py", line 440, in __mul__
return self._mul_sparse_matrix(other)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\compressed.py", line 502, in _mul_sparse_matrix
indices = np.empty(nnz, dtype=idx_dtype)
MemoryError
In [4]: start = datetime.now()
...: similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
...: square_mag = np.diag(similarity)
...: inv_square_mag = 1 / square_mag
...: inv_square_mag[np.isinf(inv_square_mag)] = 0
...: inv_mag = np.sqrt(inv_square_mag)
...: cosine = similarity * inv_mag
...: msg_CosSim2 = cosine.T * inv_mag
...: print('Method 2 took', datetime.now() - start)
Traceback (most recent call last):
File "<ipython-input-4-070750736bc5>", line 2, in <module>
similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
MemoryError
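For scale, a quick back-of-the-envelope check (my own arithmetic, not part of the original post): a dense float64 similarity matrix for 1,063,867 rows would need roughly 9 TB, so no machine-side tuning can materialize the full result at once:

```python
n = 1_063_867              # number of rows in all_msgs_vect
entries = n * n            # the full similarity matrix is n x n
dense_bytes = entries * 8  # float64 = 8 bytes per entry
print(dense_bytes / 1e12)  # roughly 9 TB
```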
Do you know how I can work within the available memory for data of this size? I have a vague feeling that .toarray() in "method 2" is the problem, but how do I avoid it? Simply removing it doesn't solve the memory problem, and I'm also not sure the matrix dot product still works correctly in that case:
In [5]: similarity = np.dot(all_msgs_vect, all_msgs_vect.T)
Traceback (most recent call last):
File "<ipython-input-5-e006c93b9bfd>", line 1, in <module>
similarity = np.dot(all_msgs_vect, all_msgs_vect.T)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\base.py", line 440, in __mul__
return self._mul_sparse_matrix(other)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\compressed.py", line 502, in _mul_sparse_matrix
indices = np.empty(nnz, dtype=idx_dtype)
MemoryError
I hope I have given enough information about my original data, since I can't really upload it here; if not, please let me know! Many thanks for any input...
Thanks, Mark