Fast, memory-error-free computation of cosine similarity for large datasets

Published 2024-09-28 16:22:07


I need to compute cosine similarity on a scipy.sparse.csr.csr_matrix. With the most straightforward sklearn implementation I run into memory errors for larger matrix shapes. Even for smaller shapes the performance is not great, and CPU load never exceeds 25%. So I would like to make it faster and more robust, and have it work for larger datasets as well.

I found a great resource about the speed issue, stripped the original post down to the fastest version only, and added my simple sklearn implementation as "Method 2". I confirmed that on my machine (32 GB RAM, Win10, Python 3.6.4), "Method 1" takes only about 4% of the time "Method 2" needs on the dataset constructed in the code. Below is the code adapted from zbinsd:

# Code adapted from zbinsd @ https://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat?rq=1

# Imports
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import squareform, pdist
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Create an adjacency matrix
np.random.seed(42)
A = np.random.randint(0, 2, (10000, 100)).astype(float).T

# Make it sparse
rows, cols = np.where(A)
data = np.ones(len(rows))
Asp = sp.csr_matrix((data, (rows, cols)), shape = (rows.max()+1, cols.max()+1))

print("Input data shape:", Asp.shape)

# Define a function to calculate the cosine similarities a few different ways
def calc_sim(A, method=1):
    if method == 1:
        similarity = np.dot(A, A.T)
        # squared magnitude of preference vectors (number of occurrences)
        square_mag = np.diag(similarity)
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag
        # if a row never occurs, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0
        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag)
        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity * inv_mag
        return cosine.T * inv_mag
    if method == 2:
        return cosine_similarity(A)  # use the argument, not the global Asp

# Assert that all results are consistent with the first model ("truth")
for m in range(1, 3):
    if m in [2]: # The sparse case
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    else:
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))

# Time them:
print("Method 1")
%timeit calc_sim(A, method=1)
print("Method 2")
%timeit calc_sim(A, method=2)
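For reference, sklearn's cosine_similarity on sparse input is essentially row normalization followed by one sparse matrix product, so the whole computation can stay sparse until the end. A minimal sketch on toy data (the matrix X here is mine, not from the post):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Toy sparse input standing in for the real data
X = sp.random(100, 20, density=0.3, format="csr", random_state=0)

# L2-normalize each row, then one sparse product yields cosine similarity
Xn = normalize(X, norm="l2", axis=1)
cos = Xn @ Xn.T              # result is still sparse
cos_dense = cos.toarray()    # densify only when needed

# Matches sklearn's one-call version
ref = cosine_similarity(X)
assert np.allclose(cos_dense, ref)
```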

I also found a good resource about the memory issue, but it turns out I had already taken icm's suggestion into account and used only unique entries, so I don't know how to improve further.

Moving on to my original data, which comes from sklearn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

TFvectorizer = CountVectorizer(lowercase=False, tokenizer=log_tokenizer, ngram_range=(1,1))
TF = TFvectorizer.fit_transform(unique_msgs)
all_msgs_vect = TFvectorizer.transform(all_msgs)

Two questions remain:

Question #1: On a small sample of my original dataset, Method 1 is faster than Method 2, but neither actually uses more than 25% of the CPU:

In [1]: type(all_msgs_vect)
Out[1]: scipy.sparse.csr.csr_matrix

In [2]: all_msgs_vect.shape
Out[2]: (5000, 529)


# Method 1
In [3]: start = datetime.now()
   ...: print(datetime.now())
   ...: msg_CosSim = cosine_similarity(all_msgs_vect)
   ...: print('Method 1 took', datetime.now() - start)
2019-09-09 10:44:33.039660
Method 1 took 0:00:00.117537

# Method 2
In [4]: start = datetime.now()
   ...: similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
   ...: square_mag = np.diag(similarity)
   ...: inv_square_mag = 1 / square_mag
   ...: inv_square_mag[np.isinf(inv_square_mag)] = 0
   ...: inv_mag = np.sqrt(inv_square_mag)
   ...: cosine = similarity * inv_mag
   ...: msg_CosSim2 = cosine.T * inv_mag
   ...: print('Method 2 took', datetime.now() - start)
Method 2 took 0:00:08.399767
__main__:4: RuntimeWarning: divide by zero encountered in true_divide
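As an aside, the RuntimeWarning above comes from 1 / square_mag when a message vector is all zeros; the isinf fixup on the next line makes the result correct anyway, and the warning itself can be silenced with np.errstate. A small sketch with toy magnitudes (the values are mine):

```python
import numpy as np

square_mag = np.array([4.0, 0.0, 9.0])   # toy values; the 0.0 is an empty row

with np.errstate(divide="ignore"):       # suppress the divide-by-zero warning
    inv_square_mag = 1.0 / square_mag

# Empty rows get inverse magnitude 0 instead of inf, as in the code above
inv_square_mag[np.isinf(inv_square_mag)] = 0.0
```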

Any idea why zbinsd's proposed method behaves differently on my data than in his example, and is actually slower here? And any idea how I can put the idle 75% of CPU resources to use?

Question #2: On a large sample of my original data, I run into memory errors with both methods, where "Method 1" never exceeds about 20% memory load and "Method 2" quickly peaks at about 60% before raising the error:

In [2]: all_msgs_vect.shape
Out[2]: (1063867, 3128)

In [3]: start = datetime.now()
   ...: msg_CosSim = cosine_similarity(all_msgs_vect)
   ...: print('Method 1 took', datetime.now() - start)
   ...: 
2019-09-09 11:13:53.808270
Traceback (most recent call last):

  File "<ipython-input-3-11dcc36bb82a>", line 3, in <module>
    msg_CosSim = cosine_similarity(all_msgs_vect)

  File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\sklearn\metrics\pairwise.py", line 925, in cosine_similarity
    K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)

  File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\sklearn\utils\extmath.py", line 135, in safe_sparse_dot
    ret = a * b

  File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\base.py", line 440, in __mul__
    return self._mul_sparse_matrix(other)

  File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\compressed.py", line 502, in _mul_sparse_matrix
    indices = np.empty(nnz, dtype=idx_dtype)

MemoryError


In [4]: start = datetime.now()
   ...: similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
   ...: square_mag = np.diag(similarity)
   ...: inv_square_mag = 1 / square_mag
   ...: inv_square_mag[np.isinf(inv_square_mag)] = 0
   ...: inv_mag = np.sqrt(inv_square_mag)
   ...: cosine = similarity * inv_mag
   ...: msg_CosSim2 = cosine.T * inv_mag
   ...: print('Method 2 took', datetime.now() - start)
Traceback (most recent call last):

  File "<ipython-input-4-070750736bc5>", line 2, in <module>
    similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)

MemoryError

Any idea how I can make use of all available memory for data volumes like this? I have a vague feeling that .toarray() is the problem in "Method 2", but how can I avoid it? Simply dropping it does not solve the memory problem, and I am not sure the matrix dot product still works correctly in that case:

In [5]: similarity = np.dot(all_msgs_vect, all_msgs_vect.T)
Traceback (most recent call last):

  File "<ipython-input-5-e006c93b9bfd>", line 1, in <module>
    similarity = np.dot(all_msgs_vect, all_msgs_vect.T)

  File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\base.py", line 440, in __mul__
    return self._mul_sparse_matrix(other)

  File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\compressed.py", line 502, in _mul_sparse_matrix
    indices = np.empty(nnz, dtype=idx_dtype)

MemoryError
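Not from the post, but one common way around this MemoryError, assuming the full dense similarity matrix is never needed in memory at once, is to normalize the rows once and compute the similarity slab by slab, consuming each slab (keeping only the best matches, or writing it to disk) before moving to the next. A sketch:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

def cosine_sim_blocks(X, block_size=1000):
    """Yield (row_start, slab) where slab holds the cosine similarities
    of rows [row_start, row_start + block_size) against all rows of X."""
    Xn = normalize(X, norm="l2", axis=1)             # one sparse normalization pass
    for start in range(0, Xn.shape[0], block_size):
        slab = Xn[start:start + block_size] @ Xn.T   # small sparse product
        yield start, slab.toarray()                  # densify only this slab

# Toy check: reassembling the slabs reproduces the direct computation
X = sp.random(50, 8, density=0.4, format="csr", random_state=1)
full = np.vstack([slab for _, slab in cosine_sim_blocks(X, block_size=16)])
assert np.allclose(full, cosine_similarity(X))
```

Each dense slab costs block_size × n_rows × 8 bytes, so for the 1,063,867-row matrix a block_size of a few hundred keeps every slab under a few gigabytes; dropping .toarray() and keeping the slab sparse reduces this further when rows overlap little.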

I hope I have given enough information about my original data, since I cannot really upload it here, but if not, please let me know! Many thanks for any input...

Thanks, Mark

