Jaccard comparison of the rows of a 2D list

Posted 2024-09-28 20:52:15


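This is the code I currently use for the Jaccard comparison, with comments describing it. I have a feeling that converting to numpy arrays and vectorizing could speed it up, but I'm not sure how best to do that. Incidentally, many of the values in the output array are 0, which means the output is a sparse matrix.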

import numpy as np
#values of list1 can be anywhere between 1-25,000,000 (not all values are included) 
#I want to perform a jaccard comparison pairwise for each row of list1
list1=[[123123,34566,4634,3422],[236564,8543525,234234],
          [2356574,3453,23423,2342,234]...[12312,32523,345,345345234]]

#currently my code looks like this (and is quite slow for large list sizes):

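def jaccard(x, y):
    # Jaccard similarity: |x ∩ y| / |x ∪ y| (assumes x and y are not both empty)
    sx, sy = set(x), set(y)
    intersection_cardinality = len(sx & sy)
    union_cardinality = len(sx | sy)
    return intersection_cardinality / float(union_cardinality)

def returnJaccard(cids):
    lenList = len(cids)
    # np.empty would leave the diagonal uninitialized; start from the identity
    # instead, since a non-empty row has Jaccard similarity 1 with itself
    jarr = np.eye(lenList)
    for ix in range(lenList):
        # visit only the lower triangle (jx < ix) and mirror into the upper one
        for jx in range(ix):
            jc = jaccard(cids[ix], cids[jx])
            jarr[ix][jx] = jc
            jarr[jx][ix] = jc
    return jarr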

#output is an n x n matrix where n = len(list1), all values should be between 0 and 1
jaccard_compare = returnJaccard(list1)
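
Since most of the output is expected to be 0, one possible direction is to skip the pairs that share no value at all. The sketch below is not the numpy vectorization asked about; it swaps in an inverted index (value -> rows containing it) built with the standard library only. The name `sparse_jaccard` and the sample data are made up for illustration, and the result is a dict of nonzero entries rather than an n x n array.

```python
from collections import defaultdict
from itertools import combinations

def sparse_jaccard(rows):
    """Return {(i, j): similarity} with i > j, only for rows that overlap."""
    sets = [set(r) for r in rows]

    # Inverted index: value -> indices of the rows that contain it.
    # Rows are visited in order, so each index list is ascending.
    index = defaultdict(list)
    for i, s in enumerate(sets):
        for v in s:
            index[v].append(i)

    # For every value, each pair of rows containing it shares one more element.
    inter = defaultdict(int)
    for members in index.values():
        for j, i in combinations(members, 2):  # j < i because members ascend
            inter[(i, j)] += 1

    # |A ∪ B| = |A| + |B| - |A ∩ B|, so no union set has to be built.
    return {
        (i, j): c / (len(sets[i]) + len(sets[j]) - c)
        for (i, j), c in inter.items()
    }

sample = [[1, 2, 3], [2, 3, 4], [10, 11]]  # hypothetical data
result = sparse_jaccard(sample)
```

Pairs missing from the dict have similarity 0. This only pays off under the sparsity assumption: if a single value appears in a large share of the rows, the pair enumeration for that value becomes quadratic again.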
