Spark Udf需要时间运行

2024-10-03 11:19:37 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个余弦相似性的spark自定义函数。在

def cosineSimilarity(df):
    """ Cosine similarity of the each document with other

    """

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType
    from scipy.spatial import distance

    cosine = udf(lambda v1, v2: (
     float(1-distance.cosine(v1, v2)) if v1 is not None and v2 is not None else None),
     DoubleType())

    # Creating a cross product of the table to get the cosine similarity vectors 

    crosstabDF=df.withColumnRenamed('id','id_1').withColumnRenamed('w2v_vector','w2v_vector_1')\
    .join(df.withColumnRenamed('id','id_2').withColumnRenamed('w2v_vector','w2v_vector_2'))

    similardocsDF= crosstabDF.withColumn('cosinesim', cosine("w2v_vector_1","w2v_vector_1"))

    return similardocsDF

similardocsDF=cosineSimilarity(w2vdf.select('id','w2v_vector'))

similardocsDF.cache().take(4)

我的数据如下:

^{2}$

但是当我运行上面的代码时,它持续运行了2个小时,仍然没有完成。我在mac上安装的spark上运行过。总观测值为59K,文本文档的w2vec向量嵌入量为300维。在


Tags: thefromimportnoneiddfsparkv2