<pre><code>import numpy as np
import pyspark.sql.functions as psf
from pyspark.sql.types import FloatType

def cos_sim(a, b):
    # Cosine similarity of two rating vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dot_udf = psf.udf(cos_sim, FloatType())

# Self-join on differing users to get every ordered pair of users
data.alias("i").join(data.alias("j"), psf.col("i.user") != psf.col("j.user"))\
    .select(
        psf.col("i.user").alias("user1"),
        psf.col("j.user").alias("user2"),
        dot_udf("i.rating", "j.rating").alias("similarity"))\
    .sort("similarity")\
    .show()
</code></pre>
<p>The output is as required:</p>
<pre><code>+-----+-----+----------+
|user1|user2|similarity|
+-----+-----+----------+
|  u11|  u12|0.70710677|
|  u13|  u11|0.70710677|
|  u11|  u13|0.70710677|
|  u12|  u11|0.70710677|
|  u12|  u13|       1.0|
|  u13|  u12|       1.0|
+-----+-----+----------+
</code></pre>
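<p>To sanity-check the formula without a Spark session, here is a minimal pure-Python sketch of the same cosine similarity. The <code>ratings</code> dictionary is hypothetical (the original input data is not shown here); the vectors were chosen only so that the pairs come out near 0.707 and 1.0, as in the table above. It assumes no rating vector is all zeros, since that would divide by zero.</p>

```python
import math
from itertools import permutations


def cos_sim(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical rating vectors, for illustration only
ratings = {
    "u11": [1.0, 0.0],
    "u12": [1.0, 1.0],
    "u13": [2.0, 2.0],
}

# Every ordered pair of distinct users, mirroring the i.user != j.user join,
# sorted by ascending similarity like .sort("similarity")
pairs = sorted(
    ((u, v, cos_sim(ratings[u], ratings[v])) for u, v in permutations(ratings, 2)),
    key=lambda row: row[2],
)

for user1, user2, similarity in pairs:
    print(user1, user2, round(similarity, 8))
```

<p>Because the join condition is an inequality, each pair appears twice (once in each order). If only one row per pair is wanted, comparing with <code>&lt;</code> instead of <code>!=</code> is one option.</p>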