<p>Greetings, fellow programmers.</p>
<p>I recently started working with PySpark, coming from a pandas background. I need to compute the similarity between users in my data. Since I could not find a way to do this directly in PySpark, I resorted to using Python dictionaries to build a similarity DataFrame.</p>
<p>However, I don't know how to convert a nested dictionary into a PySpark DataFrame.
Could you point me in the right direction to achieve the expected result shown below?</p>
<pre><code>from pyspark.sql import SparkSession
from scipy.spatial import distance

spark = SparkSession.builder.getOrCreate()

traindf = spark.createDataFrame([
    ('u11', [1, 2, 3]),
    ('u12', [4, 5, 6]),
    ('u13', [7, 8, 9])
]).toDF("user", "rating")
traindf.show()
</code></pre>
<p>Output:</p>
<pre><code>+----+---------+
|user| rating|
+----+---------+
| u11|[1, 2, 3]|
| u12|[4, 5, 6]|
| u13|[7, 8, 9]|
+----+---------+
</code></pre>
<p>I want to compute the pairwise similarity between users and put the result in a PySpark DataFrame.</p>
<pre><code>parent_dict = {}
for parent_row in traindf.collect():
    child_dict = {}
    for child_row in traindf.collect():
        similarity = distance.cosine(parent_row['rating'], child_row['rating'])
        child_dict[child_row['user']] = similarity
    parent_dict[parent_row['user']] = child_dict
print(parent_dict)
</code></pre>
<p>Output:</p>
<pre><code>{'u11': {'u11': 0.0, 'u12': 0.0253681538029239, 'u13': 0.0405880544333298},
'u12': {'u11': 0.0253681538029239, 'u12': 0.0, 'u13': 0.001809107314273195},
'u13': {'u11': 0.0405880544333298, 'u12': 0.001809107314273195, 'u13': 0.0}}
</code></pre>
<p>From this dictionary, I want to construct a PySpark DataFrame like this:</p>
<pre><code>+-----+-----+--------------------+
|user1|user2| similarity|
+-----+-----+--------------------+
| u11| u11| 0.0|
| u11| u12| 0.0253681538029239|
| u11| u13| 0.0405880544333298|
| u12| u11| 0.0253681538029239|
| u12| u12| 0.0|
| u12| u13|0.001809107314273195|
| u13| u11| 0.0405880544333298|
| u13| u12|0.001809107314273195|
| u13| u13| 0.0|
+-----+-----+--------------------+
</code></pre>
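<p>One direct way to get that shape is to flatten the nested dict into <code>(user1, user2, similarity)</code> tuples and pass them to <code>spark.createDataFrame</code> — a minimal sketch, assuming the <code>parent_dict</code> shown above and an active <code>spark</code> session:</p>

```python
# Sketch: flatten the nested similarity dict into (user1, user2, similarity)
# tuples, one per inner key, then build the DataFrame from those rows.
parent_dict = {
    'u11': {'u11': 0.0, 'u12': 0.0253681538029239, 'u13': 0.0405880544333298},
    'u12': {'u11': 0.0253681538029239, 'u12': 0.0, 'u13': 0.001809107314273195},
    'u13': {'u11': 0.0405880544333298, 'u12': 0.001809107314273195, 'u13': 0.0},
}

rows = [(u1, u2, sim)
        for u1, children in parent_dict.items()
        for u2, sim in children.items()]

# With an active SparkSession this becomes the desired DataFrame:
# df = spark.createDataFrame(rows, ["user1", "user2", "similarity"])
```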
<p>So far, I have converted the dict to a pandas DataFrame and then converted that to a PySpark DataFrame. However, I need to do this at scale, so I am looking for a more Spark-native way.</p>
<pre><code>import pandas as pd

parent_user = []
child_user = []
child_similarity = []
for parent_row in traindf.collect():
    for child_row in traindf.collect():
        similarity = distance.cosine(parent_row['rating'], child_row['rating'])
        parent_user.append(parent_row['user'])
        child_user.append(child_row['user'])
        child_similarity.append(similarity)

my_dict = {
    'user1': parent_user,
    'user2': child_user,
    'similarity': child_similarity,
}
df = spark.createDataFrame(pd.DataFrame(my_dict))
df.show()
</code></pre>
<p>Output:</p>
<pre><code>+-----+-----+--------------------+
|user1|user2| similarity|
+-----+-----+--------------------+
| u11| u11| 0.0|
| u11| u12| 0.0253681538029239|
| u11| u13| 0.0405880544333298|
| u12| u11| 0.0253681538029239|
| u12| u12| 0.0|
| u12| u13|0.001809107314273195|
| u13| u11| 0.0405880544333298|
| u13| u12|0.001809107314273195|
| u13| u13| 0.0|
+-----+-----+--------------------+
</code></pre>