回答此问题可获得 20 贡献值,回答如果被采纳可获得 50 分。
<p>我有一个pyspark rdd,它可以收集为元组列表,如下所示:</p>
<pre><code>rdds = self.sc.parallelize([(("good", "spark"), 1), (("sood", "hpark"), 1), (("god", "spak"), 1),
(("food", "spark"), 1), (("fggood", "ssspark"), 1), (("xd", "hk"), 1),
(("good", "spark"), 7), (("good", "spark"), 3), (("good", "spark"), 4),
(("sood", "hpark"), 5), (("sood", "hpark"), 7), (("xd", "hk"), 2),
(("xd", "hk"), 1), (("fggood", "ssspark"), 2), (("fggood", "ssspark"), 1)], 6)
rdds.glom().collect()
def inner_map_1(p):
d = defaultdict(int)
for row in p:
d[row[0]] += row[1]
for item in d.items():
yield item
rdd2 = rdds.partitionBy(4, partitionFunc=lambda x: hash(x)).mapPartitions(inner_map_1)
print(rdd2.glom().collect())
def inner_map_2(p):
for row in p:
item = row[0]
sums = sum([num for _, num in row[1]])
yield item, sums
rdd3 = rdds.groupBy(lambda x: x[0]).mapPartitions(inner_map_2)
print(rdd3.glom().collect())
</code></pre>
<p>rdd2和rdd3是以不同的形式计算的,我得到了相同的结果,但我不确定rdd2和rdd3得到的结果是否相同,元素是否在同一个分区中</p>