PySpark's reduceByKey does not work as expected

from pyspark import SparkConf, SparkContext

APP_NAME = 'Test App'


def main(sc):
    # One key (0) with 100 single-element lists as values
    test = [(0, [i]) for i in range(100)]
    test = sc.parallelize(test)
    test = test.reduceByKey(method)
    print(test.collect())


def method(x, y):
    # Appends only the first element of y to x, mutating x in place
    x.append(y[0])
    return x


if __name__ == '__main__':
    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster('local[*]')
    sc = SparkContext(conf=conf)

    main(sc)

1 Answer

Answer #1 · posted 2024-10-06 14:20:45

First of all, it looks like you actually want groupByKey rather than reduceByKey:

rdd = sc.parallelize([(0, i) for i in range(100)])
grouped = rdd.groupByKey()
k, vs = grouped.first()
assert len(list(vs)) == 100

You asked: "Could someone please help me understand why this output is being generated?"

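reduceByKey assumes that the function passed to it is associative, and your method clearly is not. The output differs depending on the order in which the operations are applied. Suppose you start with the following data for a certain key: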

[1], [2], [3], [4]

Now let's add some parentheses:

((([1], [2]), [3]), [4])
(([1, 2], [3]), [4])
([1, 2, 3], [4])
[1, 2, 3, 4]

And with another set of parentheses:

(([1], ([2], [3])), [4])
(([1], [2, 3]), [4])
([1, 2], [4])
[1, 2, 4]
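
The two groupings above can be reproduced in plain Python, without Spark. This is only a minimal sketch: `method` is the function from the question, while `left_nested` and `other_grouping` are names made up here for illustration. Every call gets fresh list literals because `method` mutates its first argument.

```python
def method(x, y):
    # Same logic as in the question: append only the first element of y to x.
    x.append(y[0])
    return x


# Grouping ((([1], [2]), [3]), [4])
left_nested = method(method(method([1], [2]), [3]), [4])

# Grouping (([1], ([2], [3])), [4])
other_grouping = method(method([1], method([2], [3])), [4])

print(left_nested)     # [1, 2, 3, 4]
print(other_grouping)  # [1, 2, 4]
```

The element 3 is lost in the second grouping because `method` only ever looks at `y[0]`, so anything already merged into `y` beyond its first element is dropped.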

When you rewrite it as follows:

method = lambda x, y: x + y

or simply

from operator import add
method = add

you get an associative function, and it works as expected.
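
As a quick check in plain Python (a sketch, not a Spark run), list concatenation gives the same result for both groupings used earlier in this answer. The variable names are illustrative only.

```python
from operator import add

# Grouping ((([1], [2]), [3]), [4]) with list concatenation
left_nested = add(add(add([1], [2]), [3]), [4])

# Grouping (([1], ([2], [3])), [4]) with list concatenation
other_grouping = add(add([1], add([2], [3])), [4])

print(left_nested)     # [1, 2, 3, 4]
print(other_grouping)  # [1, 2, 3, 4]
```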

In general, for reduce* operations you need a function that is both associative and commutative.
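
To illustrate the commutativity part, here is a small plain-Python sketch (names are made up for illustration). List concatenation is associative but not commutative, so if partial results happen to be combined in a different order, the elements stay the same but their order may not. Integer addition satisfies both properties.

```python
from operator import add

a, b = [1, 2], [3, 4]

concat_ab = add(a, b)  # [1, 2, 3, 4]
concat_ba = add(b, a)  # [3, 4, 1, 2]

# Concatenation depends on argument order, integer addition does not.
commutes_for_lists = concat_ab == concat_ba  # False
commutes_for_ints = add(1, 2) == add(2, 1)   # True

# The same elements are present either way; only their order differs.
same_elements = sorted(concat_ab) == sorted(concat_ba)  # True
```

If the order of the collected values matters, sorting the result afterwards is one option. Whether that is needed depends on your use case.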
