What is the difference between PythonRDD and ParallelCollectionRDD - Q&A - Python中文网

Posted 2024-10-02 22:37:08

I am learning how to write Spark programs in Python and am stuck on a problem.

I have a PythonRDD loaded as (id, description) pairs:

pythonRDD.take(1)
## [('b000jz4hqo', ['clickart', '950', '000', 'premier', 'image', 'pack', 'dvd', 'rom', 'broderbund'])]

and a ParallelCollectionRDD, also loaded as (id, description) pairs:

^{pr2}$

On the ParallelCollectionRDD I can count the tokens like this:

paraRDD.map(lambda l: (l[0],len(l[1]))).reduce(lambda a,b: a[1] + b[1])

or simply:

paraRDD.reduce(lambda a,b: len(a[1]) + len(b[1]))

But on the PythonRDD it fails with this error:

"TypeError: 'int' object has no attribute '__getitem__'".

def countTokens(vendorRDD):
    return vendorRDD.map(lambda l: (l[0],len(l[1]))).reduce(lambda a,b: a[1] + b[1])

Any idea why this happens?


1 answer

User

#1 · Posted 2024-10-02 22:37:08

The difference between PythonRDD and ParallelCollectionRDD is completely irrelevant here. Your code is simply wrong.

The reduce method takes an associative and commutative function with the following signature:

(T, T) => T

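In other words, both arguments and the returned object must be of the same type, and the order of operations and the way they are parenthesized must not affect the final result. The function you pass to reduce does not satisfy these conditions.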
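To see why the function in the question fails, here is a minimal sketch that uses plain Python's `functools.reduce` as a stand-in for `RDD.reduce`. That substitution is an assumption made so the example runs without Spark, and the sample records are hypothetical. After the first step the accumulator is an `int`, so `a[1]` on the next step raises a `TypeError`.

```python
from functools import reduce

# Hypothetical sample records shaped like the (id, tokens) pairs in the question.
records = [
    ("id1", ["clickart", "950", "000"]),
    ("id2", ["premier", "image"]),
    ("id3", ["dvd", "rom", "broderbund", "pack"]),
]

pairs = [(key, len(tokens)) for key, tokens in records]

def bad_combine(a, b):
    # Takes two tuples but returns an int, so it is not (T, T) => T.
    return a[1] + b[1]

try:
    reduce(bad_combine, pairs)
    error = None
except TypeError as exc:
    # The second step calls bad_combine(5, ("id3", 4)); indexing the int 5 fails.
    error = exc

# With exactly two records the intermediate result is never fed back in,
# so the type mismatch goes unnoticed.
two_record_total = reduce(bad_combine, pairs[:2])

def good_combine(a, b):
    # int in, int out: arguments and result share one type.
    return a + b

total = reduce(good_combine, [count for _, count in pairs])
```

Whether a given RDD triggers the error depends on how many elements end up being combined. That may be why one RDD appeared to work while the other failed, though this is an inference and not something the question confirms.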

To make it work, you need something like this:

^{pr2}$

Or even better:

from operator import add

rdd.values().map(len).reduce(add)
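The requirement that ordering and grouping must not change the result can also be sketched in plain Python. The `chunked` helper below is hypothetical and only imitates records being split across partitions; it is not Spark's partitioning logic, and the records are made up for illustration.

```python
from functools import reduce
from operator import add

# Hypothetical (id, tokens) records, shaped like those in the question.
records = [
    ("id1", ["clickart", "950", "000"]),
    ("id2", ["premier", "image"]),
    ("id3", ["dvd", "rom", "broderbund", "pack"]),
    ("id4", ["broderbund"]),
]

def chunked(items, size):
    """Split items into consecutive chunks (an illustrative stand-in for partitions)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def count_tokens(partitions):
    # Similar in spirit to rdd.values().map(len).reduce(add):
    # reduce inside each partition, then reduce the partial results.
    partials = [reduce(add, (len(tokens) for _, tokens in part)) for part in partitions]
    return reduce(add, partials)

# Collect the total for several groupings and for reversed order.
# Because add is associative and commutative, they should all agree.
totals = {count_tokens(chunked(records, size)) for size in (1, 2, 3, 4)}
totals.add(count_tokens(chunked(records[::-1], 3)))
```

Every grouping yields the same total because both the per-partition step and the merge step work on integers with `add`, which satisfies the (T, T) => T contract.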
