<p>I am new to pyspark, and I have the following script:</p>
<pre><code>joinedRatings = ratings.join(ratings)
joinedRatings.take(4)
</code></pre>
<p>The output is:</p>
<pre><code>[(196, ((242, 3.0), (242, 3.0))), (196, ((242, 3.0), (393, 4.0))), (196, ((242, 3.0), (381, 4.0))), (196, ((242, 3.0), (251, 3.0)))]
</code></pre>
<p>After that, I have a function, which is:</p>
<pre><code>def filterDuplicates(userRatings):
    ratings = userRatings[1]
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return movie1 < movie2
</code></pre>
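<p>As a plain-Python sketch (no Spark needed), here is how <code>filter</code> would apply <code>filterDuplicates</code> to each element; the <code>pairs</code> list below is copied from the <code>joinedRatings.take(4)</code> output above:</p>

```python
# Sample elements copied from the joinedRatings.take(4) output above.
pairs = [
    (196, ((242, 3.0), (242, 3.0))),
    (196, ((242, 3.0), (393, 4.0))),
    (196, ((242, 3.0), (381, 4.0))),
    (196, ((242, 3.0), (251, 3.0))),
]

def filterDuplicates(userRatings):
    ratings = userRatings[1]         # the ((movie1, r1), (movie2, r2)) value
    (movie1, rating1) = ratings[0]
    (movie2, rating2) = ratings[1]
    return movie1 < movie2           # keep only ordered pairs

# filter calls the function once per element; only True results are kept.
kept = [p for p in pairs if filterDuplicates(p)]
print(kept)  # the (242, 242) self-pair is dropped; the other three remain
```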
<p>Then I build this RDD:</p>
<pre><code>uniqueJoinedRatings = joinedRatings.filter(filterDuplicates)
</code></pre>
<p>My problem is understanding how this function I wrote actually runs, so I tried:</p>
<pre><code>joinedRatings[1]
</code></pre>
<p>The error I get is:</p>
<pre><code> Fail to execute line 1: joinedRatings[1]
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-240579357005199320.py", line 380, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 1, in <module>
TypeError: 'PipelinedRDD' object does not support indexing
</code></pre>
<p>But the same kind of indexing runs without any problem inside the <code>def filterDuplicates(userRatings):</code> function. Please tell me how I can inspect the value of <code>joinedRatings[1]</code>?</p>
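<p>A minimal sketch of the distinction, using a plain Python list to stand in for the RDD's data: an RDD is distributed and has no index operator, which is why <code>joinedRatings[1]</code> raises <code>TypeError</code>, whereas inside <code>filterDuplicates</code> the indexing is done on an ordinary tuple that Spark passes in one element at a time. To look at a specific element you first pull a prefix into a Python list with <code>take()</code> (on the real RDD: <code>joinedRatings.take(2)[1]</code>), since a list does support indexing:</p>

```python
# Assumption: `pairs` mirrors the first elements of joinedRatings;
# on the real RDD you would call joinedRatings.take(2)[1] instead.
pairs = [
    (196, ((242, 3.0), (242, 3.0))),
    (196, ((242, 3.0), (393, 4.0))),
]

# take(n) returns an ordinary Python list, which DOES support indexing:
second = pairs[:2][1]
print(second)  # (196, ((242, 3.0), (393, 4.0)))
```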