<p>Instead of <code>rdd.mapPartitions</code>, you can use the newer pandas <a href="https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#grouped-map" rel="nofollow noreferrer">grouped map UDF</a> directly on the DataFrame. The function itself receives each group as a pandas DataFrame and must return a pandas DataFrame.</p>
<p>When it is used with the Spark DataFrame <code>apply</code> API, Spark automatically combines the per-group result DataFrames into a new Spark DataFrame.</p>
<pre><code>from pyspark.sql.functions import pandas_udf, PandasUDFType

# a grouped-map pandas_udf receives each whole group as a pandas DataFrame
# and must also return a pandas DataFrame;
# the schema string describes the returned DataFrame's columns --
# in this example the result has two columns: id (long) and value (double)
@pandas_udf("id long, value double", PandasUDFType.GROUPED_MAP)
def some_function(pdf):
    # some_pdf_func is the per-group transformation (defined elsewhere)
    return pdf.apply(some_pdf_func)

df.groupby(df.partition_key).apply(some_function).show()
</code></pre>
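<p>To see the per-group semantics without a Spark cluster, here is a pandas-only sketch of what Spark does conceptually: split by key, hand each group to the function as a DataFrame, then concatenate the results. The column names and the doubling logic are illustrative assumptions, not part of the original code:</p>

```python
import pandas as pd

# toy data: two groups keyed by partition_key (illustrative assumption)
pdf = pd.DataFrame({
    "partition_key": [1, 1, 2],
    "id": [10, 11, 20],
    "value": [1.0, 2.0, 3.0],
})

def some_function(group):
    # per-group transformation: here we just double "value" (illustrative)
    out = group[["id", "value"]].copy()
    out["value"] = out["value"] * 2.0
    return out

# mirrors Spark's grouped-map behavior: each group arrives as a DataFrame
# and the per-group results are combined into one result
result = pd.concat(some_function(g) for _, g in pdf.groupby("partition_key"))
print(result)
```

<p>Spark does the same thing, but each group is processed on an executor and the results are assembled into a distributed DataFrame rather than a local one.</p>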