Coalesce reduces the parallelism of the entire stage (Spark)

val df = sparkContext.parallelize((1 to 500).map(i => scala.util.Random.nextDouble), 100).toDF("value")

val expensiveUDF = udf((d: Double) => { Thread.sleep(100); d })

val df_result = df
  .withColumn("udfResult", expensiveUDF($"value"))

df_result
  .coalesce(1)
  .saveAsTable(tablename)

1 Answer

User

#1 · Posted on 2024-10-01 09:33:54

Actually, this is not caused by a Spark SQL optimization. Spark SQL does not change the position of the Coalesce operator, as the executed plan shows:

Coalesce 1
+- *Project [value#2, UDF(value#2) AS udfResult#11]
   +- *SerializeFromObject [input[0, double, false] AS value#2]
      +- Scan ExternalRDDScan[obj#1]
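A plan like the one above can be inspected from the DataFrame itself. The following is only a minimal sketch, assuming Spark 2.x or later running with a local master; the object name `ShowCoalescePlan` and the helper `planText` are illustrative and not part of the question, and the UDF is a cheap stand-in for the expensive one.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object ShowCoalescePlan {
  // Builds the same shape of query as in the question and returns the physical plan as text.
  def planText(spark: SparkSession): String = {
    import spark.implicits._
    val df = spark.sparkContext
      .parallelize((1 to 500).map(_ => scala.util.Random.nextDouble), 100)
      .toDF("value")
    val identityUDF = udf((d: Double) => d) // stand-in for the expensive UDF
    df.withColumn("udfResult", identityUDF($"value"))
      .coalesce(1)
      .queryExecution.executedPlan.toString
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]") // assumption: local run for illustration
      .appName("show-coalesce-plan")
      .getOrCreate()
    println(planText(spark))
    spark.stop()
  }
}
```

Calling `explain()` on the DataFrame prints the same information directly to the console.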

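To quote a paragraph from the description of the coalesce API:

Note: the paragraph quoted below was added by JIRA SPARK-19399, so it will not be found in the 2.0 API documentation.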

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

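The coalesce API does not perform a shuffle; instead it creates a narrow dependency between the previous RDD and the current RDD. Because RDDs are lazily evaluated, the upstream computation is actually carried out on the coalesced partitions, so with coalesce(1) the UDF runs in a single task.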
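The effect can be made visible by recording which task evaluates an upstream `map`. This is a rough sketch, assuming a local master; the object name `CoalesceVsRepartition` and the helper `upstreamTaskCount` are made up for illustration.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

object CoalesceVsRepartition {
  // Counts how many distinct tasks evaluated the upstream map().
  def upstreamTaskCount(spark: SparkSession, shuffle: Boolean): Int = {
    val tagged = spark.sparkContext
      .parallelize(1 to 500, 100)               // 100 input partitions, as in the question
      .map(_ => TaskContext.getPartitionId())   // tag each element with the ID of the task that ran the map
    // repartition adds a shuffle boundary; coalesce only adds a narrow dependency
    val merged = if (shuffle) tagged.repartition(1) else tagged.coalesce(1)
    merged.collect().distinct.length
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]") // assumption: local run for illustration
      .appName("coalesce-vs-repartition")
      .getOrCreate()
    println(s"coalesce(1):    ${upstreamTaskCount(spark, shuffle = false)} upstream task(s)")
    println(s"repartition(1): ${upstreamTaskCount(spark, shuffle = true)} upstream task(s)")
    spark.stop()
  }
}
```

With `coalesce(1)` the map is pulled into the single downstream task, so only one task ID should appear; with `repartition(1)` the map stays in its own stage and all 100 upstream tasks should show up.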

To prevent this, you should use the repartition API.
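Applied to the code in the question, the change could look roughly like this. It is a sketch under assumptions rather than a drop-in fix: it assumes Spark 2.x or later (where the write goes through `df.write`), a local master, and a hypothetical table name `udf_result_demo`; the object and helper names are illustrative.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf

object RepartitionBeforeWrite {
  // Same pipeline as in the question, but ending in repartition(1) instead of coalesce(1).
  def buildResult(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = spark.sparkContext
      .parallelize((1 to 500).map(_ => scala.util.Random.nextDouble), 100)
      .toDF("value")
    val expensiveUDF = udf((d: Double) => { Thread.sleep(100); d })
    df.withColumn("udfResult", expensiveUDF($"value"))
      .repartition(1) // shuffle boundary: the UDF runs on the 100 upstream partitions
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]") // assumption: local run for illustration
      .appName("repartition-before-write")
      .getOrCreate()
    // "udf_result_demo" is a hypothetical table name
    buildResult(spark).write.mode("overwrite").saveAsTable("udf_result_demo")
    spark.stop()
  }
}
```

The price is one extra shuffle of the UDF output, which is usually small next to the cost of running an expensive UDF in a single task.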
