How to change the FileOutputCommitter algorithm version in PySpark

1 answer

Answer 1 · Posted 2024-10-04 01:33:31

but pyspark still writes the data in S3 using version 1 (temporary folders are being created).

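First, both the v1 and v2 algorithms use temporary files, so seeing temporary folders does not by itself indicate v1. As described in MAPREDUCE-6336: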

Algorithm version 2 changes the behavior of commitTask, recoverTask, and commitJob.
commitTask renames all files in $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/
recoverTask is a nop strictly speaking, but for upgrade from version 1 to version 2 case, it checks if there are any files in $joboutput/_temporary/($appAttemptID - 1)/$taskID/ and renames them to $joboutput/
commitJob deletes $joboutput/_temporary and writes $joboutput/_SUCCESS

So make sure that the behavior you are actually observing corresponds to v1 rather than v2.
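To make the difference easier to spot, here is a rough local-filesystem simulation of the two commit protocols. It is only a sketch and not Hadoop's implementation: the v2 steps follow the MAPREDUCE-6336 description quoted above, while the v1 steps (task commit moves the attempt directory to $joboutput/_temporary/$appAttemptID/$taskID, job commit merges those directories into $joboutput) are my understanding of the v1 behavior and should be treated as an assumption. All function names are hypothetical.

```python
import os
import shutil
import tempfile


def write_task_output(job_output, app_attempt, task_attempt, filename):
    """Stage a file where a running task attempt writes it (same for v1 and v2)."""
    attempt_dir = os.path.join(
        job_output, "_temporary", str(app_attempt), "_temporary", task_attempt
    )
    os.makedirs(attempt_dir, exist_ok=True)
    with open(os.path.join(attempt_dir, filename), "w") as f:
        f.write("data\n")
    return attempt_dir


def commit_task(job_output, app_attempt, task_id, attempt_dir, version):
    if version == 1:
        # v1 (assumed): the attempt directory becomes a per-task directory
        # that still lives under _temporary.
        committed = os.path.join(job_output, "_temporary", str(app_attempt), task_id)
        shutil.move(attempt_dir, committed)
    else:
        # v2 (per the quoted text): files go straight to the job output.
        for name in os.listdir(attempt_dir):
            shutil.move(os.path.join(attempt_dir, name), os.path.join(job_output, name))
        os.rmdir(attempt_dir)


def commit_job(job_output, app_attempt, version):
    if version == 1:
        # v1 (assumed): merge every committed task directory into the output.
        job_attempt_dir = os.path.join(job_output, "_temporary", str(app_attempt))
        for task_id in os.listdir(job_attempt_dir):
            if task_id == "_temporary":
                continue  # leftover staging area for uncommitted attempts
            task_dir = os.path.join(job_attempt_dir, task_id)
            for name in os.listdir(task_dir):
                shutil.move(os.path.join(task_dir, name), os.path.join(job_output, name))
    # Both versions: delete _temporary and write the _SUCCESS marker.
    shutil.rmtree(os.path.join(job_output, "_temporary"))
    open(os.path.join(job_output, "_SUCCESS"), "w").close()


def simulate(version):
    """Return (data files visible after task commit, listing after job commit)."""
    job_output = tempfile.mkdtemp()
    try:
        attempt_dir = write_task_output(job_output, 0, "attempt_0", "part-00000")
        commit_task(job_output, 0, "task_0", attempt_dir, version)
        visible = sorted(n for n in os.listdir(job_output) if not n.startswith("_"))
        commit_job(job_output, 0, version)
        return visible, sorted(os.listdir(job_output))
    finally:
        shutil.rmtree(job_output)
```

In this simulation both versions create `_temporary` while tasks run. The observable difference is when data files appear directly under the output directory: after each task commit in v2, but only after the job commit in v1.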

Also, spark.hadoop.* options apply to the context, not to a specific write operation, so your third attempt should not work at all.

The remaining attempts should be equivalent (the second one only if the option is set before the SparkContext is started).
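As an illustration of setting the option at the context level, here is a minimal sketch. The attempts from the question are not reproduced on this page, so this is not a reconstruction of any of them. The helper names `committer_conf` and `build_session` and the application name are hypothetical; the configuration key is the standard Hadoop property with the `spark.hadoop.` prefix.

```python
COMMITTER_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"


def committer_conf(version):
    """Build the context-level setting for the requested algorithm version."""
    if version not in (1, 2):
        raise ValueError("algorithm version must be 1 or 2")
    return {COMMITTER_KEY: str(version)}


def build_session(version=2, app_name="committer-demo"):
    """Create a SparkSession with the committer version applied at startup."""
    # Imported lazily so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in committer_conf(version).items():
        builder = builder.config(key, value)
    # Caveat: getOrCreate() returns an already running session if one exists,
    # in which case the context was started without this option.
    return builder.getOrCreate()
```

After creating the session, `spark.sparkContext.getConf().get(COMMITTER_KEY)` can be used to check which value the running context actually picked up.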
