
Why is reading .gz files with Spark textFile in Java much slower than from the spark-shell in Scala?

I ran a very simple query from the spark-shell (Spark 1.6.2) to count the rows in a bunch of .gz files, like so:

sc.textFile("s3a://bucket-name/prod/data-source/2017-05-05/*/").count

2017-06-06 21:33:21 INFO SparkContext:58 - Created broadcast 2 from textFile at :28
2017-06-06 21:33:24 INFO FileInputFormat:249 - Total input paths to process : 1700
2017-06-06 21:33:24 INFO SparkContext:58 - Starting job: count at :28
2017-06-06 21:33:24 INFO SparkContext:58 - Created broadcast 3 from broadcast at DAGScheduler.scala:1006
res1: Long = 433733191

The count returned in less than 10 seconds.

However, when I wrote the same logic in Java in a test application, it took about 3 minutes:

new SparkContextFactory(appName).textFile("s3a://bucket-name/prod/data-source/2017-05-05/*/").count()

Here is a snippet of the log output:

2017-06-06 22:40:20 INFO Executor:58 - Running task 195.0 in stage 0.0 (TID 195)
2017-06-06 22:40:20 INFO HadoopRDD:58 - Input split: s3a://bucket-name/prod/data-source/2017-05-05/20170505_1493980530/part-r-00095.gz:0+3732508
2017-06-06 22:40:20 INFO CodecPool:181 - Got brand-new decompressor [.gz]
2017-06-06 22:40:21 INFO Executor:58 - Finished task 194.0 in stage 0.0 (TID 194). 2336 bytes result sent to driver
2017-06-06 22:40:21 INFO Executor:58 - Running task 196.0 in stage 0.0 (TID 196)
2017-06-06 22:40:21 INFO HadoopRDD:58 - Input split: s3a://bucket-name/prod/data-source/2017-05-05/20170505_1493980530/part-r-00096.gz:0+3727204
2017-06-06 22:40:21 INFO CodecPool:181 - Got brand-new decompressor [.gz]
2017-06-06 22:40:22 INFO Executor:58 - Finished task 195.0 in stage 0.0 (TID 195). 2336 bytes result sent to driver
2017-06-06 22:40:22 INFO Executor:58 - Running task 197.0 in stage 0.0 (TID 197)
2017-06-06 22:40:22 INFO HadoopRDD:58 - Input split: s3a://bucket-name/prod/data-source/2017-05-05/20170505_1493980530/part-r-00097.gz:0+3734183
2017-06-06 22:40:22 INFO CodecPool:181 - Got brand-new decompressor [.gz]
2017-06-06 22:40:22 INFO Executor:58 - Finished task 196.0 in stage 0.0 (TID 196). 2336 bytes result sent to driver

There are multiple subfolders under s3a://bucket-name/prod/data-source/2017-05-05/, and each subfolder contains hundreds of .gz files of about 4 MB each.
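For context on the one-task-per-file pattern in the log above: gzip is not a splittable codec, so each ~4 MB file has to be decompressed from the start by a single sequential reader, which is what each Spark task is doing. A minimal stdlib-only sketch of that per-file work (class and method names are illustrative, not part of my application):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

public class GzipLineCount {
    // Count lines in a single .gz file. The compressed stream must be
    // read sequentially from byte 0 -- gzip has no block index or sync
    // markers that would let two readers split one file between them,
    // which is why Spark assigns one whole file per task.
    static long countGzipLines(Path gzFile) throws Exception {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)),
                StandardCharsets.UTF_8))) {
            return r.lines().count();
        }
    }
}
```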

What puzzles me is why there is such a large time difference between the two approaches. I would appreciate it if someone could shed some light on this.


0 Answers