Spark Java错误:大小超过Integer.MAX_VALUE

2024-09-27 01:19:42 发布

您现在位置:Python中文网/ 问答频道 /正文

我想用spark来完成一些简单的机器学习任务。 我用pyspark和spark 1.2.0做了一个简单的logistic回归问题。 我有120万张训练记录,我对这些记录的特点进行了分析。 当我将散列特征数设置为1024时,程序工作正常,但当我将散列特征数设置为16384时,程序多次失败,并出现以下错误:

Py4JJavaError: An error occurred while calling o84.trainLogisticRegressionModelWithSGD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 (TID 9, workernode0.sparkexperience4a7.d5.internal.cloudapp.net): java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
    at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:307)
    at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
    at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
    at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:124)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:91)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)

    at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:156)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:93)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

当我在将数据传输到LabeledPoint之后使用gd训练logisticRegression时,会发生此错误。

有人知道吗?

我的代码如下(我使用的是IPython笔记本):

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD
from numpy import array
from sklearn.feature_extraction import FeatureHasher
from pyspark import SparkContext
sf = SparkConf().setAppName("test").set("spark.executor.memory", "50g").set("spark.cores.max", 30)
sc = SparkContext(conf=sf)
training_file = sc.textFile("train_small.txt")
def hash_feature(line):
    values = [0, dict()]
    for index, x in enumerate(line.strip("\n").split('\t')):
        if index == 0:
            values[0] = float(x)
        else:
            values[1][str(index)+"_"+x] = 1
    return values
n_feature = 2**14
hasher = FeatureHasher(n_features=n_feature)
training_file_hashed = training_file.map(lambda line: [hash_feature(line)[0], hasher.transform([hash_feature(line)[1]])])
def build_lable_points(line):
    values = [0.0] * n_feature
    for index, value in zip(line[1].indices, line[1].data):
        values[index] = value
    return LabeledPoint(line[0], values)
parsed_training_data = training_file_hashed.map(lambda line: build_lable_points(line))
model = LogisticRegressionWithSGD.train(parsed_training_data)

执行最后一行时出错。


Tags: ioorgapachelinechannelnetworkjavaat
3条回答

在某些情况下,它会尝试存储功能,而1.2M*16384大于Integer.MAX_INT,因此您试图存储的功能超过Spark支持的最大大小。

你可能会遇到Apache Spark的极限。

增加分区数可能会导致Active tasks is a negative number in Spark UI,这可能意味着分区数太高。

Integer.MAX_INT限制是存储文件的大小。1.2米的行数不是什么大事,我不确定你的问题是“火花的极限”。更可能的是,你的工作的某些部分是创造一些太大的东西,任何给定的执行者都无法处理。

我不是Python程序员,但是当您“散列记录的特性”时,您可能会为一个示例获取一组非常稀疏的记录,并创建一个非稀疏数组。这意味着16384个功能需要大量内存。尤其是当你这样做的时候。唯一没有让你内存不足的原因是你似乎已经配置了大量的内存(50G)。

另一个可能有用的方法是增加分区。因此,如果不能使行使用更少的内存,至少可以尝试在任何给定任务上使用更少的行。创建的任何临时文件都可能依赖于此,因此您不太可能达到文件限制。


而且,与错误完全无关,但与你试图做的事情相关:

16384确实是一个很大的特性,在乐观的情况下,每个特性都只是一个布尔特性,您总共有2^16384个可能的排列可以学习,这是一个很大的数字(请在这里尝试:https://defuse.ca/big-number-calculator.htm)。

很有可能没有一种算法能够在只有120万个样本的情况下学习决策边界,您可能需要至少几万亿个样本才能在这样的特征空间上做出让步。机器学习有它的局限性,所以如果你不能得到比随机精度更好的结果,不要惊讶。

我绝对建议你先尝试一下降维!!

相关问题 更多 >

    热门问题