PySpark: relationship between spark.yarn.executor.memoryOverhead and spark.executor.pyspark.memory

Posted 2024-09-28 20:59:22


I ran into a problem while serving an ML model with PySpark 2.4.0 and MLflow.

The executors fail with the following exception:

org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by query. Memory leaked: (2048) Allocator(stdin reader for ./my-job-impl-condaenv.tar.gz/bin/python) 0/2048/8194/9223372036854775807 (res/actual/peak/limit)

From articles about PySpark I have learned the following:

  1. Spark runs at least one Python process per executor core
  2. the spark.executor.memory setting only limits the JVM memory and does not affect the Python processes
  3. the Python worker processes consume memory from the executor overhead, configured via spark.yarn.executor.memoryOverhead
  4. since Spark 2.4.0 we can reserve memory for the Python workers explicitly with spark.executor.pyspark.memory, which lets us plan memory more precisely and stop over-allocating it through spark.yarn.executor.memoryOverhead (see the configuration sketch right after this list)
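To make the list concrete, this is roughly how I pass these settings; the numbers (and the app name) below are placeholders rather than the real values from my job:

    from pyspark.sql import SparkSession

    # Placeholder values, only to show which knobs are involved.
    spark = (
        SparkSession.builder
        .appName("mlflow-model-serving")
        # JVM heap of each executor (point 2 above)
        .config("spark.executor.memory", "4g")
        # headroom for non-JVM memory, in MiB; the Python workers live here
        # when spark.executor.pyspark.memory is not set (point 3 above)
        .config("spark.yarn.executor.memoryOverhead", "1024")
        # explicit budget for the Python workers, available since 2.4.0 (point 4 above)
        .config("spark.executor.pyspark.memory", "2g")
        .getOrCreate()
    )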

Here is what the official documentation says about spark.executor.pyspark.memory:

The amount of memory to be allocated to PySpark in each executor, in MiB unless otherwise specified. If set, PySpark memory for an executor will be limited to this amount. If not set, Spark will not limit Python's memory use and it is up to the application to avoid exceeding the overhead memory space shared with other non-JVM processes. When PySpark is run in YARN or Kubernetes, this memory is added to executor resource requests.
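So, as far as I understand, on YARN the container requested per executor ends up roughly as spark.executor.memory + spark.yarn.executor.memoryOverhead + spark.executor.pyspark.memory, and inside the container each Python worker caps itself at the configured amount. My understanding is that this self-cap is an address-space rlimit; a conceptual sketch of that mechanism (not PySpark's actual worker code, and with a made-up limit) would be:

    import resource

    # Conceptual sketch only: cap this process's virtual address space,
    # the way (as far as I understand) a PySpark worker caps itself
    # when spark.executor.pyspark.memory is set. 512 MiB is a made-up value.
    limit_bytes = 512 * 1024 * 1024
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))

    # Allocations beyond the limit now fail with MemoryError in this process.

Please correct me if that mental model is wrong.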

At first I simply increased the amount of memory with spark.yarn.executor.memoryOverhead, and the error finally went away.

Then I decided to do it properly and specified the memory for the Python workers with spark.executor.pyspark.memory, which brought the same error back (the two variants I tried are sketched below).
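Concretely, the two variants looked roughly like this; the numbers are placeholders, not the real values from my job, and each key/value pair is passed via --conf or builder.config as in the earlier sketch:

    # Attempt 1: only enlarge the overhead; the Python workers share this space.
    # This made the error go away.
    attempt_1 = {
        "spark.executor.memory": "4g",
        "spark.yarn.executor.memoryOverhead": "3072",  # MiB, placeholder value
    }

    # Attempt 2: give the Python workers their own explicit budget instead.
    # This brought the same "Memory was leaked by query" error back.
    attempt_2 = {
        "spark.executor.memory": "4g",
        "spark.yarn.executor.memoryOverhead": "1024",  # MiB, back to a smaller overhead
        "spark.executor.pyspark.memory": "2g",
    }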

So it seems that I do not correctly understand exactly how spark.executor.pyspark.memory works and how it relates to spark.yarn.executor.memoryOverhead.

I do not have much experience with PySpark, so I would appreciate any help in understanding the memory allocation process in PySpark. Thank you.


Tags: memory, process, executor, spark, pyspark