Data profiling on a Spark DataFrame

Published 2024-09-29 23:30:43


I am trying to run data profiling with the pandas-profiling library. I am pulling the data directly from Hive. This is the error I get:

Py4JJavaError: An error occurred while calling o114.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 14.0 failed 4 times, most recent failure: Lost task 2.3 in stage 14.0 (TID 65, bdgtr026x30h4.nam.nsroot.net, executor 11): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 15823824. To avoid this, increase spark.kryoserializer.buffer.max value.

I tried setting the Spark configuration from Python in a Jupyter notebook, but I still get the same error:

spark.conf.set("spark.kryoserializer.buffer.max", "512")
spark.conf.set('spark.kryoserializer.buffer.max.mb', 'val')

Am I missing any step in my code?

df = spark.sql('SELECT id,acct from tablename').cache()
report = ProfileReport(df.toPandas())

Tags: in, org, df, failure, apache, conf, buffer
1 Answer
Answer 1 · Posted 2024-09-29 23:30:43

Do not set the configuration from inside Jupyter after the fact; set it when you create the Spark session, because once a session has been created its configuration cannot be changed:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.kryoserializer.buffer.max", "512m") \
    .config('spark.kryoserializer.buffer', '512k') \
    .getOrCreate()

You can find the property details here.
