PySpark Celery task: toPandas() throws a PicklingError

Posted 2024-10-02 22:24:50


I have a web application that runs long-running PySpark tasks. I use Django and Celery to run these tasks asynchronously.

I have a piece of code that works fine when I execute it in the console, but when I run it as a Celery task I get many errors. First, my UDF does not work for some reason. I wrapped it in a try-except block, and it always falls into the except branch.

from dateutil.parser import parse
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

try:
    func = udf(lambda x: parse(x), DateType())
    spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
except Exception:
    raise ValueError("No valid date format found.")
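As an aside, a bare except that raises a fresh ValueError discards the real failure, which makes debugging under Celery much harder. The sketch below shows the same pattern with exception chaining so the underlying error still reaches the worker log; it uses a plain `datetime.strptime` stand-in (the format string and function names here are hypothetical, not from the original code):

```python
from datetime import datetime


def make_date_parser(fmt="%Y-%m-%d"):
    """Return a date parser that chains the real error onto the ValueError.

    Stand-in for the UDF lambda above; `fmt` is a hypothetical format.
    """
    def parse_date(value):
        try:
            return datetime.strptime(value, fmt).date()
        except Exception as exc:
            # `from exc` keeps the original exception attached, so the
            # Celery log shows what actually failed inside the UDF.
            raise ValueError("No valid date format found.") from exc

    return parse_date


parser = make_date_parser()
print(parser("2018-04-05"))  # 2018-04-05
```

With chaining, the traceback prints the original parse failure under "The above exception was the direct cause of ...", instead of only the generic message.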

Error: (the traceback for this first failure was not preserved in the post)

Also, I use toPandas() to convert the DataFrame so I can run some pandas functions on it, but it throws the following error:

[2018-04-05 07:46:29,701: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[ec267a9b-b482-492d-8404-70b489fbbfe7] raised unexpected: Py4JJavaError('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
Traceback (most recent call last):
  File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 146, in outlier_algorithm
    data_frame_new = data_frame_1.toPandas()
  File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/dataframe.py", line 1937, in toPandas
    if self.sql_ctx.getConf("spark.sql.execution.pandas.respectSessionTimeZone").lower() \
  File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/context.py", line 142, in getConf
    return self.sparkSession.conf.get(key, defaultValue)
  File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/conf.py", line 46, in get
    return self._jconf.get(key)
  File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: ('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
[2018-04-05 07:46:29,706: ERROR/MainProcess] Task handler raised error: <MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.>
Traceback (most recent call last):
  File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/pool.py", line 362, in workloop
    put((READY, (job, i, result, inqW_fd)))
  File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/queues.py", line 366, in put
    self.send_payload(ForkingPickler.dumps(obj))
  File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/reduction.py", line 56, in dumps
    cls(buf, protocol).dump(obj)
billiard.pool.MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.
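Note that the final MaybeEncodingError above is a secondary failure: Celery's result backend cannot pickle the Py4JJavaError instance itself, so the original error is masked. A hedged workaround, independent of fixing the Spark problem, is to catch unpicklable exceptions inside the task and re-raise a plain, picklable one (sketch; in the real task the caught type would be Py4JJavaError):

```python
import pickle


def run_safely(fn, *args, **kwargs):
    """Call fn; if it raises an exception Celery cannot pickle,
    replace it with a plain RuntimeError carrying the message."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        try:
            pickle.dumps(exc)  # would this survive the result backend?
        except Exception:
            # Not serializable: re-raise as a plain, picklable error.
            raise RuntimeError("{}: {}".format(type(exc).__name__, exc)) from None
        raise  # picklable: propagate unchanged


def demo():
    # A locally-defined exception class is unpicklable, which mimics
    # the Py4JJavaError behaviour seen in the log above.
    class JvmError(Exception):
        pass

    def failing_task():
        raise JvmError("An error occurred while calling o224.get")

    return run_safely(failing_task)
```

With this wrapper, the worker reports a readable RuntimeError instead of the PicklingError shown in the log.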

3 Answers

This is not going to work. Spark uses complex state, including JVM state, which cannot simply be serialized and sent to a worker. If you want to run your code asynchronously, use a thread pool to submit the jobs.
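A minimal sketch of that suggestion: keep all Spark work in the single driver process and hand jobs to threads instead of forked Celery workers (which cannot inherit the JVM gateway). `long_job` here is a hypothetical stand-in for the PySpark task:

```python
from concurrent.futures import ThreadPoolExecutor

# Threads share the driver process, so they can all use the same
# SparkSession; forked worker processes cannot.
executor = ThreadPoolExecutor(max_workers=4)


def long_job(n):
    # In the real application this would call toPandas() etc.
    return sum(range(n))


futures = [executor.submit(long_job, n) for n in (10, 100)]
results = [f.result() for f in futures]
print(results)  # [45, 4950]
```

Spark itself is thread-safe for job submission from one driver, so this pattern avoids the serialization problem entirely, at the cost of losing Celery's distribution and retry features.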

Answering my own question: it may be a bug in PySpark 2.3. I was using PySpark 2.3.0, and for some reason it did not work well with Python 3.5. I downgraded to PySpark 2.1.2 and everything works fine.

I ran into this problem and had a hard time tracking it down. It turns out this error occurs when the Spark version you are running does not match the PySpark version executing against it. In my case, I was running Spark 2.2.3.4 and trying to use PySpark 2.4.4. After I downgraded PySpark to 2.2.3, the problem went away. I then hit another issue caused by code that used PySpark features added after 2.2.3, but that is a separate problem.
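One hedged way to catch such a mismatch early is a startup check that compares the client and server versions on their major.minor components. In a real application the inputs would be `pyspark.__version__` and `SparkSession.version`; the helper below is a self-contained sketch:

```python
def versions_compatible(client_version, server_version):
    """Return True when the PySpark client and the Spark server agree
    on major.minor (e.g. pyspark.__version__ vs. spark.version)."""
    def major_minor(v):
        return tuple(v.split(".")[:2])

    return major_minor(client_version) == major_minor(server_version)


# The mismatch described in this answer:
print(versions_compatible("2.4.4", "2.2.3.4"))  # False
print(versions_compatible("2.2.3", "2.2.3.4"))  # True
```

Failing fast with a clear message at session startup is much friendlier than the opaque Py4JJavaError that surfaces later from deep inside toPandas().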
