Creating a Spark DataFrame from a Pandas DataFrame

Published 2024-10-01 11:30:42


I am trying to build a Spark DataFrame from a simple Pandas DataFrame. These are the steps I followed.

import pandas as pd
pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.printSchema()

So far, so good. The output is:

root
|-- Letters: string (nullable = true)

The problem appears when I try to print the DataFrame:

spark_df.collect()

结果是:

An error occurred while calling o158.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver): org.apache.spark.SparkException:
Error from python worker:
Error executing Jupyter command 'pyspark.daemon': [Errno 2] No such file or directory PYTHONPATH was:
/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/jars/spark-core_2.11-2.4.0.jar:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/: org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
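
What seems to happen here (my reading of the traceback, not something the trace spells out): PySpark launches each worker process with the interpreter named by PYSPARK_PYTHON, so with that variable set to jupyter the workers effectively run `jupyter pyspark.daemon`, and Jupyter rejects pyspark.daemon as an unknown subcommand. This can be confirmed from the notebook itself:

import os

# PySpark starts worker processes with the interpreter named by PYSPARK_PYTHON.
# If this prints 'jupyter', workers run `jupyter pyspark.daemon` and die,
# which matches the "Error executing Jupyter command" line in the traceback.
print(os.environ.get("PYSPARK_PYTHON"))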

Here are my Spark specs:

SparkSession - hive

SparkContext

Spark UI

Version: v2.4.0

Master: local[*]

AppName: PySparkShell

These are my environment variables:

export PYSPARK_PYTHON=jupyter

export PYSPARK_DRIVER_PYTHON_OPTS='lab'

Facts:

As the error mentions, it has to do with running pyspark from Jupyter. Running it with 'PYSPARK_PYTHON=python2.7' or 'PYSPARK_PYTHON=python3.6' works fine.
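
A setup that usually avoids this (a sketch of the common fix, not confirmed by the original post): keep PYSPARK_PYTHON pointing at a plain interpreter for the workers, and move Jupyter into the driver-only variable PYSPARK_DRIVER_PYTHON, mirroring the exports above:

export PYSPARK_PYTHON=python3.6
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'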


1 Answer

#1 · Posted 2024-10-01 11:30:42

Import and initialize findspark, create a Spark session, and then use the session object to convert the pandas DataFrame to a Spark DataFrame. Then add the new Spark DataFrame to the catalog. Tested and run in Jupyter 5.7.2 and Spyder 3.3.2 with python 3.6.6.

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
import pandas as pd

# Create a spark session
spark = SparkSession.builder.getOrCreate()

# Create pandas data frame and convert it to a spark data frame 
pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
spark_df = spark.createDataFrame(pandas_df)

# Add the spark data frame to the catalog
spark_df.createOrReplaceTempView('spark_df')

spark_df.show()
+-------+
|Letters|
+-------+
|      X|
|      Y|
|      Z|
+-------+

spark.catalog.listTables()
Out[18]: [Table(name='spark_df', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
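
Because createOrReplaceTempView registered the DataFrame in the catalog, it can also be queried with plain SQL through the same session (a small usage sketch; only the view name spark_df comes from the code above):

# Query the registered temporary view with SQL
spark.sql("SELECT Letters FROM spark_df WHERE Letters != 'X'").show()
+-------+
|Letters|
+-------+
|      Y|
|      Z|
+-------+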
