PySpark DataFrame operations

import os

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

datafile_csv = "test.csv"


def process_csv(abspath, sparkcontext):
    sqlContext = SQLContext(sparkcontext)
    df = sqlContext.read.load(os.path.join(abspath, datafile_csv),
                              format='com.databricks.spark.csv',
                              inferSchema='true')
    df.registerTempTable("currency")
    print("Dataframe:")
    display(df)

    # Don't know what to do here ????
    reshaped_df = df.groupby('_c0')
    display(reshaped_df)


if __name__ == "__main__":
    abspath = os.path.abspath(os.path.dirname(__file__))
    conf = (SparkConf()
            .setMaster("local[20]")
            .setAppName("Currency Parser")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)
    process_csv(abspath, sc)

1 answer

User

#1 · Posted 2024-09-30 22:27:23

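You are asking two questions. The first is an ETL question about loading the CSV correctly, which is best done in pandas (since your data structure is very narrow), for example: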

import pandas as pd
from pyspark.sql import SparkSession
from io import StringIO

spark = SparkSession.builder.getOrCreate()
TESTDATA = StringIO("""AHeader AValue, BHeader BValue, CHeader CValue""")

pandas_df = pd.read_csv(TESTDATA,  # replace with path to your csv
                        delim_whitespace=True,
                        lineterminator=",",
                        header=None,
                        names=['col1', 'col2'])
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()

+-------+------+
|   col1|  col2|
+-------+------+
|AHeader|AValue|
|BHeader|BValue|
|CHeader|CValue|
+-------+------+
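
To make the role of the two read_csv options concrete, here is a plain-Python sketch of the same split applied to the sample string: the comma acts as the record separator (like lineterminator=",") and any run of whitespace separates fields (like delim_whitespace=True). The helper name parse_pairs is hypothetical, and this only illustrates the idea; it is not how pandas implements its parser.

```python
def parse_pairs(text):
    """Split 'Header Value, Header Value, ...' into (header, value) tuples."""
    rows = []
    for record in text.split(","):   # "," plays the role of the line terminator
        fields = record.split()      # any run of whitespace separates the fields
        if fields:                   # skip empty records, e.g. after a trailing comma
            rows.append(tuple(fields))
    return rows


rows = parse_pairs("AHeader AValue, BHeader BValue, CHeader CValue")
print(rows)
```

Each tuple corresponds to one row of the two-column table shown above, with the header in col1 and the value in col2.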

The second question is about pivoting in Spark. pandas.read_csv() already puts the data into the shape you asked for, but if you need further reshaping, see here: http://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html?highlight=pivot#pyspark.sql.GroupedData.pivot
