在pyspark中创建数据帧后，如何重命名所有列，根据模式转换数据类型/从csv文件读取

from pyspark.sql.types import * from pyspark.sql.types import StructField from pyspark.sql import types testdata = [("aaaa",1,50.0,"05-APR-2020"), ("bbbb",2,100.0,"06-APR-2020")] dataschema = types.StructType([ types.StructField('col1', types.StringType(), True), types.StructField('col2', types.IntegerType(), True), types.StructField('col3', types.DoubleType(), True), types.StructField('col4', types.DateType(), True) ]) testdf2 = spark.createDataFrame( spark.sparkContext.parallelize(testdata), dataschema ) testdf2.printSchema() testdf2.show()

1条回答

网友

1楼 · 发布于 2024-06-23 19:37:00

默认情况下，Spark不会将字符串转换为date type

我们需要使用datetime模块定义输入数据，然后在使用schema spark读取时创建col4到datetype

Example:

import datetime
from pyspark.sql.types import *
from pyspark.sql.types import StructField
from pyspark.sql import types
testdata = [("aaaa",1,50.0,datetime.datetime.strptime('05-APR-2020','%d-%b-%Y')),
            ("bbbb",2,100.0,datetime.datetime.strptime('06-APR-2020','%d-%b-%Y'))]

dataschema = types.StructType([
        types.StructField('col1', types.StringType(), True),
        types.StructField('col2', types.IntegerType(), True),
        types.StructField('col3', types.DoubleType(), True),
        types.StructField('col4', types.DateType(), True)
    ])

testdf2 = spark.createDataFrame(
          spark.sparkContext.parallelize(testdata),
          dataschema
          )
testdf2.printSchema()
#root
# |  col1: string (nullable = true)
# |  col2: integer (nullable = true)
# |  col3: double (nullable = true)
# |  col4: date (nullable = true)


testdf2.show()
#+  +  +  -+     +
#|col1|col2| col3|      col4|
#+  +  +  -+     +
#|aaaa|   1| 50.0|2020-04-05|
#|bbbb|   2|100.0|2020-04-06|
#+  +  +  -+     +

另一种方法是为col4定义stringtype，然后使用to_date函数转换为date

dataschema = types.StructType([
        types.StructField('col1', types.StringType(), True),
        types.StructField('col2', types.IntegerType(), True),
        types.StructField('col3', types.DoubleType(), True),
        types.StructField('col4', types.StringType(), True)
    ])

testdata = [("aaaa",1,50.0,"05-APR-2020"),
               ("bbbb",2,100.0,"06-APR-2020")]

spark.createDataFrame(testdata,dataschema).withColumn("col4",to_date(col("col4"),"dd-MMM-yyyy")).printSchema()
#root
# |  col1: string (nullable = true)
# |  col2: integer (nullable = true)
# |  col3: double (nullable = true)
# |  col4: date (nullable = true)

spark.createDataFrame(testdata,dataschema).withColumn("col4",to_date(col("col4"),"dd-MMM-yyyy")).show()
#+  +  +  -+     +
#|col1|col2| col3|      col4|
#+  +  +  -+     +
#|aaaa|   1| 50.0|2020-04-05|
#|bbbb|   2|100.0|2020-04-06|
#+  +  +  -+     +

相关问题更多 >

编程相关推荐

热门问题

热门文章