PySpark: how do I read a CSV file with timestamps?

Posted on 2024-09-27 07:18:26


I have a .csv table that looks like this:

import pandas as pd
df = pd.read_csv('myFile.csv')
df.head(3)

            identifier                 identifier_type         timestamp           device_lat   device_lon
0   68d62a1b-b928-4225-b445-9607415905b3    gaid         2020-03-19 03:03:00 UTC    44.808169   -73.522956
1   1675a629-a010-44b6-98a9-72d04793821f    gaid         2020-03-18 21:15:42 UTC    42.103894   -76.799164
2   0fe7a0b7-028e-459e-b5d8-b59d31800b8e    gaid         2020-03-18 23:39:54 UTC    43.182028   -77.672017

I am reading it with PySpark:

from pyspark.sql.types import StructType, StructField, StringType, DateType, FloatType

schema = StructType([
        StructField("identifier", StringType(), True),
        StructField("identifier_type", StringType(), True),
        StructField("timestamp", DateType(), True),
        StructField("device_lat", FloatType(), True),
        StructField("device_lon", FloatType(), True)])

myTable = spark.read.format("csv").schema(schema).load('NY_data/f0.csv') 
myTable = myTable[myTable['device_lat']>0]
myTable.show(3)

+--------------------+---------------+----------+----------+----------+
|          identifier|identifier_type| timestamp|device_lat|device_lon|
+--------------------+---------------+----------+----------+----------+
|68d62a1b-b928-422...|           gaid|2020-03-19|  44.80817| -73.52296|
|1675a629-a010-44b...|           gaid|2020-03-18| 42.103893|-76.799164|
|0fe7a0b7-028e-459...|           gaid|2020-03-18|  43.18203| -77.67202|
+--------------------+---------------+----------+----------+----------+

Why did the hour, minute, and second information disappear?

If I try TimestampType instead of DateType:

schema = StructType([
        StructField("identifier", StringType(), True),
        StructField("identifier_type", StringType(), True),
        StructField("timestamp", TimestampType(), True),
        StructField("device_lat", FloatType(), True),
        StructField("device_lon", FloatType(), True)])

myTable = spark.read.format("csv").schema(schema).load('NY_data/f0.csv') 
myTable = myTable[myTable['device_lat']>0]
sqlContext.registerDataFrameAsTable(myTable, "myTable")

this is what I get:

myTable.show(3)
+----------+---------------+---------+----------+----------+
|identifier|identifier_type|timestamp|device_lat|device_lon|
+----------+---------------+---------+----------+----------+
+----------+---------------+---------+----------+----------+

In pandas, the dtypes of the columns are:

df.dtypes
identifier          object
identifier_type     object
timestamp           object
device_lat         float64
device_lon         float64
dtype: object

1 Answer

#1 · Posted on 2024-09-27 07:18:26

Purely a guess, but I think you may need the TimestampType type rather than DateType.

The documentation for DateType only mentions year/month/day:

A date type, supporting "0001-01-01" through "9999-12-31". Please use the singleton DataTypes.DateType.

Internally, this is represented as the number of days from epoch (1970-01-01 00:00:00 UTC).

According to the PySpark docs, you can specify the timestamp format when reading with spark.read:

timestampFormat – sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type. If None is set, it uses the default value, yyyy-MM-dd'T'HH:mm:ss.SSSXXX.

The default value looks like the ISO standard, so if your CSV file has a different timestamp format it won't work without explicitly setting the correct format value.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader

So if your CSV's timestamp values differ from the default ISO 8601 format (e.g. 2020-03-22T21:51:29Z), you need to supply a matching java.text.SimpleDateFormat pattern. Otherwise the timestamps fail to parse, Spark treats those records as malformed and returns null fields, and your device_lat > 0 filter then drops every row — which is likely why the table above came back empty. The date/time pattern letters are listed in the Java docs:

For a CSV value like 2020-01-19 19:30:30 UTC, the format string would be: yyyy-MM-dd HH:mm:ss z (note the capitalization — MM is months while mm is minutes, and HH is the 24-hour clock while hh is the 12-hour clock).

https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html
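As a quick sanity check outside of Spark, the Java pattern above maps roughly onto Python's strptime directives (yyyy → %Y, MM → %m, HH → %H, mm → %M, ss → %S, z → %Z). This is only an analogy for testing a sample value by hand — Spark does not use Python's parser:

```python
from datetime import datetime

# Java "yyyy-MM-dd HH:mm:ss z" ≈ Python "%Y-%m-%d %H:%M:%S %Z"
sample = "2020-03-19 03:03:00 UTC"
parsed = datetime.strptime(sample, "%Y-%m-%d %H:%M:%S %Z")
print(parsed.hour, parsed.minute, parsed.second)  # 3 3 0
```

If strptime raises a ValueError here, the equivalent Java pattern is probably also wrong for your data.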
