PySpark: how do I read a CSV file with timestamps?

Posted on 2024-09-27 07:18:26


I have a .csv table that looks like this:

import pandas as pd
df = pd.read_csv('myFile.csv')
df.head(3)

            identifier                 identifier_type         timestamp           device_lat   device_lon
0   68d62a1b-b928-4225-b445-9607415905b3    gaid         2020-03-19 03:03:00 UTC    44.808169   -73.522956
1   1675a629-a010-44b6-98a9-72d04793821f    gaid         2020-03-18 21:15:42 UTC    42.103894   -76.799164
2   0fe7a0b7-028e-459e-b5d8-b59d31800b8e    gaid         2020-03-18 23:39:54 UTC    43.182028   -77.672017

I am reading it with PySpark:

from pyspark.sql.types import StructType, StructField, StringType, DateType, FloatType

schema = StructType([
        StructField("identifier", StringType(), True),
        StructField("identifier_type", StringType(), True),
        StructField("timestamp", DateType(), True),
        StructField("device_lat", FloatType(), True),
        StructField("device_lon", FloatType(), True)])

myTable = spark.read.format("csv").schema(schema).load('NY_data/f0.csv') 
myTable = myTable[myTable['device_lat']>0]
myTable.show(3)

+--------------------+---------------+----------+----------+----------+
|          identifier|identifier_type| timestamp|device_lat|device_lon|
+--------------------+---------------+----------+----------+----------+
|68d62a1b-b928-422...|           gaid|2020-03-19|  44.80817| -73.52296|
|1675a629-a010-44b...|           gaid|2020-03-18| 42.103893|-76.799164|
|0fe7a0b7-028e-459...|           gaid|2020-03-18|  43.18203| -77.67202|
+--------------------+---------------+----------+----------+----------+

Why did the hour, minute, and second information disappear?

If I try TimestampType instead of DateType:

schema = StructType([
        StructField("identifier", StringType(), True),
        StructField("identifier_type", StringType(), True),
        StructField("timestamp", TimestampType(), True),
        StructField("device_lat", FloatType(), True),
        StructField("device_lon", FloatType(), True)])

myTable = spark.read.format("csv").schema(schema).load('NY_data/f0.csv') 
myTable = myTable[myTable['device_lat']>0]
sqlContext.registerDataFrameAsTable(myTable, "myTable")

this is what I get:

myTable.show(3)
+----------+---------------+---------+----------+----------+
|identifier|identifier_type|timestamp|device_lat|device_lon|
+----------+---------------+---------+----------+----------+
+----------+---------------+---------+----------+----------+

In pandas, the dtypes of the columns are:

df.dtypes
identifier          object
identifier_type     object
timestamp           object
device_lat         float64
device_lon         float64
dtype: object

1 Answer

#1 · Posted on 2024-09-27 07:18:26

Purely a guess, but I think you may need the TimestampType type rather than DateType.

The documentation for DateType only mentions year/month/day:

A date type, supporting "0001-01-01" through "9999-12-31". Please use the singleton DataTypes.DateType.

Internally, this is represented as the number of days from epoch (1970-01-01 00:00:00 UTC).

According to the PySpark docs, you can specify the timestamp format when reading with spark.read:

timestampFormat – sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type. If None is set, it uses the default value, yyyy-MM-dd'T'HH:mm:ss.SSSXXX.

The default value looks like the ISO standard, so if your CSV file has a different timestamp format it won't work without explicitly setting the correct format value.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader

So if your CSV's timestamp values differ from the default ISO 8601 format (e.g. 2020-03-22T21:51:29Z), you need to supply a matching java.text.SimpleDateFormat pattern. Otherwise the timestamps fail to parse, Spark treats those records as malformed and returns null fields, and your device_lat > 0 filter then drops every row — which is likely why the table above came back empty. The date/time pattern letters are listed in the Java docs:

For a CSV value like 2020-01-19 19:30:30 UTC, the format string would be: yyyy-MM-dd HH:mm:ss z (note the capitalization — MM is months while mm is minutes, and HH is the 24-hour clock while hh is the 12-hour clock).

https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html
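As a quick sanity check outside of Spark, the Java pattern above maps roughly onto Python's strptime directives (yyyy → %Y, MM → %m, HH → %H, mm → %M, ss → %S, z → %Z). This is only an analogy for testing a sample value by hand — Spark does not use Python's parser:

```python
from datetime import datetime

# Java "yyyy-MM-dd HH:mm:ss z" ≈ Python "%Y-%m-%d %H:%M:%S %Z"
sample = "2020-03-19 03:03:00 UTC"
parsed = datetime.strptime(sample, "%Y-%m-%d %H:%M:%S %Z")
print(parsed.hour, parsed.minute, parsed.second)  # 3 3 0
```

If strptime raises a ValueError here, the equivalent Java pattern is probably also wrong for your data.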
