I have a .csv table, shown below:

import pandas as pd
df = pd.read_csv('myFile.csv')
df.head(3)
identifier identifier_type timestamp device_lat device_lon
0 68d62a1b-b928-4225-b445-9607415905b3 gaid 2020-03-19 03:03:00 UTC 44.808169 -73.522956
1 1675a629-a010-44b6-98a9-72d04793821f gaid 2020-03-18 21:15:42 UTC 42.103894 -76.799164
2 0fe7a0b7-028e-459e-b5d8-b59d31800b8e gaid 2020-03-18 23:39:54 UTC 43.182028 -77.672017
I am reading it with PySpark:
from pyspark.sql.types import (StructType, StructField, StringType,
                               DateType, FloatType)

schema = StructType([
    StructField("identifier", StringType(), True),
    StructField("identifier_type", StringType(), True),
    StructField("timestamp", DateType(), True),
    StructField("device_lat", FloatType(), True),
    StructField("device_lon", FloatType(), True)])
myTable = spark.read.format("csv").schema(schema).load('NY_data/f0.csv')
myTable = myTable[myTable['device_lat']>0]
myTable.show(3)
+--------------------+---------------+----------+----------+----------+
| identifier|identifier_type| timestamp|device_lat|device_lon|
+--------------------+---------------+----------+----------+----------+
|68d62a1b-b928-422...| gaid|2020-03-19| 44.80817| -73.52296|
|1675a629-a010-44b...| gaid|2020-03-18| 42.103893|-76.799164|
|0fe7a0b7-028e-459...| gaid|2020-03-18| 43.18203| -77.67202|
+--------------------+---------------+----------+----------+----------+
Why did the minute, hour, and second information disappear? Here is what happens if I try TimestampType instead of DateType:
from pyspark.sql.types import TimestampType

schema = StructType([
    StructField("identifier", StringType(), True),
    StructField("identifier_type", StringType(), True),
    StructField("timestamp", TimestampType(), True),
    StructField("device_lat", FloatType(), True),
    StructField("device_lon", FloatType(), True)])
myTable = spark.read.format("csv").schema(schema).load('NY_data/f0.csv')
myTable = myTable[myTable['device_lat']>0]
sqlContext.registerDataFrameAsTable(myTable, "myTable")
This is what I get:
myTable.show(3)
+----------+---------------+---------+----------+----------+
|identifier|identifier_type|timestamp|device_lat|device_lon|
+----------+---------------+---------+----------+----------+
+----------+---------------+---------+----------+----------+
The dtypes of the pandas columns are:
df.dtypes
identifier object
identifier_type object
timestamp object
device_lat float64
device_lon float64
dtype: object
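A quick pandas check shows the time-of-day information is present in the file: pandas reads the column as plain strings (object dtype), so nothing is lost at the file level. This is a minimal sketch; the inline sample row stands in for myFile.csv.

```python
import pandas as pd
from io import StringIO

# Inline sample standing in for myFile.csv (same layout as the question)
csv_text = (
    "identifier,identifier_type,timestamp,device_lat,device_lon\n"
    "68d62a1b-b928-4225-b445-9607415905b3,gaid,"
    "2020-03-19 03:03:00 UTC,44.808169,-73.522956\n"
)
df = pd.read_csv(StringIO(csv_text))

# The full timestamp string is there, hours/minutes/seconds included
print(df["timestamp"].iloc[0])   # 2020-03-19 03:03:00 UTC
# pandas did not parse it as a datetime; it stayed a string
print(df["timestamp"].dtype)     # object
```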
Purely a guess, but I think you may need TimestampType instead of DateType. The documentation for DateType only mentions month/day/year.
According to the PySpark docs, you can specify the timestamp format when using spark.read(): https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader

So if the timestamp values in your CSV differ from the default ISO 8601 format (e.g. 2020-03-22T21:51:29Z), you need to match the CSV date/time format with the corresponding java.text.SimpleDateFormat pattern. The pattern letters are listed in the Java docs: https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html

For a CSV value like 2020-01-19 19:30:30 UTC, the format string would be yyyy-MM-dd HH:mm:ss z (note the capitalization: MM is month and HH is the 24-hour clock, while mm means minutes and hh the 12-hour clock).