PySpark: How to extract the hour from a timestamp

df
+------------------------------------+-----------------------+
|identifier                          |timestamp              |
+------------------------------------+-----------------------+
|86311425-0890-40a5-8950-54cbaaa60815|2020-03-18 14:41:55 UTC|
|38e121a8-f21f-4d10-bb69-26eb045175b5|2020-03-13 15:19:21 UTC|
|1a69c9b0-283b-4b6d-89ac-66f987280c66|2020-03-16 12:59:51 UTC|
|c7b5c53f-bf40-498f-8302-4b3329322bc9|2020-03-18 22:05:06 UTC|
|0d3d807b-9b3a-466e-907c-c22402240730|2020-03-17 18:40:03 UTC|
+------------------------------------+-----------------------+

tmp.printSchema()
root
 |-- identifier: string (nullable = true)
 |-- timestamp: string (nullable = true)

+--------------------+--------------------+----+
|          identifier|           timestamp|hour|
+--------------------+--------------------+----+
|321869c3-71e5-41d...|2020-03-19 03:34:...|null|
|226b8d50-2c6a-471...|2020-03-19 02:59:...|null|
|47818b7c-34b5-43c...|2020-03-19 01:41:...|null|
|f5ca5599-7252-49d...|2020-03-19 04:25:...|null|
|add2ae24-aa7b-4d3...|2020-03-19 01:50:...|null|
+--------------------+--------------------+----+

+--------------------+--------------------+-------------------+
|          identifier|           timestamp|hour               |
+--------------------+--------------------+-------------------+
|321869c3-71e5-41d...|2020-03-19 03:00:...|2020-03-19 03:00:00|
|226b8d50-2c6a-471...|2020-03-19 02:59:...|2020-03-19 02:00:00|
|47818b7c-34b5-43c...|2020-03-19 01:41:...|2020-03-19 01:00:00|
|f5ca5599-7252-49d...|2020-03-19 04:25:...|2020-03-19 04:00:00|
|add2ae24-aa7b-4d3...|2020-03-19 01:50:...|2020-03-19 01:00:00|
+--------------------+--------------------+-------------------+

3 Answers

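User

#1 · Edited 2024-10-03 00:20:52

You should use the built-in PySpark function date_trunc to truncate the timestamp to the hour. It can also truncate to day, month, year, and so on.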

from pyspark.sql import functions as F
df.withColumn("hour", F.date_trunc('hour',F.to_timestamp("timestamp","yyyy-MM-dd HH:mm:ss 'UTC'")))\
  .show(truncate=False)


+------------------------------------+-----------------------+-------------------+
|identifier                          |timestamp              |hour               |
+------------------------------------+-----------------------+-------------------+
|86311425-0890-40a5-8950-54cbaaa60815|2020-03-18 14:41:55 UTC|2020-03-18 14:00:00|
|38e121a8-f21f-4d10-bb69-26eb045175b5|2020-03-13 15:19:21 UTC|2020-03-13 15:00:00|
|1a69c9b0-283b-4b6d-89ac-66f987280c66|2020-03-16 12:59:51 UTC|2020-03-16 12:00:00|
|c7b5c53f-bf40-498f-8302-4b3329322bc9|2020-03-18 22:05:06 UTC|2020-03-18 22:00:00|
|0d3d807b-9b3a-466e-907c-c22402240730|2020-03-17 18:40:03 UTC|2020-03-17 18:00:00|
+------------------------------------+-----------------------+-------------------+
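Note that date_trunc keeps a full timestamp with the minutes and seconds zeroed, which is not the same thing as the integer hour. The plain-Python sketch below (no Spark involved) illustrates the difference on one of the sample values. The names SAMPLE_FORMAT, parse_sample and truncate_to_hour are made up for this illustration.

```python
from datetime import datetime

# Format of the sample strings, e.g. "2020-03-18 14:41:55 UTC".
# The trailing " UTC" is matched as literal text, not parsed as a time zone.
SAMPLE_FORMAT = "%Y-%m-%d %H:%M:%S UTC"


def parse_sample(value: str) -> datetime:
    """Parse one of the sample timestamp strings into a naive datetime."""
    return datetime.strptime(value, SAMPLE_FORMAT)


def truncate_to_hour(ts: datetime) -> datetime:
    """Zero out minutes, seconds and microseconds (truncation to the hour)."""
    return ts.replace(minute=0, second=0, microsecond=0)


ts = parse_sample("2020-03-18 14:41:55 UTC")
print(truncate_to_hour(ts))  # 2020-03-18 14:00:00
print(ts.hour)               # 14
```

If the integer is what you are after in PySpark, applying F.hour to the same F.to_timestamp(...) expression should return it instead of a truncated timestamp.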

User

#2 · Edited 2024-10-03 00:20:52

Use the from_unixtime and unix_timestamp functions. The hour function, by contrast, extracts the hour value from a timestamp or from a string in yyyy-MM-dd HH:mm:ss format.

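from pyspark.sql.functions import col, from_unixtime, unix_timestamp
#sample data
df.show(truncate=False)
#+----------+-----------------------+
#|identifier|timestamp              |
#+----------+-----------------------+
#|1         |2020-03-18 14:41:55 UTC|
#+----------+-----------------------+
#DataFrame[identifier: string, timestamp: string]

# HH is the 24-hour field (hh is the 12-hour clock); 'UTC' matches the literal suffix
df.withColumn("hour", from_unixtime(unix_timestamp(col("timestamp"),"yyyy-MM-dd HH:mm:ss 'UTC'"),"yyyy-MM-dd HH:00:00")).show()
#+----------+--------------------+-------------------+
#|identifier|           timestamp|               hour|
#+----------+--------------------+-------------------+
#|         1|2020-03-18 14:41:...|2020-03-18 14:00:00|
#+----------+--------------------+-------------------+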
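A detail worth checking in any pattern passed to unix_timestamp or from_unixtime: in Spark's datetime patterns, HH is the hour of day (0-23) while hh is the 12-hour clock hour (1-12). How Spark reacts to a mismatch can vary by version and parser settings, so the sketch below only illustrates the distinction using Python's analogous %H and %I directives. It is not Spark code, and the variable names are made up for this illustration.

```python
from datetime import datetime

value = "2020-03-18 14:41:55"

# %H is the 24-hour field and accepts 00-23.
parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
print(parsed.hour)  # 14

# %I is the 12-hour field and accepts 01-12 only, so "14" is rejected.
try:
    datetime.strptime(value, "%Y-%m-%d %I:%M:%S")
    twelve_hour_ok = True
except ValueError:
    twelve_hour_ok = False
print(twelve_hour_ok)  # False
```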

Usage of hour function:

#on string type 
spark.sql("select hour('2020-03-04 12:34:34')").show()
#on timestamp type
spark.sql("select hour(timestamp('2020-03-04 12:34:34'))").show()
#+---+
#|_c0|
#+---+
#| 12|
#+---+

User

#3 · Edited 2024-10-03 00:20:52

Why not just use a custom UDF?

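import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# Assumes "datetime" is a timestamp-typed column; x.hour fails on plain strings.
# Nulls reach the UDF as None, so guard against them.
hour = F.udf(lambda x: x.hour if x is not None else None, IntegerType())
hours = df.withColumn("hour", hour("datetime"))

hours.limit(5).toPandas()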

This adds an integer hour column to the DataFrame.
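The UDF above reads .hour from a datetime object, but the column in the question is a string ending in " UTC". A minimal sketch of a function that parses such strings itself is shown below in plain Python. The name hour_from_string is hypothetical, and the format is an assumption based on the sample data.

```python
from datetime import datetime
from typing import Optional


def hour_from_string(value: Optional[str]) -> Optional[int]:
    """Return the hour (0-23) of a 'yyyy-MM-dd HH:mm:ss UTC' string, else None."""
    if value is None:
        return None
    try:
        # " UTC" is matched as literal text at the end of the string.
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S UTC").hour
    except ValueError:
        # Input that does not match the assumed format maps to None (null).
        return None


print(hour_from_string("2020-03-18 22:05:06 UTC"))  # 22
print(hour_from_string("not a timestamp"))          # None
print(hour_from_string(None))                       # None
```

Wrapped as F.udf(hour_from_string, IntegerType()), it could be applied to the "timestamp" column directly. Built-in functions are generally faster than Python UDFs, so this is mainly useful when the parsing logic cannot be expressed with them.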
