I have this data:
from datetime import datetime
data = [
(1, datetime(2018, 7, 25, 17, 15, 6, 390000)),
(2, datetime(2018, 7, 25, 11, 12, 49, 317000))
]
df = spark.createDataFrame(data, ['ID', 'max_ts'])
# +---+-----------------------+
# |ID |max_ts |
# +---+-----------------------+
# |1 |2018-07-25 17:15:06.39 |
# |2 |2018-07-25 11:12:49.317|
# +---+-----------------------+
I want to create a column ms holding the fractional-seconds part (in microseconds):
+---+-----------------------+------+
|ID |max_ts |ms |
+---+-----------------------+------+
|1 |2018-07-25 17:15:06.39 |390000|
|2 |2018-07-25 11:12:49.317|317000|
# +---+-----------------------+------+
I can do this with pandas:
df_interfax['ms_created_at'] = df_interfax['max_ts'].dt.microsecond
But how can I do this in PySpark?
Posted on 2022-10-03 18:03:48
One option:
from pyspark.sql import functions as F
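# take the fractional part of the seconds field and scale it to microseconds (yields a double)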
df = df.withColumn('ms', F.expr("date_part('s', max_ts) % 1 * pow(10, 6)"))
df.show(truncate=0)
# +---+-----------------------+--------+
# |ID |max_ts |ms |
# +---+-----------------------+--------+
# |1 |2018-07-25 17:15:06.39 |390000.0|
# |2 |2018-07-25 11:12:49.317|317000.0|
# +---+-----------------------+--------+
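This yields a double (note the 390000.0 above). If an integer is needed to match the desired output exactly, a cast can be added; a minimal sketch, reusing the same expression:
df = df.withColumn('ms', F.expr("CAST(date_part('s', max_ts) % 1 * pow(10, 6) AS INT)"))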
df = df.withColumn('ms', F.expr("unix_micros(max_ts) - unix_micros(date_trunc('second', max_ts))"))
df.show(truncate=0)
# +---+-----------------------+------+
# |ID |max_ts |ms |
# +---+-----------------------+------+
# |1 |2018-07-25 17:15:06.39 |390000|
# |2 |2018-07-25 11:12:49.317|317000|
# +---+-----------------------+------+
https://stackoverflow.com/questions/73938937
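For completeness, the pandas one-liner also translates almost directly into a Python UDF; a minimal sketch (get_micros is a name introduced here, and the built-in expressions above are generally faster than a UDF):
from pyspark.sql import functions as F
get_micros = F.udf(lambda ts: ts.microsecond if ts is not None else None, 'int')  # mirrors pandas' dt.microsecond
df = df.withColumn('ms', get_micros('max_ts'))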