I have this dataset:
+---------+------+------------------+--------------------+-------------+
|    LCLid|season|       sum(KWH/hh)|         avg(KWH/hh)|Acorn_grouped|
+---------+------+------------------+--------------------+-------------+
|MAC000023|autumn|4067.4269999000007| 0.31550007755972703|            4|
|MAC000128|spring| 961.2639999999982| 0.10876487893188484|            2|
|MAC000012|summer| 121.7360000000022|0.027548314098212765|            0|
|MAC000053|autumn| 2289.498000000006| 0.17883908764255632|            2|
|MAC000121|spring| 1893.635999900008| 0.21543071671217384|            1|
+---------+------+------------------+--------------------+-------------+

For every consumer ID we have the total and the average consumption for each season, and the Acorn grouping is fixed per consumer.
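In case it helps to reproduce the setup, the table above could be built as a small DataFrame like this (a sketch assuming a local SparkSession; the rows are exactly the ones shown):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the five sample rows from the table above; the schema is inferred from the tuples
df = spark.createDataFrame(
    [
        ('MAC000023', 'autumn', 4067.4269999000007, 0.31550007755972703, 4),
        ('MAC000128', 'spring', 961.2639999999982, 0.10876487893188484, 2),
        ('MAC000012', 'summer', 121.7360000000022, 0.027548314098212765, 0),
        ('MAC000053', 'autumn', 2289.498000000006, 0.17883908764255632, 2),
        ('MAC000121', 'spring', 1893.635999900008, 0.21543071671217384, 1),
    ],
    ['LCLid', 'season', 'sum(KWH/hh)', 'avg(KWH/hh)', 'Acorn_grouped'],
)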
I want to aggregate by ID, extracting these new features at the same time and rounding the numbers, so that I end up with data like this:
+---------+-------------+-------------------+------------------+------------------+------------------
|    LCLid|Acorn_grouped|autumn_avg(KWH/hh) |autumn_sum(KWH/hh)|autumn_max(KWH/hh)|spring_avg(KWH/hh)
+---------+-------------+-------------------+------------------+------------------+------------------
|MAC000023|            4|                   |                  |                  |
|MAC000128|            2|                   |                  |                  |
|MAC000012|            0|                   |                  |                  |
|MAC000053|            2|                   |                  |                  |
|MAC000121|            1|                   |                  |                  |
You can do a pivot:
import pyspark.sql.functions as F

result = df.groupBy('LCLid', 'Acorn_grouped') \
    .pivot('season') \
    .agg(
        # each (LCLid, season) pair has a single row, so first() just picks that row's value
        F.round(F.first('sum(KWH/hh)')).alias('sum(KWH/hh)'),
        F.round(F.first('avg(KWH/hh)')).alias('avg(KWH/hh)')
    ).fillna(0)  # replace nulls with zero -
                 # you can skip this if you want to keep nulls

https://stackoverflow.com/questions/65532980
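Note that F.round without a scale argument rounds to the nearest integer, which would turn the small avg(KWH/hh) values (all below 1 in the sample) into 0. If you want to keep some precision, round accepts a scale; a sketch of the same pivot keeping two decimal places:

import pyspark.sql.functions as F

# same pivot as above, but round to two decimal places instead of to integers
result = df.groupBy('LCLid', 'Acorn_grouped') \
    .pivot('season') \
    .agg(
        F.round(F.first('sum(KWH/hh)'), 2).alias('sum(KWH/hh)'),
        F.round(F.first('avg(KWH/hh)'), 2).alias('avg(KWH/hh)')
    ).fillna(0)

The pivoted columns come out named like autumn_sum(KWH/hh) and spring_avg(KWH/hh), matching the layout asked for above.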