I have this DataFrame:
+---------+------+-----+-------------+-----+
| LCLid|KWH/hh|Acorn|Acorn_grouped|Month|
+---------+------+-----+-------------+-----+
|MAC000002| 0.0| 0| 0| 10|
|MAC000002| 0.0| 0| 0| 10|
|MAC000002| 0.0| 0| 0| 10|
I want to group by LCLid and Month and aggregate only the consumption, so that the result looks like this:
+---------+-----+------------------+----------+------------------+
| LCLid|Month| sum(KWH/hh)|Acorn |Acorn_grouped |
+---------+-----+------------------+----------+------------------+
|MAC000003| 10| 904.9270009999999| 0 | 0 |
|MAC000022| 2|1672.5559999999978| 1 | 0 |
|MAC000019| 4| 368.4720001000007| 1 | 1 |
|MAC000022| 9|449.07699989999975| 0 | 1 |
|MAC000024| 8| 481.7160003000004| 1 | 0 |
But all I have managed to do is use the code below:
dataset = dataset.groupBy("LCLid", "Month").sum()
which gives me this result:
+---------+-----+------------------+----------+------------------+----------+
| LCLid|Month| sum(KWH/hh)|sum(Acorn)|sum(Acorn_grouped)|sum(Month)|
+---------+-----+------------------+----------+------------------+----------+
|MAC000003| 10| 904.9270009999999| 2978| 2978| 29780|
|MAC000022| 2|1672.5559999999978| 12090| 4030| 8060|
|MAC000019| 4| 368.4720001000007| 20174| 2882| 11528|
|MAC000022| 9|449.07699989999975| 8646| 2882| 25938|
The problem is that the sum function is also computed on Acorn and Acorn_grouped. Do you know how I can apply the aggregation to KWH/hh only?
Posted on 2020-12-22 22:13:31
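It depends on how you want to handle the other two columns. If you don't want to sum them and just want any value from each of those columns, you can do this: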
import pyspark.sql.functions as F

# Sum only KWH/hh; keep one value per group for the other two columns.
dataset = dataset.groupBy("LCLid", "Month").agg(
    F.sum("KWH/hh"),
    F.first("Acorn").alias("Acorn"),
    F.first("Acorn_grouped").alias("Acorn_grouped")
)
https://stackoverflow.com/questions/65410157