Q: Add the current month's cumulative sum to a dataset

Stack Overflow user

Asked on 2021-01-04 01:19:02

Answers: 1 | Views: 39 | Followers: 0 | Votes: 0

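import pyspark.sql.functions as F
from pyspark.sql import Window

# Keep the first two columns and replace the selected count columns with running totals by month.
df.select(
    *df.columns[:2],
    *[F.sum(F.col(i)).over(Window.orderBy('Month')).alias(i) for i in df.columns[2:8]]
)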
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
|Month  |month_index|QA_count|BS_count|BV_count|QT_count|B  |QB_count|BT_count|
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
|2020-09|0          |3       |0       |1       |1       |2  |3       |7       |
|2020-10|1          |4       |1       |2       |2       |7  |12      |8       |
|2020-11|2          |5       |2       |3       |3       |12 |21      |9       |
|2020-12|3          |6       |3       |4       |4       |17 |30      |10      |
+-------+-----------+--------+--------+--------+--------+---+--------+--------+

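I currently have a dataset that shows the cumulative sum of each column by month, like the dataset above. I would like a row for the current month to be added automatically, even if I don't have any new data for it yet. My desired output would look like this: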

+-------+-----------+--------+--------+--------+--------+---+--------+--------+
|Month  |month_index|QA_count|BS_count|BV_count|QT_count|B  |QB_count|BT_count|
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
|2020-09|0          |3       |0       |1       |1       |2  |3       |7       |
|2020-10|1          |4       |1       |2       |2       |7  |12      |8       |
|2020-11|2          |5       |2       |3       |3       |12 |21      |9       |
|2020-12|3          |6       |3       |4       |4       |17 |30      |10      |
|2021-01|4          |6       |3       |4       |4       |17 |30      |10      |
+-------+-----------+--------+--------+--------+--------+---+--------+--------+

P.S. Once there are new counts for 2021-01, they should automatically be added to the cumulative sum.
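
To make the requested behaviour concrete, here is a minimal plain-Python sketch (not Spark code) for a single column. The function name `cumulative_with_current_month`, the dict input shape, and the raw monthly counts of 3, 1, 1, 1 (back-derived from the cumulative QA_count values above) are assumptions made for illustration only.

```python
from datetime import date
from itertools import accumulate


def cumulative_with_current_month(monthly_counts, today=None):
    """Return (month, running_total) pairs for one column.

    monthly_counts is a hypothetical dict mapping 'YYYY-MM' to that
    month's raw count. If the latest month is earlier than the current
    month, one extra row is appended that repeats the last total.
    """
    today = today or date.today()
    current = today.strftime("%Y-%m")
    # Zero-padded 'YYYY-MM' strings sort chronologically.
    months = sorted(monthly_counts)
    totals = list(accumulate(monthly_counts[m] for m in months))
    rows = list(zip(months, totals))
    if rows and rows[-1][0] < current:
        rows.append((current, rows[-1][1]))
    return rows


# Assumed raw QA_count values per month.
qa = {"2020-09": 3, "2020-10": 1, "2020-11": 1, "2020-12": 1}
rows = cumulative_with_current_month(qa, today=date(2021, 1, 4))
print(rows[-1])  # ('2021-01', 6): no new data, total carried forward

qa["2021-01"] = 2
rows = cumulative_with_current_month(qa, today=date(2021, 1, 4))
print(rows[-1])  # ('2021-01', 8): new count folded into the running total
```

Passing `today` explicitly keeps the check deterministic, which may also be convenient when testing the Spark version.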

apache-spark

pyspark

apache-spark-sql

1 answer

Stack Overflow user

Accepted answer

Posted on 2021-01-04 01:25:14

import pyspark.sql.functions as F
from pyspark.sql import Window

# Running total of each count column, ordered by month.
df2 = df.select(
    *df.columns[:2],
    *[F.sum(F.col(i)).over(Window.orderBy('Month')).alias(i) for i in df.columns[2:8]]
)

# Check whether the latest month in the data is the current month.
# If it is not, append a copy of the last row relabelled as the current month.
if df2.select('Month').orderBy(F.desc('Month')).head(1)[0] != df2.select(F.date_format(F.current_date(), 'yyyy-MM')).head(1)[0]:
    df3 = df2.union(
        df2.orderBy(F.desc('Month')).limit(1)
           .withColumn('Month', F.date_format(F.current_date(), 'yyyy-MM'))
           .withColumn('month_index', F.col('month_index')+1)
    )
else:
    df3 = df2

df3.show()
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
|  Month|month_index|QA_count|BS_count|BV_count|QT_count|  B|QB_count|BT_count|
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
|2020-09|          0|       3|       0|       1|       1|  2|       3|       7|
|2020-10|          1|       4|       1|       2|       2|  7|      12|       8|
|2020-11|          2|       5|       2|       3|       3| 12|      21|       9|
|2020-12|          3|       6|       3|       4|       4| 17|      30|      10|
|2021-01|          4|       6|       3|       4|       4| 17|      30|      10|
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
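
One caveat worth noting: the snippet above appends exactly one row, so if the latest month in the data were more than one month behind the current month, the months in between would not appear. As a rough plain-Python sketch (not part of the original answer), the hypothetical helper `missing_months` below lists every 'yyyy-MM' label after the last month up to and including the current one; each label could then become a copy of the last row with `month_index` incremented.

```python
from datetime import date


def missing_months(last_month, today=None):
    """List 'YYYY-MM' labels after last_month up to the current month.

    last_month is a 'YYYY-MM' string, e.g. the maximum of the Month column.
    Returns an empty list when last_month is already the current month.
    """
    today = today or date.today()
    year, month = map(int, last_month.split("-"))
    labels = []
    while (year, month) < (today.year, today.month):
        month += 1
        if month > 12:
            year, month = year + 1, 1
        labels.append(f"{year:04d}-{month:02d}")
    return labels


print(missing_months("2020-12", today=date(2021, 1, 4)))  # ['2021-01']
print(missing_months("2020-10", today=date(2021, 1, 4)))  # ['2020-11', '2020-12', '2021-01']
```

For the data in the question only one label is returned, which matches the single row added by the answer.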

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/65552733
