I have data like this:
data = [("110125","James","2021-12-05","NY","PA",60000),("110125","James","2021-12-07","NY","PA",3000),("110125","James","2021-12-07","NY","AT",3000),
("5225","Michael","2021-12-25","LA","AT",60000),("5225","Michael","2021-12-17","LA","PA",15000),("5225","Michael","2021-12-17","LA","PA",65000)]
columns = ["id","Name","Date","Local","Office","salary"]
df = spark.createDataFrame(data = data, schema = columns)

Input:
+--------+--------+----------+-----+------+------+
| id |Name |Date |Local|Office|salary|
+--------+--------+----------+-----+------+------+
| 110125| James |2021-12-05|NY |PA | 60000|
| 110125| James |2021-12-07|NY |PA | 3000 |
| 110125| James |2021-12-07|NY |AT | 3000 |
| 5225 | Michael|2021-12-25|LA |AT | 60000|
| 5225 | Michael|2021-12-17|LA |PA | 15000|
| 5225 | Michael|2021-12-17|LA |PA | 65000|
+--------+--------+----------+-----+------+------+

I want a new column 'Check' that is True when any of the four values (Date, Local, Office, salary) differs from the previous row's values for the same id and Name.
Output:
+--------+--------+----------+-----+------+------+-----+
| id |Name |Date |Local|Office|salary|Check|
+--------+--------+----------+-----+------+------+-----+
| 110125| James |2021-12-05|NY |PA | 60000| |
| 110125| James |2021-12-07|NY |PA | 3000 | True|
| 110125| James |2021-12-07|NY |AT | 3000 | True|
| 5225 | Michael|2021-12-25|LA |AT | 60000| |
| 5225 | Michael|2021-12-17|LA |PA | 15000| True|
| 5225 | Michael|2021-12-17|LA |PA | 65000| True|
+--------+--------+----------+-----+------+------+-----+

My PySpark code:
df.groupby("ID", "Name").withColumn("Check", F.when(
    (F.col('Local') == F.lag('Local')) | (F.col('Office') == F.lag('Office')) |
    (F.col('Date') == F.lag('Date')) | (F.col('salary') == F.lag('salary')),
    False).otherwise(True))

AttributeError: 'GroupedData' object has no attribute 'withColumn'
Posted on 2022-05-11 13:45:55
You want to use a window here: groupby returns a GroupedData object for aggregation, while lag needs a window specification to compare each row against the previous one within the same partition.
from pyspark.sql import Window, functions as F
w = Window.partitionBy("id", "name").orderBy("Date")
df = df.withColumn(
    "Check",
    ~((F.col('Local') == F.lag('Local').over(w))
      & (F.col('Office') == F.lag('Office').over(w))
      & (F.col('Date') == F.lag('Date').over(w))
      & (F.col('salary') == F.lag('salary').over(w)))
)
df.show()
#+------+-------+----------+-----+------+------+-----+
#| id| Name| Date|Local|Office|salary|Check|
#+------+-------+----------+-----+------+------+-----+
#|110125| James|2021-12-05| NY| PA| 60000| null|
#|110125| James|2021-12-07| NY| PA| 3000| true|
#|110125| James|2021-12-07| NY| AT| 3000| true|
#| 5225|Michael|2021-12-17| LA| PA| 15000| null|
#| 5225|Michael|2021-12-17| LA| PA| 65000| true|
#| 5225|Michael|2021-12-25| LA| AT| 60000| true|
#+------+-------+----------+-----+------+------+-----+

https://stackoverflow.com/questions/72202016
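To see what `F.lag(...).over(w)` is comparing, the same per-partition logic can be sketched in plain Python without Spark. This is only an illustration of the window semantics, assuming rows are grouped by (id, Name) and ordered by Date; the helper name `check_changes` is hypothetical:

```python
from itertools import groupby
from operator import itemgetter

def check_changes(rows):
    # rows: list of dicts with keys id, Name, Date, Local, Office, salary.
    # Mirrors Window.partitionBy("id", "Name").orderBy("Date"): within each
    # (id, Name) partition, compare the tuple (Date, Local, Office, salary)
    # of each row against the previous row's tuple.
    out = []
    keyfn = itemgetter("id", "Name")
    for _, group in groupby(sorted(rows, key=keyfn), key=keyfn):
        prev = None
        for r in sorted(group, key=itemgetter("Date")):
            fields = (r["Date"], r["Local"], r["Office"], r["salary"])
            # First row of a partition has no previous row, so the
            # result is None (like the null produced by lag in Spark).
            out.append(None if prev is None else fields != prev)
            prev = fields
    return out

rows = [
    {"id": "110125", "Name": "James", "Date": "2021-12-05",
     "Local": "NY", "Office": "PA", "salary": 60000},
    {"id": "110125", "Name": "James", "Date": "2021-12-07",
     "Local": "NY", "Office": "PA", "salary": 3000},
    {"id": "110125", "Name": "James", "Date": "2021-12-07",
     "Local": "NY", "Office": "AT", "salary": 3000},
]
print(check_changes(rows))  # [None, True, True]
```

Note that `fields != prev` is the negation of "all four columns equal", which is what the `~(... & ... & ...)` expression in the answer computes, except that in Spark a null in any compared column also yields null rather than a boolean.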