
PySpark: boolean previous value with a condition

Stack Overflow user
Asked on 2022-05-11 13:33:22
1 answer · 32 views · 1 vote

I have data like this:

data = [("110125","James","2021-12-05","NY","PA",60000),
        ("110125","James","2021-12-07","NY","PA",3000),
        ("110125","James","2021-12-07","NY","AT",3000),
        ("5225","Michael","2021-12-25","LA","AT",60000),
        ("5225","Michael","2021-12-17","LA","PA",15000),
        ("5225","Michael","2021-12-17","LA","PA",65000)]
columns = ["id","Name","Date","Local","Office","salary"]
df = spark.createDataFrame(data=data, schema=columns)

Input:

+------+-------+----------+-----+------+------+
|    id|   Name|      Date|Local|Office|salary|
+------+-------+----------+-----+------+------+
|110125|  James|2021-12-05|   NY|    PA| 60000|
|110125|  James|2021-12-07|   NY|    PA|  3000|
|110125|  James|2021-12-07|   NY|    AT|  3000|
|  5225|Michael|2021-12-25|   LA|    AT| 60000|
|  5225|Michael|2021-12-17|   LA|    PA| 15000|
|  5225|Michael|2021-12-17|   LA|    PA| 65000|
+------+-------+----------+-----+------+------+

I want a new column 'Check' that is True when any of the four values (Date, Local, Office, salary) differs from the previous row's value for the same id and Name.

Output:

+--------+--------+----------+-----+------+------+-----+
| id     |Name    |Date      |Local|Office|salary|Check|
+--------+--------+----------+-----+------+------+-----+
|  110125| James  |2021-12-05|NY   |PA    | 60000|     |
|  110125| James  |2021-12-07|NY   |PA    | 3000 | True|
|  110125| James  |2021-12-07|NY   |AT    | 3000 | True|
|  5225  | Michael|2021-12-25|LA   |AT    | 60000|     |
|  5225  | Michael|2021-12-17|LA   |PA    | 15000| True|
|  5225  | Michael|2021-12-17|LA   |PA    | 65000| True|
+--------+--------+----------+-----+------+------+-----+
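The intended row-over-row comparison can be sketched in plain Python, independent of Spark (a hypothetical `add_check` helper, for illustration only): within each (id, Name) group, ordered by Date, a row is flagged when its (Date, Local, Office, salary) tuple differs from the previous row's.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    ("110125", "James", "2021-12-05", "NY", "PA", 60000),
    ("110125", "James", "2021-12-07", "NY", "PA", 3000),
    ("110125", "James", "2021-12-07", "NY", "AT", 3000),
    ("5225", "Michael", "2021-12-25", "LA", "AT", 60000),
    ("5225", "Michael", "2021-12-17", "LA", "PA", 15000),
    ("5225", "Michael", "2021-12-17", "LA", "PA", 65000),
]

def add_check(rows):
    """Append a Check flag: None for the first row of each (id, Name)
    group (there is no previous row), True/False otherwise."""
    out = []
    # Sort by (id, Name, Date) so groupby sees each group contiguously,
    # ordered by Date; Python's sort is stable, so ties keep input order.
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    for _, group in groupby(ordered, key=itemgetter(0, 1)):
        prev = None
        for r in group:
            check = None if prev is None else r[2:] != prev[2:]
            out.append(r + (check,))
            prev = r
    return out
```

Running this over the sample data yields the flags `[None, True, True, None, True, True]`, which matches the expected output above once rows are ordered by Date within each group.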

My PySpark code:

df.groupby("ID", "Name").withColumn("Check", F.when((F.col('Local') == F.lag('Local')) |(F.col('Office') == F.lag('Office'))|
                                           (F.col('Date') == F.lag('Date'))|(F.col('salary') == F.lag('salary')), False ).otherwise(True))

AttributeError: 'GroupedData' object has no attribute 'withColumn'


1 Answer

Stack Overflow user
Accepted answer
Answered on 2022-05-11 13:45:55

You want to use a window:

from pyspark.sql import Window, functions as F

w = Window.partitionBy("id", "name").orderBy("Date")

df = df.withColumn(
    "Check",
    ~((F.col('Local') == F.lag('Local').over(w))
      & (F.col('Office') == F.lag('Office').over(w))
      & (F.col('Date') == F.lag('Date').over(w))
      & (F.col('salary') == F.lag('salary').over(w))
      )
)

df.show()
#+------+-------+----------+-----+------+------+-----+
#|    id|   Name|      Date|Local|Office|salary|Check|
#+------+-------+----------+-----+------+------+-----+
#|110125|  James|2021-12-05|   NY|    PA| 60000| null|
#|110125|  James|2021-12-07|   NY|    PA|  3000| true|
#|110125|  James|2021-12-07|   NY|    AT|  3000| true|
#|  5225|Michael|2021-12-17|   LA|    PA| 15000| null|
#|  5225|Michael|2021-12-17|   LA|    PA| 65000| true|
#|  5225|Michael|2021-12-25|   LA|    AT| 60000| true|
#+------+-------+----------+-----+------+------+-----+
1 vote
Original link:

https://stackoverflow.com/questions/72202016
