
Q: Write a loop that creates multiple new columns based on different conditions

Stack Overflow user

Asked 2022-08-09 06:02:37

1 answer · 67 views · 0 followers · 0 votes

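Suppose I have a PySpark DataFrame with the following columns:

user, score, country, risk/safe, payment_id

I have a list of thresholds: 10, 20, 30

Now I want to create new columns for each threshold:

% of risky payments with a score above the threshold, out of all payments (risky and safe)
% of distinct risky users with at least one score above the threshold, out of all users (risky and safe)

Both should be grouped by country.

The result should look like this: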

Country | % payments thresh 10 | % users thresh 10 | % payments thresh 20 ... 
A
B
C
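To make the two definitions above concrete, here is a small plain-Python sketch that computes both shares per country for a single threshold. The toy rows and the helper name `risk_shares` are made up for illustration, and "above the threshold" is assumed to mean `>=`, matching the comparison used in the code in this thread.

```python
from collections import defaultdict

# Toy payments (hypothetical data): (user, score, country, label, payment_id)
payments = [
    ("u1", 25, "A", "risk", 1),
    ("u1", 5, "A", "safe", 2),
    ("u2", 15, "A", "risk", 3),
    ("u3", 35, "A", "safe", 4),
    ("u4", 12, "B", "risk", 5),
    ("u5", 8, "B", "safe", 6),
]


def risk_shares(payments, thresh):
    """Return {country: (payment_share, user_share)} for one threshold."""
    total_pay = defaultdict(int)    # all payments per country
    risky_pay = defaultdict(int)    # risky payments with score >= thresh
    all_users = defaultdict(set)    # all distinct users per country
    risky_users = defaultdict(set)  # users with at least one qualifying payment
    for user, score, country, label, _payment_id in payments:
        total_pay[country] += 1
        all_users[country].add(user)
        if label == "risk" and score >= thresh:
            risky_pay[country] += 1
            risky_users[country].add(user)
    return {
        c: (risky_pay[c] / total_pay[c], len(risky_users[c]) / len(all_users[c]))
        for c in total_pay
    }


for thresh in (10, 20, 30):
    print(thresh, risk_shares(payments, thresh))
```

The values are fractions between 0 and 1; multiply by 100 if literal percentages are wanted. Each threshold contributes one pair of numbers per country, which is what the wide result layout spreads across columns.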

I was able to make it work with an external for loop, but I want everything in a single DataFrame.

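from pyspark.sql import functions as F

thresholds = [10, 20, 30]

# One aggregated DataFrame per threshold (the external loop I want to avoid)
results = {}
for thresh in thresholds:
    results[thresh] = (
        df
        # 'score' must be selected because the aggregation references it
        .select('country', 'risk/safe', 'user', 'score', 'payment_id')
        .where(F.col('risk/safe') == 'risk')
        .groupBy('country')
        .agg(
            # alias the whole ratio, not only the count
            (F.sum(F.when(F.col('score') >= thresh, 1)) / F.count('country'))
            .alias('% payments')
        )
    )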

Tags: python, dataframe, apache-spark, pyspark, apache-spark-sql

1 Answer

Stack Overflow user

Answered 2022-08-09 06:46:14

Use list comprehensions inside agg().

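from pyspark.sql import functions as func

# fraction of payments with score >= thresh, one column per threshold
pay_aggs = [(func.sum((func.col('score')>=thresh).cast('int'))/func.count('country')).alias('% pay '+str(thresh)) for thresh in thresholds]
# fraction of distinct users with at least one score >= thresh, one column per threshold
user_aggs = [(func.countDistinct(func.when(func.col('score')>=thresh, func.col('user')))/func.countDistinct('user')).alias('% user '+str(thresh)) for thresh in thresholds]

# 'score' must be selected because the aggregations reference it
df. \
    select('country', 'risk/safe', 'user', 'score', 'payment_id'). \
    where(func.col('risk/safe') == 'risk'). \
    groupBy('country'). \
    agg(*pay_aggs, *user_aggs)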

The pay_aggs list produces the following aggregations (you can print the list to check):

# [Column<'(sum(CAST((score >= 10) AS INT)) / count(country)) AS `% pay 10`'>,
#  Column<'(sum(CAST((score >= 20) AS INT)) / count(country)) AS `% pay 20`'>,
#  Column<'(sum(CAST((score >= 30) AS INT)) / count(country)) AS `% pay 30`'>]

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/73287146
