文章/答案/技术大牛

发布

社区首页 >问答首页 >红移:在窗口分区上清点不同的客户

问红移:在窗口分区上清点不同的客户
EN

Stack Overflow用户

提问于 2017-12-10 06:48:30

回答 3查看 20.9K关注 0票数 14

Redshift在其窗口函数中不支持DISTINCT聚合。任何窗口函数都不支持和distinct。

我的用例:在不同的时间间隔和流量通道上统计客户。

我希望每月和YTD 唯一的客户数量为本年度，也除以流量渠道以及所有渠道的总数。由于客户可以访问不止一次，所以我只需要计算不同的客户，因此Redshift窗口聚合不会有帮助。

我可以使用count(distinct customer_id)...group by来统计不同的客户，但这只会给我提供四个需要的结果中的一个。
我不想养成这样的习惯，即对堆在一堆union all之间的每一个期望的计数运行一个完整的查询。我希望这不是唯一的解决办法。

这就是我在postgres (或甲骨文)中所写的：

select order_month
       , traffic_channel
       , count(distinct customer_id) over(partition by order_month, traffic_channel) as customers_by_channel_and_month
       , count(distinct customer_id) over(partition by traffic_channel) as ytd_customers_by_channel
       , count(distinct customer_id) over(partition by order_month) as monthly_customers_all_channels
       , count(distinct customer_id) over() as ytd_total_customers

from orders_traffic_channels
/* otc is a table of dated transactions of customers, channels, and month of order */

where to_char(order_month, 'YYYY') = '2017'

如何在Redshift中解决这个问题？

结果需要在红移群集上工作，而且这是一个简化的问题，实际需要的结果有产品类别和客户类型，这会将所需的分区数乘以。因此，一堆union all卷不是一个很好的解决方案。

sql

amazon-redshift

回答 3

Stack Overflow用户

回答已采纳

发布于 2017-12-10 06:48:30

2016年博文提出了这个问题，并提供了一个基本的解决办法，所以谢谢MarkD.Adams。奇怪的是，我在所有的网络上都找不到什么，所以我正在分享我的(测试)解决方案。

关键的洞察力是，dense_rank()按所讨论的项目排序，为相同的项目提供相同的等级，因此最高的级别也是唯一项的计数。如果您试图用下面的内容来交换我想要的每个分区，这将是一个可怕的混乱：

dense_rank() over(partition by order_month, traffic_channel order by customer_id)

因为您需要最高的排名，所以您必须对所有内容进行子查询，并从每一次排序中选择最大值。的重要之处在于将外部查询中的分区与子查询中的相应分区相匹配.

/* multigrain windowed distinct count, additional grains are one dense_rank and one max over() */
select distinct
       order_month
       , traffic_channel
       , max(tc_mth_rnk) over(partition by order_month, traffic_channel) customers_by_channel_and_month
       , max(tc_rnk) over(partition by traffic_channel)  ytd_customers_by_channel
       , max(mth_rnk) over(partition by order_month)  monthly_customers_all_channels
       , max(cust_rnk) over()  ytd_total_customers

from (
       select order_month
              , traffic_channel
              , dense_rank() over(partition by order_month, traffic_channel order by customer_id)  tc_mth_rnk
              , dense_rank() over(partition by traffic_channel order by customer_id)  tc_rnk
              , dense_rank() over(partition by order_month order by customer_id)  mth_rnk
              , dense_rank() over(order by customer_id)  cust_rnk

       from orders_traffic_channels

       where to_char(order_month, 'YYYY') = '2017'
     )

order by order_month, traffic_channel
;

备注

max()和dense_rank()的分区必须匹配
dense_rank()将对空值进行排序(所有排序都是相同的，最大值)。如果您不想计算null值，您需要一个case when customer_id is not null then dense_rank() ...etc...，或者如果您知道存在空值，可以从max()中减去一个值。

更新2022

红移中的分区计数仍未实现。

我得出的结论是，如果您在将其集成到生产管道中时非常小心，那么这个解决方案是合理的，请记住以下几点：

它创建了大量的代码，这会损害可读性和维护性。
将按组计数的过程隔离到一个转换阶段，而不是将其与同一查询中的其他逻辑概念混合。
使用带有count(distinct ..)的子查询和非分区组来获取每个不同的计数甚至更麻烦，也更难读。

然而，更好的方法是使用dataframe语言来支持分组滚动，比如Spark或Pandas。按组划分的星火滚动是紧凑和可读的，权衡会给您的流程带来另一个执行环境和语言。

票数 21

Stack Overflow用户

发布于 2020-05-07 23:35:44

虽然Redshift在其窗口函数中不支持不同的聚合，但它确实有一个listaggdistinct函数。所以你可以这么做：

regexp_count(
   listaggdistinct(customer_id, ',') over (partition by field2), 
   ','
) + 1

当然，如果,自然出现在customer_id字符串中，则必须找到一个安全的分隔符。

票数 5

Stack Overflow用户

发布于 2022-04-07 06:16:48

另一种方法是使用

在第一选择中：

row_number() over (partition by customer_id,order_month,traffic_channel) as row_n_month_channel

在下一个选择中

sum(case when row_n_month_channel=1 then 1 else 0 end)

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/47736584

复制

相似问题

问红移:在窗口分区上清点不同的客户
EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问红移:在窗口分区上清点不同的客户EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问红移:在窗口分区上清点不同的客户
EN