数据集这里的任务是:统计具有多个事务并且至少有一个事务在另一个事务间隔7天内完成的用户。
数据集的结构:行、userId、orderId、日期
日期格式为YYYY MM-DDTHH:MM:SS示例: 2016-09-16T11:32:06
我已经完成了第一部分(使用多个事务计数用户),但我不知道如何在同一个查询中完成第二部分。我会感谢你的帮助。
这是控制台:
query = '''
SELECT COUNT(*)
FROM
(SELECT userId FROM `dataset` GROUP BY userId HAVING COUNT(orderId) > 1)
'''
project_id = 'acdefg'
df = pd.io.gbq.read_gbq(query, project_id=project_id, dialect='standard')
display(df)发布于 2022-02-13 15:02:18
要解决这个问题,您希望能够将每条记录与前一条记录进行比较:何时是来自同一用户的最后一条订单。这暗示了分区和窗口函数的使用,在本例中是LAG。
解决此问题的一种可能方法是组织每个用户的记录,并通过orderDate对它们进行排序,然后对每条记录查看一下上面的记录:
WITH intermediate_table AS (
SELECT
userId,
orderDate,
LAG(orderDate)
OVER (PARTITION BY userId ORDER BY orderDate) -- this is where we pick the orderDate of the record right above, once the orders are organized by userId and ordered by orderDate
FROM `dataset.table`
)
SELECT userId
FROM intermediate_table
WHERE DATE_DIFF(orderDate, previous_order, DAY) <= 7
GROUP BY userId一旦orderDate和previous_order信息被收集到相同的记录中,就很容易对它们进行比较,看看两者之间是否有不到7天的时间。
(GROUP仅用于在结果表中返回一次userIds )
发布于 2022-02-14 02:07:47
这可能是你所需要的:
-- for each order calculate the days since that customer's last order
order_profiler AS (
SELECT
orderId,
orderDate,
custId,
DATE_DIFF(orderDate, LAG(orderDate) OVER (PARTITION BY custId ORDER BY orderDate), day) AS order_latency_days,
FROM
`dataset.table`
)
SELECT
custId,
FROM order_profiler
WHERE order_latency_days <= 7
GROUP BY custIdhttps://stackoverflow.com/questions/71100314
复制相似问题