我有一个BigQuery表,如下所示:
date hits_eventInfo_Category hits_eventInfo_Action session_id user_id hits_time hits_eventInfo_Label
20151021 Air Search 1445001 A232 1952 City1
20151021 Air Select 1445001 A232 2300 Vendor1
20151021 Air Search 1445001 A111 1000 City2
20151021 Air Search 1445001 A111 1900 City3
20151021 Air Select 1445001 A111 7380 Vendor2
20151021 Air Search 1445001 A580 1000 City4
20151021 Air Search 1445001 A580 1900 City5
20151021 Air Search 1445001 A580 1900 City6
20151021 Air Select 1445001 A580 7380 Vendor3该表显示了3个用户- A232、A111和A580 -的用户活动,满足以下条件:
i) A232 - Made 1 Search at 'City1' and chose 'Vendor1' from 'City1'
ii) A111 - Made the 1st search at 'City2' and did not choose any vendor from there. Made a 2nd search at 'City3' and then ultimately chose a 'Vendor2' from here.
iii) A580 - 1st search at 'City4', no vendor chosen. 2nd search at 'City5', no vendor chosen. 3rd search at 'City6', 'Vendor3' chosen from City6.我只对检索用户实际选择供应商的城市感兴趣,也就是说,对用户之前进行的没有选择供应商的搜索不感兴趣。
所需输出表:
date hits_eventInfo_Category hits_eventInfo_Action session_id user_id hits_time city vendor
20151021 Air Search 1445001 A232 1952 City1 Vendor1
20151021 Air Search 1445001 A111 1900 City3 Vendor2
20151021 Air Search 1445001 A580 1900 City6 Vendor3在对hits_eventInfo_eventLabel进行分区并按hits_time (即LAG(hits_eventInfo_eventLabel,1) OVER( PARTITION BY user_id ORDER BY hits_time) )排序之后,我一直在尝试使用user_id字段上的滞后函数来实现这一点
但是,由于我使用的滞后偏移量为1,所以上面的表达式帮助我仅为用户A232获得所需的输出(因为他只进行了1次搜索,这意味着在选择供应商之前的前一条记录肯定是搜索记录)。
有没有一种方法可以使这个滞后表达式更加动态,以便它在进行选择之前只检索搜索到的最接近的位置-而不管在进行选择之前进行了多少次搜索?
或
有没有其他的函数/方法可以实现这一点?
发布于 2015-10-28 05:58:10
select
date,
hits_eventInfo_Category,
hits_eventInfo_Action,
session_id,
user_id,
hits_time,
prev as city,
hits_eventInfo_Label as vendor
from (
select *,
lag(hits_eventInfo_Label, 1) over(partition by user_id order by hits_time) as prev
from dataset.table
)
where hits_eventInfo_Action = 'Select'https://stackoverflow.com/questions/33378812
复制相似问题