我正在使用SQL和R进行分析,我想连接两个表,如下所示:
表1:
ID date
a11 20150302
a11 20150302
a22 20150303
a22 20150304
a33 20150306
a44 20150306
a55 20150307
a66 20150308
a66 20150309
a66 20150310表2
ID date
a11 20150303
a22 20150304
a22 20150305
a44 20150306
a66 20150308
a66 20150310情况是这样的:客户被呼叫(table1),客户回电话索取更多信息(表二)。
所以我在分析中想做的是:
结果:
ID table1 date table2 date
a11 20150302
a11 20150302 20150303
a22 20150303 20150304
a22 20150304 20150305
a44 20150306 20150306
a66 20150308 20150308
a66 20150309
a66 20150310 20150310对于这个多到多(但我不希望结果是n*m,我需要1到1)匹配/连接,有什么解决方案吗?解决方案,无论是在R或SQL将是需要的。
谢谢
发布于 2015-11-06 20:38:18
SELECT ID, Date1, Date2 FROM (
SELECT joined.ID, joined.Date1, joined.Date2, ROW_NUMBER() OVER (PARTITION BY ID, Date1 ORDER BY Date2 ASC) AS RowNumber
FROM(
SELECT t1.ID, t1.[Date] as Date1, CASE WHEN t2.[Date] >= t1.[Date] THEN t2.[Date] ELSE NULL END as [Date2]
FROM Table1 t1
LEFT JOIN Table2 t2 ON t1.ID = t2.ID) as joined
WHERE joined.Date2 IS NOT NULL
) partitioned
WHERE RowNumber = 1连接ID上的两个表,并删除不在Table 1中的Table 2中的行。然后使用ROW_NUMBER() OVER (PARTITION BY ID, Date1 ORDER BY Date2 ASC)来匹配由WHERE RowNumber = 1子句找到的最近的日期。
生成与所列条件一致的输出:
+-----+----------+----------+
| ID | Date1 | Date2 |
+-----+----------+----------+
| a11 | 20150302 | 20150303 |
| a22 | 20150303 | 20150304 |
| a22 | 20150304 | 20150304 |
| a44 | 20150306 | 20150306 |
| a66 | 20150308 | 20150308 |
| a66 | 20150309 | 20150310 |
| a66 | 20150310 | 20150310 |
+-----+----------+----------+发布于 2015-11-06 22:30:31
我在R和dplyr中得到的结果是一样的。对于a22,20150304初始呼叫的最近回调是20150304,而不是20150305。您需要一个时间组件来区分这一点。
library(dplyr)
inner_join(table1,table2,"ID")%>%
group_by(ID,date1)%>%
filter(date1<=date2)%>%
filter(row_number() == 1)
>
Source: local data frame [7 x 3]
Groups: ID, date1 [7]
ID date1 date2
(chr) (int) (int)
1 a11 20150302 20150303
2 a22 20150303 20150304
3 a22 20150304 20150304
4 a44 20150306 20150306
5 a66 20150308 20150308
6 a66 20150309 20150310
7 a66 20150310 20150310数据
table1 <-read.table(text="ID date1
a11 20150302
a11 20150302
a22 20150303
a22 20150304
a33 20150306
a44 20150306
a55 20150307
a66 20150308
a66 20150309
a66 20150310", header=T,stringsAsFactors =F)
table2 <-read.table(text="ID date2
a11 20150303
a22 20150304
a22 20150305
a44 20150306
a66 20150308
a66 20150310", header=T,stringsAsFactors =F)发布于 2015-11-06 22:45:17
这并不能解决问题,但很接近,也许会给你一个想法
SqlFiddleDemo
With t_left as (
SELECT *, row_number() over (partition by "ID" order by date desc ) as rn
FROM Table1 T
WHERE EXISTS (SELECT 1 FROM Table2 P WHERE T."ID" = P."ID")
),
t_right as (
SELECT *, row_number() over (partition by "ID" order by date desc) as rn
FROM Table2
)
SELECT t_left."ID", t_left."date", t_right."date"
FROM t_left
LEFT JOIN t_right
on t_left.rn = t_right.rn
and t_left."ID" = t_right."ID"
ORDER BY t_left."ID", t_left."date"输出
| ID | date | date |
|-----|----------|----------|
| a11 | 20150302 | 20150303 |
| a11 | 20150302 | (null) |
| a22 | 20150303 | 20150304 |
| a22 | 20150304 | 20150305 |
| a44 | 20150306 | 20150306 |
| a66 | 20150308 | (null) |
| a66 | 20150309 | 20150308 |
| a66 | 20150310 | 20150310 |https://stackoverflow.com/questions/33575076
复制相似问题