I have a table in PostgreSQL with two columns: original_id and duplicate_id.
Sample data:
original_id duplicate_id
1 1
2 2
3 3
4 4
5 5
6 6
I want to split this table randomly 50/50 so that I can attach a specific tag to each half.
Sample output:
original_id duplicate_id tag
1 1 control
2 2 treatment
3 3 treatment
4 4 control
5 5 treatment
6 6 control
Important:
1. The selection must be random.
2. The split must be 50/50 (if the number of rows is odd, as close to 50/50 as possible).
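To try the answers below, the sample data can be recreated with something like the following. This is only a sketch: the table name my_table is the one used in the first answer, and the integer column types are an assumption, since the question does not state them.

```sql
-- Assumption: both columns are integers (the question does not give the types).
CREATE TABLE my_table (
    original_id  integer,
    duplicate_id integer
);

-- Six rows matching the sample data: (1, 1), (2, 2), ..., (6, 6).
INSERT INTO my_table (original_id, duplicate_id)
SELECT i, i
FROM generate_series(1, 6) AS i;
```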
Posted on 2018-09-05 08:17:10
You can select half of the rows in random order with the following query:
select *
from my_table
order by random()
limit (select count(*) / 2 from my_table);
Use it to tag the rows:
with control as (
select *
from my_table
order by random()
limit (select count(*)/ 2 from my_table)
)
select
*,
case when t in (select t from control t) then 'control' else 'treatment' end as tag
from my_table t;
Working example in rextester.
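Posted on 2018-09-05 08:19:07
You can use row_number() OVER (ORDER BY random()) to give each record a random position. Then use that number in a CASE to assign the tag 'control' or 'treatment', depending on whether it is less than (or equal to) half of the table's row count.
For a SELECT, that looks like this: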
SELECT original_id,
duplicate_id,
CASE
WHEN rn <= (SELECT count(*) / 2
FROM elbat) THEN
'control'
ELSE
'treatment'
END tag
FROM (SELECT original_id,
duplicate_id,
row_number() OVER (ORDER BY random()) rn
FROM elbat) x;
If you want an UPDATE (I'm not sure about that), and assuming original_id and duplicate_id are unique, it could look like this:
UPDATE elbat t
SET tag = CASE
WHEN rn <= (SELECT count(*) / 2
FROM elbat) THEN
'control'
ELSE
'treatment'
END
FROM (SELECT original_id,
duplicate_id,
row_number() OVER (ORDER BY random()) rn
FROM elbat) x
WHERE x.original_id = t.original_id
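AND x.duplicate_id = t.duplicate_id;
db<>fiddle
(By the way, the SELECT result is a nice example of how the order of the returned rows can differ completely from the physical order, if the optimizer prefers it that way.)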
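The UPDATE above presupposes that elbat already has a tag column. A minimal sketch of the surrounding steps, assuming the column does not exist yet and that a plain text column is acceptable (neither is stated in the answer):

```sql
-- Assumption: tag does not exist yet; text is an illustrative type choice.
ALTER TABLE elbat ADD COLUMN tag text;

-- ... run the UPDATE shown above ...

-- Afterwards, check how many rows ended up with each tag.
SELECT tag, count(*) AS n
FROM elbat
GROUP BY tag
ORDER BY tag;
```

With an even row count, both tags should report the same n. With an odd count, 'treatment' gets the extra row, because count(*) / 2 is integer division and rounds down.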
Posted on 2018-09-05 11:56:17
I would use window functions:
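select t.*,
(case when seqnum <= cnt / 2
then 'treatment' else 'control'
end) as tag
from (select t.*,
count(*) over () as cnt,
row_number() over (order by random()) as seqnum
from t
) t;
Actually, random is random, so you don't need the count: the row numbers are already assigned in random order, and alternating on them gives a random half. You can use modulo arithmetic instead: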
select t.*,
(case when row_number() over (order by random()) % 2 = 1
then 'treatment' else 'control'
end) as tag
from t;
https://stackoverflow.com/questions/52175254
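The two queries in the last answer handle an odd row count slightly differently, although both stay as close to 50/50 as possible. The sketch below makes that visible; the 7-row input built with generate_series is a hypothetical stand-in for t, not part of the original answer.

```sql
-- Hypothetical 7-row input standing in for the table t.
WITH t AS (
    SELECT i AS original_id, i AS duplicate_id
    FROM generate_series(1, 7) AS i
), numbered AS (
    SELECT t.*,
           count(*) OVER () AS cnt,
           row_number() OVER (ORDER BY random()) AS seqnum
    FROM t
)
SELECT
    -- count-based rule: integer division, 7 / 2 = 3 rows tagged 'treatment'
    count(*) FILTER (WHERE seqnum <= cnt / 2) AS treatment_by_count,
    -- modulo rule: row numbers 1, 3, 5, 7, so 4 rows tagged 'treatment'
    count(*) FILTER (WHERE seqnum % 2 = 1) AS treatment_by_modulo
FROM numbered;
```

Which rows land in each group changes on every run, but the group sizes do not: the count-based rule gives the extra row to 'control', the modulo rule gives it to 'treatment'.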