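I have a training data set of 60,000 observations, and I want to create 9 training subsets from it. I want to sample randomly without replacement; I need 3 sets of 500 observations, 3 sets of 1,000 observations, and 3 sets of 2,000 observations.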

How can I do this using sample() in R?
Answered 2022-11-07 22:38:11
If your data.frame is named df, you can do:
sample_sizes <- c(rep(500,3), rep(1000,3), rep(2000,3))
sampling <- sample(60000, sum(sample_sizes))
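training_sets <- split(df[sampling,], rep(1:9, sample_sizes))
This samples without replacement across all of the data sets, so no row appears in more than one set. If you instead want sampling without replacement within each training set (but not across the training sets):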
sample_sizes <- c(rep(500,3), rep(1000,3), rep(2000,3))
sampling <- do.call(c, lapply(sample_sizes, function(i) sample(60000, i)))
training_sets <- split(df[sampling,], rep(1:9, sample_sizes))
Answered 2022-11-07 22:53:36
I'm not sure whether you want the output to look like the screenshot, but if so, you could do this:
library(tidyverse)
df <- tibble(rand = runif(6e4))
tibble(`Sample Size` = rep(c(500,1000,2000), each = 3)) |>
  mutate(name = rep(paste(c("First", "Second", "Third"), "Random Sample"), 3),
         samp = map2(`Sample Size`, row_number(),
                     \(x,y) {set.seed(y); df[sample(1:nrow(df), size = x),]})) |>
  pivot_wider(names_from = name, values_from = samp)
#> # A tibble: 3 x 4
#>   `Sample Size` `First Random Sample` `Second Random Sample` Third Random Samp~1
#>           <dbl> <list>                <list>                 <list>
#> 1           500 <tibble [500 x 1]>    <tibble [500 x 1]>     <tibble [500 x 1]>
#> 2          1000 <tibble [1,000 x 1]>  <tibble [1,000 x 1]>   <tibble>
#> 3          2000 <tibble [2,000 x 1]>  <tibble [2,000 x 1]>   <tibble>
#> # ... with abbreviated variable name 1: `Third Random Sample`
https://stackoverflow.com/questions/74353595
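The answers on this page differ in what "without replacement" covers: one draw split into nine sets never reuses a row, while nine independent draws only guarantee uniqueness inside each set. The following is a minimal base-R sketch of how one might check this. The one-column `df` with an `id` column is a hypothetical stand-in for the real data, and the seed value is arbitrary.

```r
# Hypothetical stand-in for the real 60000-row training data.
df <- data.frame(id = seq_len(60000))

sample_sizes <- c(rep(500, 3), rep(1000, 3), rep(2000, 3))
set_labels <- rep(seq_along(sample_sizes), sample_sizes)

set.seed(42)  # arbitrary seed, used only to make the sketch reproducible

# One draw of all 10500 rows, then split: no row is shared between sets.
global_idx <- sample(nrow(df), sum(sample_sizes))
disjoint_sets <- split(df[global_idx, , drop = FALSE], set_labels)

# One independent draw per set: rows are unique within a set,
# but the same row may turn up in more than one set.
indep_idx <- unlist(lapply(sample_sizes, function(n) sample(nrow(df), n)))
independent_sets <- split(df[indep_idx, , drop = FALSE], set_labels)

# Checks: set sizes, global uniqueness, and per-set uniqueness.
sizes_ok <- all(sapply(disjoint_sets, nrow) == sample_sizes)
no_overlap <- anyDuplicated(global_idx) == 0
within_ok <- all(sapply(independent_sets,
                        function(s) anyDuplicated(s$id) == 0))
```

Running `anyDuplicated(indep_idx)` on the second approach shows whether any row happened to be drawn for more than one set in a given run.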