I have the following dataset, in which each observation belongs to a cluster and a group:
df <- data.frame(
  cluster = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5),
  group = c("g1", "g2", "g1", "g2", "g1", "both", "g1", "g1", "both", "both", "g1", "g2", "g1",
            "g1", "g1", "both", "g1", "g2", "both"),
  stringsAsFactors = FALSE
)

cluster group
1 "g1"
1 "g2"
1 "g1"
1 "g2"
2 "g1"
2 "both"
2 "g1"
2 "g1"
3 "both"
3 "both"
3 "g1"
3 "g2"
4 "g1"
4 "g1"
4 "g1"
4 "both"
5 "g1"
5 "g2"
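5 "both"

What I would like to obtain is to convert any element of "group" that is equal to "both" into either "g1" or "g2", according to this rule:
For any cluster, the elements equal to "both" should take the label of the least frequent element. (So if a cluster has 4 observations labelled "g1", 2 labelled "g2", and 2 labelled "both", the two "both" elements become "g2".) In the case where a cluster has one element equal to "g1", another equal to "g2", and two elements equal to "both", I want one of them converted to "g1" and the other to "g2". Basically, for each cluster, I want to convert the elements of class "both" so as to maximize the minimum frequency of the two classes "g1" and "g2":
max(min(freq(g1), freq(g2)))
(If, in a cluster, freq(g1) = 2 and freq(g2) = 3 and I have one element equal to "both", I want to convert it to "g1", so that freq(g1) = 3.)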
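
Restated in terms of counts, the rule determines how many "both" entries should go to each label. A minimal sketch in base R, assuming a cluster contains only the labels "g1", "g2" and "both"; `split_both` is a hypothetical helper name, not something from the question:

```r
# Hypothetical helper: given the counts of "g1", "g2" and "both" in one cluster,
# return how many "both" entries to relabel as g1 and as g2 so that
# min(freq(g1), freq(g2)) is as large as possible.
split_both <- function(n_g1, n_g2, n_both) {
  total <- n_g1 + n_g2 + n_both
  # The final g1 count should be as close to total / 2 as relabelling allows:
  # it cannot drop below n_g1 or rise above n_g1 + n_both.
  final_g1 <- min(max(total %/% 2, n_g1), n_g1 + n_both)
  to_g1 <- final_g1 - n_g1
  c(to_g1 = to_g1, to_g2 = n_both - to_g1)
}

split_both(2, 3, 1)  # to_g1 = 1, to_g2 = 0
split_both(4, 2, 2)  # to_g1 = 0, to_g2 = 2
split_both(1, 1, 2)  # to_g1 = 1, to_g2 = 1
```

The three calls correspond to the three cases described in the rule. When the total is odd, this sketch gives the extra element to g2; giving it to g1 is equally valid.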
So the expected result is:
cluster group
1 "g1"
1 "g2"
1 "g1"
1 "g2"
2 "g1"
2 "g2"
2 "g1"
2 "g1"
3 "g1" (or "g2")
3 "g2" (or "g1")
3 "g1"
3 "g2"
4 "g1"
4 "g1"
4 "g1"
4 "g2"
5 "g1"
5 "g2"
5 "g2" (or "g1")

I hope my goal is clear.
Posted on 2020-06-05 15:21:31
This seems like a somewhat long-winded way of doing it, but hopefully it is understandable and works well:
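library(dplyr)

# Relabel the "both" entries of one cluster so that min(freq(g1), freq(g2)) is maximised.
f <- function(x)
{
  idx <- which(x == "both")
  n_replace <- length(idx)
  n_g1 <- length(which(x == "g1"))
  n_g2 <- length(which(x == "g2"))
  n_diff <- n_g1 - n_g2
  # First close the gap with the less frequent label, using at most n_replace entries.
  n_fill <- min(abs(n_diff), n_replace)
  result <- rep(ifelse(n_diff > 0, "g2", "g1"), n_fill)
  # Then alternate g1/g2 over whatever is left.
  result <- c(result, rep(c("g1", "g2"), length.out = n_replace - n_fill))
  # Assign by position so that each "both" receives exactly one replacement.
  x[idx] <- result
  x
}

df %>%
  group_by(cluster) %>%
  mutate(group = f(group))

The result is as follows: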
# A tibble: 19 x 2
# Groups: cluster [5]
cluster group
<int> <chr>
1 1 g1
2 1 g2
3 1 g1
4 1 g2
5 2 g1
6 2 g2
7 2 g1
8 2 g1
9 3 g1
10 3 g2
11 3 g1
12 3 g2
13 4 g1
14 4 g1
15 4 g1
16 4 g2
17 5 g1
18 5 g2
19 5 g1

https://stackoverflow.com/questions/62217866
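
A quick way to check a relabelled column against the rule is to compare, per cluster, the achieved min(freq(g1), freq(g2)) with the best achievable value. The following is only a sketch in base R: `check_relabel` is a hypothetical helper, it assumes the only labels are "g1", "g2" and "both", and the vectors simply repeat the question's data and the output shown above.

```r
# Hypothetical checker: TRUE if, in every cluster, no "both" remains, the
# original g1/g2 labels are untouched, and min(freq(g1), freq(g2)) is optimal.
check_relabel <- function(cluster, before, after) {
  ok <- tapply(seq_along(cluster), cluster, function(i) {
    n_g1 <- sum(before[i] == "g1")
    n_g2 <- sum(before[i] == "g2")
    n_both <- sum(before[i] == "both")
    # Best achievable minimum: a balanced split, unless the smaller class
    # cannot catch up even when it receives every "both" entry.
    best <- min((n_g1 + n_g2 + n_both) %/% 2, min(n_g1, n_g2) + n_both)
    got <- min(sum(after[i] == "g1"), sum(after[i] == "g2"))
    kept <- before[i] != "both"
    !any(after[i] == "both") && all(after[i][kept] == before[i][kept]) && got == best
  })
  all(ok)
}

cluster <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5)
before <- c("g1", "g2", "g1", "g2", "g1", "both", "g1", "g1", "both", "both",
            "g1", "g2", "g1", "g1", "g1", "both", "g1", "g2", "both")
after <- c("g1", "g2", "g1", "g2", "g1", "g2", "g1", "g1", "g1", "g2",
           "g1", "g2", "g1", "g1", "g1", "g2", "g1", "g2", "g1")

check_relabel(cluster, before, after)   # TRUE
check_relabel(cluster, before, before)  # FALSE, "both" entries remain
```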