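I'm in a bit of a rut with my R coding right now. I have been trying to use the mutate, seq and rep functions to generate a new column that iterates over several column values under different conditions, but the result is not correct. Here is a snippet of my data: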
library(tidyverse)
library(data.table)
library(stringr)
lipidData <- data.frame("Type"=c(rep("LDL",5),rep("HDL",5)),
"featureID"=c(12,12,12,12,13,13,14,15,16,17),
"featureID2"=c(21,22,23,26,31,31,31,31,38,40))
lipidWrong <- lipidData %>%
group_by(Type,featureID) %>%
group_by(Type,featureID2) %>%
mutate(lipidName=paste0(rep("lipid",n()),"_",seq(1,n())))
lipidWrong
Type featureID featureID2 lipidName
<fct> <dbl> <dbl> <chr>
1 LDL 12 21 lipid_1
2 LDL 12 22 lipid_1
3 LDL 12 23 lipid_1
4 LDL 12 26 lipid_1
5 LDL 13 31 lipid_1
6 HDL 13 31 lipid_1
7 HDL 14 31 lipid_2
8 HDL 15 31 lipid_3
9 HDL 16 38 lipid_1
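10 HDL 17 40 lipid_1
Instead of the incorrect table above, I want lipidName to be grouped by Type and featureID first, and then by Type and featureID2. Rows with the same Type and featureID count as the same lipid in lipidName. Rows with the same Type and featureID2 also count as the same lipid in lipidName. Since my actual dataset contains more than 100,000 rows, it would also be great to know how to number the lipids sequentially across the whole dataset, rather than only within each group as n() does after group_by.
I would like my result to look like this: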
lipidCorrect
Type featureID featureID2 lipidName
1 LDL 12 21 lipid_1 # same type and featureID
2 LDL 12 22 lipid_1 # same type and featureID
3 LDL 12 23 lipid_1 # same type and featureID
4 LDL 12 26 lipid_1 # same type and featureID
5 LDL 13 31 lipid_2 # same featureID and featureID2 as row 6, but a different Type
6 HDL 13 31 lipid_3 # same type and featureID2
7 HDL 14 31 lipid_3 # same type and featureID2
8 HDL 15 31 lipid_3 # same type and featureID2
9 HDL 16 38 lipid_4
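10 HDL 17 40 lipid_5
If I am doing something wrong with group_by() and mutate(), please let me know, and also whether there is a better way to produce the desired result.
Thanks!
Posted on 2020-10-22 05:11:59
If I understand the question correctly (with the help of @Gregor Thomas's nice clarifying questions and comments), a (clumsy) tidyverse-based solution could look like this.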
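On the group_by() part of the question: as far as I can tell, the second group_by() call in lipidWrong replaces the first one instead of combining with it, so the lipid_ numbers are generated only within each Type and featureID2 group. Below is a minimal sketch of that behaviour on a small made-up data frame (df, regrouped and added are illustrative names, not from the original post; the .add argument assumes dplyr 1.0.0 or later):

```r
library(dplyr)

# Small made-up data frame, for illustration only
df <- data.frame(Type = c("LDL", "LDL", "HDL"),
                 featureID = c(12, 12, 13),
                 featureID2 = c(21, 22, 31))

# By default a second group_by() discards the earlier grouping
regrouped <- df %>%
  group_by(Type, featureID) %>%
  group_by(Type, featureID2)
group_vars(regrouped)

# .add = TRUE keeps the existing grouping and appends the new variable
added <- df %>%
  group_by(Type, featureID) %>%
  group_by(featureID2, .add = TRUE)
group_vars(added)
```

Note that even with .add = TRUE the rows are grouped by all three columns at once, which is stricter than the "same featureID or same featureID2" rule described above, so this explains the wrong output but does not by itself produce the desired one.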
library(dplyr)
library(stringr)
lipidData %>%
group_by(Type, featureID) %>%
mutate(lipidGroup1 = +(n() > 1)) %>%
group_by(Type, featureID2) %>%
mutate(lipidGroup2 = +(n() > 1)) %>%
ungroup() %>%
mutate(lipidGroup3 = +(lipidGroup1 == 0 & lipidGroup2 == 0)) %>%
group_by(Type, featureID) %>%
mutate(lipidGroup1 = if_else(n() > 1 & row_number() == 1, 1, 0)) %>%
group_by(Type, featureID2) %>%
mutate(lipidGroup2 = if_else(n() > 1 & row_number() == 1, 1, 0)) %>%
ungroup() %>%
mutate(lipidName = str_c('lipid_', cumsum(lipidGroup1 + lipidGroup2 + lipidGroup3))) %>%
select(-starts_with('lipidGroup'))
# Type featureID featureID2 lipidName
# <chr> <dbl> <dbl> <chr>
# 1 LDL 12 21 lipid_1
# 2 LDL 12 22 lipid_1
# 3 LDL 12 23 lipid_1
# 4 LDL 12 26 lipid_1
# 5 LDL 13 31 lipid_2
# 6 HDL 13 31 lipid_3
# 7 HDL 14 31 lipid_3
# 8 HDL 15 31 lipid_3
# 9 HDL 16 38 lipid_4
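# 10 HDL 17 40 lipid_5
Posted on 2020-10-22 21:44:14
Here is a version that uses a helper variable to track which grouping generated the unique ID, and then converts that into the final variable: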
lipidData %>%
group_by(Type, featureID) %>%
mutate(
name_id = case_when(n() > 1 ~ paste("fid1", cur_group_id()), TRUE ~ NA_character_)
) %>%
group_by(Type,featureID2) %>%
mutate(
name_id = case_when(is.na(name_id) ~ paste("fid2", cur_group_id()), TRUE ~ name_id)
) %>%
ungroup() %>%
mutate(
lipidName = paste("lipid", as.integer(factor(name_id, levels = unique(name_id))), sep = "_")
) %>%
select(-name_id)
# # A tibble: 10 x 4
# Type featureID featureID2 lipidName
# <chr> <dbl> <dbl> <chr>
# 1 LDL 12 21 lipid_1
# 2 LDL 12 22 lipid_1
# 3 LDL 12 23 lipid_1
# 4 LDL 12 26 lipid_1
# 5 LDL 13 31 lipid_2
# 6 HDL 13 31 lipid_3
# 7 HDL 14 31 lipid_3
# 8 HDL 15 31 lipid_3
# 9 HDL 16 38 lipid_4
# 10 HDL 17 40 lipid_5
https://stackoverflow.com/questions/64468534
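The last step of the second answer numbers each distinct helper value by the order in which it first appears, which gives numbering across the whole dataset and not just within a group. A minimal base R sketch of that idea follows; the keys vector is a made-up stand-in for the name_id column, and match() is shown only as an equivalent alternative, not as what the answer itself uses:

```r
# Made-up key vector standing in for the helper column name_id
keys <- c("fid1 4", "fid1 4", "fid2 5", "fid2 2", "fid2 2", "fid2 3")

# Approach from the answer: factor levels in order of first appearance
by_factor <- as.integer(factor(keys, levels = unique(keys)))

# Equivalent base R alternative: position of each key among the unique keys
by_match <- match(keys, unique(keys))

# Build the final labels from the running group number
lipid_names <- paste0("lipid_", by_match)
lipid_names
```

Both approaches are vectorised, so they should scale to a table with 100,000 or more rows without any per-group looping.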