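I'm interested in de-identifying a sensitive dataset that contains both time-fixed and time-varying values. I want to (a) group all cases by Social Security number, (b) assign each group of cases a unique ID, and then (c) remove the Social Security number.
Here is an example dataset: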
personal_id gender temperature
111-11-1111 M 99.6
999-999-999 F 98.2
111-11-1111 M 97.8
999-999-999 F 98.3
888-88-8888 F 99.0
111-11-1111 M 98.9
Any solutions would be greatly appreciated.
Posted on 2016-09-29 03:41:16
dplyr has a group_indices function for creating unique group IDs:
library(dplyr)
data <- data.frame(personal_id = c("111-111-111", "999-999-999", "222-222-222", "111-111-111"),
gender = c("M", "F", "M", "M"),
temperature = c(99.6, 98.2, 97.8, 95.5))
data$group_id <- data %>% group_indices(personal_id)
data <- data %>% select(-personal_id)
data
gender temperature group_id
1 M 99.6 1
2 F 98.2 3
3 M 97.8 2
4 M 95.5 1
Or in the same pipeline (https://github.com/tidyverse/dplyr/issues/2160):
data %>%
mutate(group_id = group_indices(., personal_id))
Posted on 2020-06-25 21:40:42
As of dplyr 1.0.0, calling dplyr::group_indices() with grouping columns is deprecated. Use dplyr::cur_group_id() inside a grouped mutate() instead:
df %>%
group_by(personal_id) %>%
mutate(group_id = cur_group_id())
personal_id gender temperature group_id
<chr> <chr> <dbl> <int>
1 111-11-1111 M 99.6 1
2 999-999-999 F 98.2 3
3 111-11-1111 M 97.8 1
4 999-999-999 F 98.3 3
5 888-88-8888 F 99 2
6 111-11-1111 M 98.9 1
Posted on 2016-09-26 17:59:37
Using the dplyr package:
library(dplyr)
data <- data.frame(personal_id = c("111-111-111", "999-999-999", "222-222-222", "111-111-111"),
gender = c("M", "F", "M", "M"),
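temperature = c(99.6, 98.2, 97.8, 95.5))
First, extract the distinct personal_id values to create the unique IDs:
# factor() is needed because data.frame() no longer converts strings to factors by default (R >= 4.0)
cases <- data.frame(levels = levels(factor(data$personal_id)))
Using the row names, you can get a unique identifier:
cases <- cases %>%
mutate(id = rownames(cases))
Result: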
levels id
1 111-111-111 1
2 222-222-222 2
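3 999-999-999 3
Then, join the cases data frame to your data:
data <- left_join(data, cases, by = c("personal_id" = "levels"))
You can build a more distinctive ID by pasting the generated ID together with gender:
mutate(UID = paste(id, gender, sep=""))
Finally, remove personal_id and the simple id:
select(-personal_id, -id)
There you go :):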
data <- left_join(data, cases, by = c("personal_id" = "levels")) %>%
mutate(UID = paste(id, gender, sep="")) %>%
select(-personal_id, -id)
Result:
gender temperature UID
1 M 99.6 1M
2 F 98.2 3F
3 M 97.8 2M
4 M 95.5 1M
https://stackoverflow.com/questions/39650511
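For completeness, here is a minimal sketch that runs all three steps from the question, (a) through (c), on the question's example data using cur_group_id(). The names df and deidentified are assumptions made for illustration, and cur_group_id() needs dplyr 1.0.0 or later.

```r
library(dplyr)  # cur_group_id() requires dplyr >= 1.0.0

# Example data copied from the question; the name `df` is an assumption.
df <- data.frame(
  personal_id = c("111-11-1111", "999-999-999", "111-11-1111",
                  "999-999-999", "888-88-8888", "111-11-1111"),
  gender      = c("M", "F", "M", "F", "F", "M"),
  temperature = c(99.6, 98.2, 97.8, 98.3, 99.0, 98.9),
  stringsAsFactors = FALSE
)

deidentified <- df %>%
  group_by(personal_id) %>%               # (a) group rows by the identifier
  mutate(group_id = cur_group_id()) %>%   # (b) one integer ID per group
  ungroup() %>%                           # drop the grouping before removing the key
  select(-personal_id)                    # (c) remove the identifier

deidentified
```

The ungroup() call comes before select() because dplyr keeps grouping columns in a grouped data frame, so dropping personal_id while the data is still grouped would silently add it back.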