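I have a dataset containing employee IDs, names, and bank account details. Some employees appear more than once: the same name may be listed under the same employee ID or under different employee IDs. Some of these same-name records share one bank account number, while others have different bank account numbers under the same name. The goal is to find employees with the same name but different bank account numbers. Here is a sample of the data: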
| Emp_id | Name | Bank Account |
|--------|:-------:|-------------:|
| 123 | Joan | 6758 |
| 134 | Karyn | 1244 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
| 235 | Larry | 5201 |
| 433 | Larry | 5201 |
| 231 | Larry | 5201 |
| 120 | Amy | 7890 |
| 135 | Amy | 7890 |
| 150 | Chris | 1280 |
| 150 | Chris | 6565 |
| 900 | Cassy | 1280 |
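| 900 | Cassy | 9873 |

I have to find the duplicated employees based on their names, which I can do successfully. Once that is done, I have to identify the employees who share a name but have different bank account numbers. The current problem is that the code does not group the employees by name and then look for different bank accounts within each name. Instead, it compares account numbers across different individuals, and whenever it finds a repeated number it removes the repeated row. For example, Chris and Cassy both have the bank account number '1280', so the code treats these rows as duplicates and drops one of Chris's records (the one with account number 1280). The output I get looks like this: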
| Emp_id | Name | Bank Account |
|--------|:-----:|-------------:|
| 120 | Amy | 7890 |
| 900 | Cassy | 1280 |
| 900 | Cassy | 9873 |
| 150 | Chris | 6565 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |

This is the code I used:
library(dplyr)

sample <- data.frame(
  Id = c("123","134","143","143","235","433","231","120","135","150","150","900","900"),
  Name = c("Joan","Karyn","Larry","Larry","Larry","Larry","Larry","Amy","Amy","Chris","Chris","Cassy","Cassy"),
  Bank_Account = c("6758","1244","4900","5201","5201","5201","5201","7890","7890","1280","6565","1280","9873")
)
# Count how often each name occurs and keep the names that occur more than once
n_occur <- data.frame(table(sample$Name))
n_occur <- n_occur[n_occur$Freq > 1, ]
Duplicates <- sample[sample$Name %in% n_occur$Var1[n_occur$Freq > 1], ]
Duplicates <- Duplicates %>% arrange(Name)
# Problem line: drops repeated account numbers across all names, not within each name
Duplicates <- Duplicates[!duplicated(Duplicates$Bank_Account), ]

However, the actual output should compare bank account numbers within each name (same name). The output should look like this:
| Emp_id | Name | Bank Account |
|--------|:-------:|-------------:|
| 900 | Cassy | 1280 |
| 900 | Cassy | 9873 |
| 150 | Chris | 1280 |
| 150 | Chris | 6565 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |

Can someone tell me the correct code?
Answered 2019-07-19 14:55:24
We can use n_distinct inside filter:
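library(dplyr)

sample %>%
  group_by(Name) %>%
  filter(n() > 1) %>%                        # names that occur more than once
  group_by(Id, .add = TRUE) %>%              # dplyr >= 1.0.0; older versions use add = TRUE
  filter(n_distinct(Bank_Account) > 1) %>%   # same name and Id, different accounts
  arrange(desc(Id))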
# A tibble: 6 x 3
# Groups: Name, Id [3]
# Id Name Bank_Account
# <fct> <fct> <fct>
#1 900 Cassy 1280
#2 900 Cassy 9873
#3 150 Chris 1280
#4 150 Chris 6565
#5 143 Larry 4900
#6 143   Larry 5201

Answered 2019-07-19 13:19:35
Step 1 - identify the duplicated names:
library(dplyr)

step_1 <- sample %>%
  arrange(Name) %>%
  mutate(dup = duplicated(Name)) %>%
  filter(Name %in% unique(as.character(Name[dup])))

Step 2 - identify the duplicated accounts for those names:
step_2 <- step_1 %>%
  group_by(Name, Bank_Account) %>%
  mutate(dup = duplicated(Bank_Account)) %>%
  filter(!dup)

https://stackoverflow.com/questions/57113195
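One thing to watch with the two-step approach: it keeps one row per name and account pair, so a name whose repeated rows all share a single account (Amy in the sample data) still comes through with one row. A minimal sketch of one way to drop such names as well is shown here. It assumes dplyr is installed; `step_3` is a hypothetical extra step that is not part of the original answer.

```r
library(dplyr)

# Sample data from the question
sample <- data.frame(
  Id = c("123","134","143","143","235","433","231","120","135","150","150","900","900"),
  Name = c("Joan","Karyn","Larry","Larry","Larry","Larry","Larry","Amy","Amy","Chris","Chris","Cassy","Cassy"),
  Bank_Account = c("6758","1244","4900","5201","5201","5201","5201","7890","7890","1280","6565","1280","9873"),
  stringsAsFactors = FALSE
)

# Step 1 from the answer: keep names that occur more than once
step_1 <- sample %>%
  arrange(Name) %>%
  mutate(dup = duplicated(Name)) %>%
  filter(Name %in% unique(as.character(Name[dup])))

# Step 2 from the answer: keep one row per (Name, Bank_Account) pair
step_2 <- step_1 %>%
  group_by(Name, Bank_Account) %>%
  mutate(dup = duplicated(Bank_Account)) %>%
  filter(!dup)

# Hypothetical step 3: keep only names that still have more than one account
step_3 <- step_2 %>%
  group_by(Name) %>%
  filter(n_distinct(Bank_Account) > 1) %>%
  ungroup() %>%
  select(-dup)

print(step_3)
```

With the sample data, `step_2` still contains a row for Amy, while `step_3` keeps only Cassy, Chris, and Larry.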
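For readers who prefer to avoid dplyr, the same per-name, per-Id check can be sketched in base R with `ave()`. This is an illustrative alternative, not the method of either answer; the helper names `idx`, `key`, `n_rows_name`, `n_acct`, and `result` are hypothetical.

```r
# Sample data from the question
sample <- data.frame(
  Id = c("123","134","143","143","235","433","231","120","135","150","150","900","900"),
  Name = c("Joan","Karyn","Larry","Larry","Larry","Larry","Larry","Amy","Amy","Chris","Chris","Cassy","Cassy"),
  Bank_Account = c("6758","1244","4900","5201","5201","5201","5201","7890","7890","1280","6565","1280","9873"),
  stringsAsFactors = FALSE
)

idx <- seq_len(nrow(sample))

# For every row, the number of rows that share its Name
n_rows_name <- ave(idx, sample$Name, FUN = length)

# For every row, the number of distinct accounts within its Name/Id pair
key <- paste(sample$Name, sample$Id)
n_acct <- ave(idx, key,
              FUN = function(i) length(unique(sample$Bank_Account[i])))

# Keep repeated names whose Name/Id pair has more than one account
result <- sample[n_rows_name > 1 & n_acct > 1, ]
result <- result[order(result$Name), ]

print(result)
```

The counts are computed on row indices rather than on `Bank_Account` directly so that `ave()` returns integers instead of coercing them to character.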