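I need to clean up a database containing more than 2 million people and their occupations. Looking through the unique occupations, I came across entries that need cleaning. For example, the occupation list describes an HR employee in the following variants: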
HR-employee
HR: employee
HR:Employee
HR - employee
HR : employee
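HR employee
Now I need a way to map all of the entries above to the single description "HR employee". The same goes for about 2,000 other job descriptions. Is there a simple way to merge all of these "duplicate job descriptions" into one job description?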
Posted on 2022-10-02 12:37:54
Assuming the "employee" variants are all preceded by "HR", as in your example, you can use str_detect and ifelse:
library(dplyr)
library(stringr)
df %>%
  mutate(job_clean = ifelse(str_detect(job, "(?i)\\bemployee\\b"),
                            "HR Employee",
                            job))
              job       job_clean
1     HR-employee     HR Employee
2    HR: employee     HR Employee
3 ZZ - Unemployed ZZ - Unemployed
4     HR:Employee     HR Employee
5  XY : smth else  XY : smth else
6   HR - employee     HR Employee
7   HR : employee     HR Employee
8     HR employee     HR Employee
Note that (?i) makes the match case-insensitive, and \\b is a word-boundary marker asserting that only the exact word "employee" is matched and not, for example, "employees".
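To see what the pattern accepts and rejects, here is a minimal sketch. The test strings in x are made up for illustration and are not taken from the question's data.

```r
library(stringr)

# Hypothetical test strings (assumption: not part of the original data)
x <- c("HR-employee", "HR:Employee", "HR employees", "ZZ - Unemployed")

# (?i) switches on case-insensitive matching;
# \\b requires a word boundary on both sides of "employee"
matches <- str_detect(x, "(?i)\\bemployee\\b")
matches
```

The first two strings should match, while "HR employees" fails the trailing word boundary and "ZZ - Unemployed" does not contain the word at all.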
Data:
df <- data.frame(
  job = c("HR-employee", "HR: employee", "ZZ - Unemployed", "HR:Employee",
          "XY : smth else", "HR - employee", "HR : employee", "HR employee"))
EDIT:
If there is not only an "HR" category but also a "Floor" category, you can nest the ifelse calls:
df %>%
  mutate(job_clean = ifelse(str_detect(job, "(?i)hr.*\\bemployee\\b"), "HR Employee",
                     ifelse(str_detect(job, "(?i)floor.*\\bemployee\\b"), "Floor Employee", job)))
               job       job_clean
1      HR-employee     HR Employee
2     HR: employee     HR Employee
3  ZZ - Unemployed ZZ - Unemployed
4      HR:Employee     HR Employee
5   XY : smth else  XY : smth else
6    HR - employee     HR Employee
7    HR : employee     HR Employee
8      HR employee     HR Employee
9   Floor employee  Floor Employee
10  Floor-employee  Floor Employee
Data:
df <- data.frame(
  job = c("HR-employee", "HR: employee", "ZZ - Unemployed", "HR:Employee",
          "XY : smth else", "HR - employee", "HR : employee", "HR employee",
          "Floor employee", "Floor-employee"))
https://stackoverflow.com/questions/73925681
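With more than two categories, nested ifelse calls become hard to read. One possible alternative, which is not part of the answer above, is dplyr's case_when, where each pattern gets its own line. The sketch below is only an illustration: the shortened data frame and the name df_clean are assumptions made for the example.

```r
library(dplyr)
library(stringr)

# Shortened example data (assumption: a subset of the answer's data frame)
df <- data.frame(
  job = c("HR-employee", "HR : employee", "Floor employee",
          "Floor-employee", "ZZ - Unemployed"),
  stringsAsFactors = FALSE)

# Conditions are evaluated top to bottom; the first match wins
df_clean <- df %>%
  mutate(job_clean = case_when(
    str_detect(job, "(?i)hr.*\\bemployee\\b")    ~ "HR Employee",
    str_detect(job, "(?i)floor.*\\bemployee\\b") ~ "Floor Employee",
    TRUE                                         ~ job  # keep all other values unchanged
  ))
df_clean
```

Adding another category then means adding one more line before the final TRUE fallback, rather than another level of nesting.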