I have the following data in R:
id = 1:5
col1 = c("john", "henry", "adam", "jenna", "Phone: 222 2222")
col2 = c("river B8C 9L4", "Field U9H 5E2 PP", "NA", "ocean A1B 5H1 dd", "dave")
col3 = c("Phone: 111 1111 111", "steve", "forest K0Y 1U9 hu2", "NA", "NA")
col4 = c("matt", "peter", "Phone: 333 333 1113", "Phone: 444 111 1153", "kevin")
my_data = data.frame(id, col1, col2, col3, col4)
  id            col1             col2                col3                col4
1  1            john    river B8C 9L4 Phone: 111 1111 111                matt
2  2           henry Field U9H 5E2 PP               steve               peter
3  3            adam               NA  forest K0Y 1U9 hu2 Phone: 333 333 1113
4  4           jenna ocean A1B 5H1 dd                  NA Phone: 444 111 1153
5  5 Phone: 222 2222             dave                  NA               kevin
I am trying to accomplish the following task: I want to create a new dataset with the columns id, name, address and phone, filled in row by row.
Below is an example of the desired output:
  id              name            address               phone
1  1         john matt      river B8C 9L4 Phone: 111 1111 111
2  2 henry steve peter   Field U9H 5E2 PP                  NA
3  3              adam forest K0Y 1U9 hu2 Phone: 333 333 1113
4  4             jenna   ocean A1B 5H1 dd Phone: 444 111 1153
5  5        dave kevin                 NA     Phone: 222 2222
Below is the code I have written:
my_data$col1[grep("Phone", my_data$col1)]
my_data$col2[grep("Phone", my_data$col2)]
my_data$col3[grep("Phone", my_data$col3)]
my_data$col4[grep("Phone", my_data$col4)]
my_data$col1[grep( '(([A-Z] ?[0-9]){3})|.', '\\1' , my_data$col1)]
my_data$col2[grep('(([A-Z] ?[0-9]){3})|.', '\\1', my_data$col2)]
my_data$col3[grep('(([A-Z] ?[0-9]){3})|.', '\\1', my_data$col3)]
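my_data$col4[grep('(([A-Z] ?[0-9]){3})|.', '\\1', my_data$col4)]
Based on the code above, my idea is to work out at each step which columns match the condition, and then use coalesce() from dplyr to build the final dataset. But I think this may be a very long way round to solve the problem.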
Can someone suggest a faster way to solve this?
Thanks!
Answered on 2022-11-18 17:46:34
Try this:
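tmp <- apply(my_data[,-1], 1, function(z) {
  # drop real NAs and the literal string "NA"
  z <- z[!is.na(z) & z != "NA"]
  # label each remaining cell; the first matching rule wins
  ind <- dplyr::case_when(
    grepl("^Phone:", z) ~ "phone",
    grepl("(([A-Z] ?[0-9]){3})", z) ~ "address",
    TRUE ~ "name")
  # paste together the cells that share a label, one column per label
  data.frame(lapply(split(z, ind), paste, collapse = " "))
})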
tmp
# [[1]]
#         address      name               phone
# 1 river B8C 9L4 john matt Phone: 111 1111 111
# [[2]]
#            address              name
# 1 Field U9H 5E2 PP henry steve peter
# [[3]]
#              address name               phone
# 1 forest K0Y 1U9 hu2 adam Phone: 333 333 1113
# [[4]]
#            address  name               phone
# 1 ocean A1B 5H1 dd jenna Phone: 444 111 1153
# [[5]]
#         name           phone
# 1 dave kevin Phone: 222 2222
With this, we can combine them (I prefer dplyr::bind_rows or data.table::rbindlist).
dplyr::bind_rows(tmp)
#              address              name               phone
# 1      river B8C 9L4         john matt Phone: 111 1111 111
# 2   Field U9H 5E2 PP henry steve peter                <NA>
# 3 forest K0Y 1U9 hu2              adam Phone: 333 333 1113
# 4   ocean A1B 5H1 dd             jenna Phone: 444 111 1153
# 5               <NA>        dave kevin     Phone: 222 2222
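The elements of tmp do not all have the same columns (row 2 has no phone, row 5 has no address), so a plain rbind would fail and a combiner that fills missing columns is needed. As a rough sketch of the same step in base R only, one could pad each piece to a fixed column set first. The toy list below stands in for tmp, and pad_cols is a hypothetical helper written for this illustration, not a function from any package; R 4.0 or later is assumed for the character-column defaults.

```r
# Toy stand-in for `tmp`: one-row data frames whose columns differ per row.
tmp <- list(
  data.frame(address = "river B8C 9L4", name = "john matt",
             phone = "Phone: 111 1111 111", stringsAsFactors = FALSE),
  data.frame(address = "Field U9H 5E2 PP", name = "henry steve peter",
             stringsAsFactors = FALSE),
  data.frame(name = "dave kevin", phone = "Phone: 222 2222",
             stringsAsFactors = FALSE)
)

# Target columns, in the order wanted in the final result.
cols <- c("name", "address", "phone")

# pad_cols() is a hypothetical helper: it adds any absent column as NA
# and returns the columns in a common order so that rbind() can stack them.
pad_cols <- function(d) {
  for (nm in setdiff(cols, names(d))) d[[nm]] <- NA_character_
  d[cols]
}

combined <- do.call(rbind, lapply(tmp, pad_cols))
combined
```

This trades the convenience of bind_rows for having no package dependency, and fixing cols up front also guarantees that a column is present even if no row contains, say, a phone number.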
cbind(my_data[,1,drop=FALSE], dplyr::bind_rows(tmp))
#   id            address              name               phone
# 1  1      river B8C 9L4         john matt Phone: 111 1111 111
# 2  2   Field U9H 5E2 PP henry steve peter                <NA>
# 3  3 forest K0Y 1U9 hu2              adam Phone: 333 333 1113
# 4  4   ocean A1B 5H1 dd             jenna Phone: 444 111 1153
# 5  5               <NA>        dave kevin     Phone: 222 2222
Answered on 2022-11-18 18:13:22
A grep-based solution:
setNames(
  data.frame(
    my_data$id,
    sapply(c("^[a-z]+[a-z]+$", "^[^P]+.*[[:digit:]]+", "Phone"), function(srch)
      apply(my_data[,-1], 1, function(x)
        paste(grep(srch, x, value = TRUE), collapse = " ")))),
  c("Id", "Name", "Address", "Phone"))
  Id              Name            Address               Phone
1  1         john matt      river B8C 9L4 Phone: 111 1111 111
2  2 henry steve peter   Field U9H 5E2 PP
3  3              adam forest K0Y 1U9 hu2 Phone: 333 333 1113
4  4             jenna   ocean A1B 5H1 dd Phone: 444 111 1153
5  5        dave kevin                        Phone: 222 2222
https://stackoverflow.com/questions/74493189