House_Number<-c("11", "14 Jas", "24 Baker Street", "38 Home Close", "Flat 6, 85", "Flat 9", "38 Hightower Close BG6 7HU")
Street<-c("Pascale Street", "Jasmine Court", "24 Baker Street", "Home Close", "85 The Strand", "28 Lake Close", "38 Hightower Close BG6 7HU")
Postcode<-c("AB1 2BY", "AC2 3DF", "DF4 5TH", "FG4 8TG", "CF5 6YH", "DH7 8UJ", "38 Hightower Close BG6 7HU")
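(df<-as.data.frame(cbind(House_Number,Street,Postcode)))
I have address data spread across several fields (House_Number, Street, Postcode), and some of these fields contain information that is fully or partly duplicated in another field.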
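I plan to concatenate these fields into a single address line. However, the duplicated information means I would end up with incorrect addresses.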
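To illustrate the problem, here is a minimal sketch of a naive join on two of the rows above. The variable name `naive` is only for illustration, and `paste()` stands in for whatever join is eventually used:

```r
# Two rows from the example data, kept small for illustration
House_Number <- c("11", "24 Baker Street")
Street       <- c("Pascale Street", "24 Baker Street")
Postcode     <- c("AB1 2BY", "DF4 5TH")

# Naive join: paste the three fields together, separated by spaces
naive <- paste(House_Number, Street, Postcode)
print(naive)
# The second address contains "24 Baker Street" twice
```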
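So I need to remove the duplicated information from one of the fields. I think it would be best to keep the full information in the Street field (that is, keep the street name intact in Street rather than removing it there and leaving it in House_Number), but that is purely a preference and shouldn't make much difference. How can I do this?
Ideally, the data frame would look like this afterwards: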
House_Number<-c("11", "14 ", "", "38", "Flat 6, ", "Flat 9", "")
Street<-c("Pascale Street", "Jasmine Court", "24 Baker Street", "Home Close", "85 The Strand", "28 Lake Close", "38 Hightower Close BG6 7HU")
Postcode<-c("AB1 2BY", "AC2 3DF", "DF4 5TH", "FG4 8TG", "CF5 6YH", "DH7 8UJ", "")
(df_correct<-as.data.frame(cbind(House_Number,Street,Postcode)))
Thanks in advance.
Posted on 2020-12-03 22:20:38
You could try something like this. Partial words such as "Jas" make this a lot harder.
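library(dplyr)
library(purrr)
library(stringr)  # provides str_split(), str_detect() and str_c()

df %>%
  # as.data.frame() yields factors in R < 4.0, so convert them to character
  mutate(across(where(is.factor), as.character)) %>%
  # split every field into a list of words
  mutate(across(everything(), ~str_split(.x, " "))) %>%
  mutate(
    # drop each word that is detected inside any word of Street
    House_Number = map2(House_Number, Street, function(x,y) x[!map_lgl(x, ~any(str_detect(y, .x)))]),
    Postcode = map2(Postcode, Street, function(x,y) x[!map_lgl(x, ~any(str_detect(y, .x)))])
  ) %>%
  # paste the remaining words of each field back together
  mutate(across(everything(), ~map_chr(.x, ~str_c(.x, collapse = " "))))
This gives: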
  House_Number                     Street Postcode
1           11             Pascale Street  AB1 2BY
2           14              Jasmine Court  AC2 3DF
3                         24 Baker Street  DF4 5TH
4           38                 Home Close  FG4 8TG
5      Flat 6,              85 The Strand  CF5 6YH
6       Flat 9              28 Lake Close  DH7 8UJ
7              38 Hightower Close BG6 7HU
Note that this will not work if there are typos or differences in lower/upper case. It can also sometimes remove things you do not want removed; it really depends on the use case.
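If case differences are a concern, one option is to lower-case both sides before comparing. The following is only a sketch in base R, not part of the approach above: `drop_dupes`, `house`, `street`, `postcode`, `cleaned` and `address` are hypothetical names, and the matching is still plain substring matching, so it does nothing about typos.

```r
# Hypothetical helper: drop the words of `x` that occur (as substrings,
# ignoring case) inside any word of `y`
drop_dupes <- function(x, y) {
  x_words <- strsplit(x, " ", fixed = TRUE)[[1]]
  y_words <- tolower(strsplit(y, " ", fixed = TRUE)[[1]])
  keep <- !vapply(
    tolower(x_words),
    function(w) any(grepl(w, y_words, fixed = TRUE)),
    logical(1)
  )
  # Keep the original spelling of the surviving words
  paste(x_words[keep], collapse = " ")
}

# Small illustrative input; note the lower-case "jas"
house    <- c("14 jas", "Flat 6, 85", "11")
street   <- c("Jasmine Court", "85 The Strand", "Pascale Street")
postcode <- c("AC2 3DF", "CF5 6YH", "AB1 2BY")

# Clean House_Number against Street, row by row
cleaned <- mapply(drop_dupes, house, street, USE.NAMES = FALSE)

# Join the fields into one address line
address <- trimws(paste(cleaned, street, postcode))
print(address)
```

Using `fixed = TRUE` means words such as "6," are compared literally rather than as regular expressions, which differs from the `str_detect()` default used above.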
https://stackoverflow.com/questions/65126995