我正在尝试加入从NHTSA解码的VIN数据与车辆数据从fueleconomy.gov使用年份,制造,和模型。下面是我试图加入的数据的一个例子:
# This is the first dataframe
make <- c("PORSCHE", "TESLA", "MITSUBISHI")
model <- c("Cayenne", "Model S", "Outlander - PHEV")
year <- c(2017, 2013, 2018)
electrification_level <- (PHEV, BEV, PHEV)
vin_data <- data.frame(make, model, year, electrification_level)
# This is the second dataframe
make <- c("Porsche", "Tesla", "Mitsubishi")
# There are multiple versions of the models (an average of these would be ideal - e.g. avg. mpg)
model <- c("Cayenne S e-Hybrid", "Model S AWD - P85D", "Outlander 2WD")
year <- c(2017, 2013, 2018)
# These mpg are made up for the example
mpg <- c(75, 120, 80)
fueleconomy_data <- data.frame(make, model, year, mpg) 我面临多个问题,试图完成这个连接。
为了解答这个谜题,我参考了以下几个问题:
我还联系了fueleconomy.gov和NHTSA,看看它们是否有能力基于车辆ID连接数据,但我想问社区是否也有一个简单的解决方案。
发布于 2019-12-17 16:58:52
在您的reprex中有几个排版,所以我在下面再粘贴它。
# This is the first dataframe
make <- c("PORSCHE", "TESLA", "MITSUBISHI")
model <- c("Cayenne", "Model S", "Outlander - PHEV")
year <- c(2017, 2013, 2018)
electrification_level <- c("PHEV", "BEV", "PHEV")
vin_data <- data.frame(make, model, year, electrification_level, stringsAsFactors = FALSE)
# This is the second dataframe
make <- c("Porsche", "Tesla", "Mitsubishi")
# There are multiple versions of the models (an average of these would be ideal - e.g. avg. mpg)
model <- c("Cayenne S e-Hybrid", "Model S AWD - P85D", "Outlander 2WD")
year <- c(2017, 2013, 2018)
# These mpg are made up for the example
mpg <- c(75, 120, 80)
fueleconomy_data <- data.frame(make, model, year, mpg, stringsAsFactors = FALSE) 对于第一个问题,我只需使用toupper函数将它们全部更改为大写,然后使用完全连接。
df_joined <- vin_data %>%
full_join(fueleconomy_data %>%
dplyr::mutate(make = base::toupper(make)), by = "make")对于#2,您可以使用一些if/ use逻辑。我试过了,但你可以把它调整得心满意足。
df_joined %>%
dplyr::mutate(model_same = if_else(condition = word(model.x) == word(model.y), true = TRUE, false = FALSE))发布于 2019-12-17 17:24:34
您可以使用RecordLinkae包来获得所需的东西。您可以使用表值0.6来增加或降低文本匹配的准确性。
library(RecordLinkage)
pairs <- compare.linkage(fueleconomy_data, vin_data, strcmp = 2, exclude=c(3,4), blockfld = 1)
epiwt <- epiWeights(pairs)
epiclass <- epiClassify(epiwt, .6)
getPairs(epiclass, show="links", single.rows=T)make.1 model.1 make.2 model.2 year.2 electrification_level.2 3 MITSUBISHI Outlander 2WD MITSUBISHI Outlander - PHEV 2018 PHEV 1 PORSCHE Cayenne S e-Hybrid PORSCHE Cayenne 2017 PHEV 2 TESLA Model S AWD - P85D TESLA Model S 2013 BEV
这是为了使列大写。您需要在记录链接之前这样做。
vin_data$make <- toupper(vin_data$make) fueleconomy_data$make <- toupper(fueleconomy_data$make)
https://stackoverflow.com/questions/59378237
复制相似问题