我正在做下面的数据帧。
File1:
UniqueId Match_ID Parameters Value
UID-245 xty-tt09-23-ert Age 28
UID-245 xty-tt09-23-ert Sex Male
UID-245 xty-tt09-23-ert Height 5.7
UID-245 xty-tt09-23-ert Balance 200
UID-246 xty-tt09-03-ert Balance 303
UID-246 xty-tt09-03-ert Weight 89File2:
UniqueId Match_ID Parameters Value
UID-245 xty-tt09-23-ert Age 29
UID-245 xty-tt09-23-ert Sex Female
UID-245 xty-tt09-23-ert Height 5.7
UID-245 xty-tt09-23-ert Balance 200
UID-246 xty-tt09-03-ert Balance 300我需要在不同的UniqueId基础上比较数据帧和所需的以下输出。
其中,我们需要将File1参数值与File2进行比较,以检查两者是否完全匹配。此外,我们需要在输出数据帧中创建一个逗号分隔的不匹配参数字符串作为Variable_Mismatched。
UniqueID Count_File1 Count_File2 Matched_Parameters Mismatched_Parameters Variable_Mismatched
UID-245 4 4 2 2 Age,Sex
UID-246 2 1 0 1 Balance发布于 2021-04-06 18:01:22
这是一种data.table方法。在代码的注释中有解释
确保您的File1和File2在data.table-format中。这可以通过使用data.table::fread()或在现有data.frames上使用data.table::setDT()读取中的文件来完成。
library(data.table)
library(tibble) #for lst-function
# Sample data -----
File1 <- fread("UniqueId Match_ID Parameters Value
UID-245 xty-tt09-23-ert Age 28
UID-245 xty-tt09-23-ert Sex Male
UID-245 xty-tt09-23-ert Height 5.7
UID-245 xty-tt09-23-ert Balance 200
UID-246 xty-tt09-03-ert Balance 303
UID-246 xty-tt09-03-ert Weight 89")
File2 <- fread("UniqueId Match_ID Parameters Value
UID-245 xty-tt09-23-ert Age 29
UID-245 xty-tt09-23-ert Sex Female
UID-245 xty-tt09-23-ert Height 5.7
UID-245 xty-tt09-23-ert Balance 200
UID-246 xty-tt09-03-ert Balance 300")
# Part 1: get counts per file & UniqueId -----
L <- tibble::lst(File1, File2)
DT <- rbindlist(L, use.names = TRUE, fill = TRUE, idcol = "File")
# Summarise by UniqueId to get file-count, the first part of the answer
ans1 <- dcast(DT[, .N, by = .(File, UniqueId)],
UniqueId ~ paste0("Count_", File),
value.var = "N")
# Part 2: Find matches and mismatches -----
ans2 <- copy(File1)[ File2, Value2 := i.Value, on = .(UniqueId, Parameters)]
# Only keep complete cases
ans2 <- ans2[!is.na(Value) & !is.na(Value2), ]
ans2 <- ans2[, .(Matched_Parameters = sum(Value == Value2),
Mismatched_Parameters = sum(!Value == Value2),
Variable_Mismatched = paste0(Parameters[!Value == Value2],
collapse = ","),
Variable_Matched = paste0(Parameters[Value == Value2],
collapse = ",")),
by = .(UniqueId)]
# Part 3: Join together -----
ans1[ans2, on = .(UniqueId)]
# UniqueId Count_File1 Count_File2 Matched_Parameters Mismatched_Parameters Variable_Mismatched Variable_Matched
# 1: UID-245 4 4 2 2 Age,Sex Height,Balance
# 2: UID-246 2 1 0 1 Balancehttps://stackoverflow.com/questions/66964378
复制相似问题