Q: Matching people in two groups based on the highest similarity across three categories

Stack Overflow user

Asked 2020-12-09 11:19:26

2 answers, 45 views, 0 followers, 0 votes

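This is my first question on Stack Overflow, so please bear with me.

I would like to match people from the first group to people in the second group based on the highest similarity across three categories ("Subjects", "Sectors", and "Geography"). I have provided an example to illustrate what I am looking for: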

group   Name       Subjects                            Sectors  Geography  
1       Hannah     Science, Fisheries, Policy          F, S     North
1       Zach       Policy, Energy, Marine              S, N     South   
2       Chelsea    Energy, Marine, Fisheries           S, N     South
2       Titus      Science, Fisheries, Communication   F, S, N  West

#Matches
Hannah:Titus
Zach:Chelsea

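I have searched the internet for examples of how to perform this type of matching in R, without success. The closest thing I found is a dating algorithm (https://algorithmia.com/algorithms/matching/DatingAlgorithm), but it has restrictions that prevent me from editing their example data. I have some experience with R, but not much, so any advice (especially basic advice) would be greatly appreciated. I am happy to elaborate if needed. Thanks!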

matching

2 Answers

Stack Overflow user

Answered 2020-12-09 11:31:56

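The find.matches function in the R package Hmisc looks like it could help with this type of problem: "Compares each row in x against all the rows in y, finding rows in y with all columns within a tolerance of the values in a given row of x."

The function's documentation includes the following code example. Note that find.matches operates on numeric matrices, so for your purposes the text values in x and y would first need to be encoded as numbers.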

https://www.rdocumentation.org/packages/Hmisc/versions/4.4-1/topics/find.matches

library(Hmisc)      # provides find.matches

y <- rbind(c(.1, .2),c(.11, .22), c(.3, .4), c(.31, .41), c(.32, 5))
x <- rbind(c(.09,.21), c(.29,.39))
y
x
w <- find.matches(x, y, maxmatch=5, tol=c(.05,.05))


set.seed(111)       # so can replicate results
x <- matrix(runif(500), ncol=2)
y <- matrix(runif(2000), ncol=2)
w <- find.matches(x, y, maxmatch=5, tol=c(.02,.03))
w$matches[1:5,]
w$distance[1:5,]
# Find first x with 3 or more y-matches
num.match <- apply(w$matches, 1, function(x)sum(x > 0))
j <- ((1:length(num.match))[num.match > 2])[1]
x[j,]
y[w$matches[j,],]
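
Because find.matches compares numeric columns, one possible preparation step is to turn each comma-separated text column into a 0/1 indicator matrix. The following is only a minimal sketch in base R under that assumption; the helper name to_indicators and the data frame people are hypothetical and are not part of Hmisc or of the original answer.

```r
# Example data from the question (only the Subjects column is used here)
people <- data.frame(
  Name     = c("Hannah", "Zach", "Chelsea", "Titus"),
  Subjects = c("Science, Fisheries, Policy", "Policy, Energy, Marine",
               "Energy, Marine, Fisheries", "Science, Fisheries, Communication"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one 0/1 column per distinct value found in a
# comma-separated character column
to_indicators <- function(x) {
  values <- strsplit(x, ",\\s*")            # list of character vectors, one per row
  levels <- sort(unique(unlist(values)))    # all distinct values, sorted
  m <- do.call(rbind, lapply(values, function(v) as.numeric(levels %in% v)))
  colnames(m) <- levels
  m
}

subject_matrix <- to_indicators(people$Subjects)
rownames(subject_matrix) <- people$Name
subject_matrix
```

Each row of subject_matrix is then a numeric vector, so rows from the two groups can be compared with numeric tools. Whether a tolerance-based comparison is the right notion of similarity for categorical data is a separate judgment call.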

For more robust handling of matching, you may want to explore optmatch and RItools:

https://cran.r-project.org/web/packages/optmatch/index.html

https://cran.r-project.org/web/packages/RItools/index.html

The following vignette discusses this:

https://cran.r-project.org/web/packages/optmatch/vignettes/fullmatch-vignette.pdf

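You may also find Jasjeet S. Sekhon's paper of interest (Multivariate and Propensity Score Matching Software for Causal Inference), which uses his R package Matching: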

https://cran.r-project.org/web/packages/Matching/Matching.pdf

http://sekhon.berkeley.edu/papers/MatchingJSS.pdf

http://sekhon.berkeley.edu/matching/Match.html

http://sekhon.berkeley.edu/matching/

Votes: 1

Stack Overflow user

Answered 2020-12-09 11:52:03

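This is probably not the most efficient solution, and it may not scale if the number of entries becomes large.

I created a function that counts the number of matches across the 3 columns of interest. I then generate all possible pairs and compute the pairwise distance for each one (here, the number of matches, so a higher value means more similar).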

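library(purrr)

# Example data from the question
df <- data.frame(
  group = c(1, 1, 2, 2),
  Name = c("Hannah", "Zach", "Chelsea", "Titus"),
  Subjects = c("Science, Fisheries, Policy", "Policy, Energy, Marine",
               "Energy, Marine, Fisheries", "Science, Fisheries, Communication"),
  Sectors = c("F, S", "S, N", "S, N", "F, S, N"),
  Geography = c("North", "South", "South", "West"),
  stringsAsFactors = FALSE
)

# Despite its name, this returns a similarity score: the number of shared
# subjects and sectors, plus 1 if the geography is the same
distance <- function(x, y){
  # after transpose(), lSubjects and lSectors are plain character vectors,
  # so they are compared whole (indexing with [[1]] would keep only the first value)
  dist_subjects <- length(intersect(x$lSubjects, y$lSubjects))
  dist_sectors <- length(intersect(x$lSectors, y$lSectors))
  dist_geography <- sum(x$Geography == y$Geography)
  sum(dist_subjects, dist_sectors, dist_geography)
}

psort <- function(a, b){
  # parallel sort each pair from 2 vectors and paste them together in order
  ifelse(a < b, paste0(a,":",b), paste0(b,":",a))
}

# format as list for convenience
df$lSubjects <- strsplit(df$Subjects, ", ")
df$lSectors <- strsplit(df$Sectors, ", ")

# one list per person, then every combination of two people
all_pairs <- expand.grid(first = transpose(df),
            second = transpose(df))

# filter out the pairs of someone with themselves
all_pairs <- all_pairs[!map2_lgl(all_pairs$first, all_pairs$second,
                                 ~ .x$Name == .y$Name),]
# filter out duplicate pairs (same names in different order)
all_pairs$pair_name <- map2_chr(all_pairs$first, all_pairs$second, ~psort(.x$Name,.y$Name))
all_pairs <- all_pairs[! duplicated(all_pairs$pair_name), ]

setNames(map2_int(all_pairs$first, all_pairs$second, distance),
          all_pairs$pair_name)
#>    Hannah:Zach Chelsea:Hannah   Hannah:Titus   Chelsea:Zach     Titus:Zach 
#>              2              2              4              5              2 
#>  Chelsea:Titus 
#>              3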
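
The question asks for each person in group 1 to be paired with someone in group 2, whereas the pairs above also include people from the same group. As a rough sketch of one way to get there, the base R code below scores only cross-group pairs and then tries every one-to-one assignment, keeping the one with the highest total score. This brute-force search is a swapped-in technique, not part of the original answer; the names people, similarity, and permutations are hypothetical; it assumes both groups have the same size; and it is only practical for small groups because the number of assignments grows factorially.

```r
# Example data from the question
people <- data.frame(
  group = c(1, 1, 2, 2),
  Name = c("Hannah", "Zach", "Chelsea", "Titus"),
  Subjects = c("Science, Fisheries, Policy", "Policy, Energy, Marine",
               "Energy, Marine, Fisheries", "Science, Fisheries, Communication"),
  Sectors = c("F, S", "S, N", "S, N", "F, S, N"),
  Geography = c("North", "South", "South", "West"),
  stringsAsFactors = FALSE
)

# Split one comma-separated string into its values
split_values <- function(x) strsplit(x, ",\\s*")[[1]]

# Similarity of two one-row data frames: shared subjects + shared sectors
# + 1 if the geography is the same
similarity <- function(a, b) {
  length(intersect(split_values(a$Subjects), split_values(b$Subjects))) +
    length(intersect(split_values(a$Sectors), split_values(b$Sectors))) +
    as.integer(a$Geography == b$Geography)
}

g1 <- people[people$group == 1, ]
g2 <- people[people$group == 2, ]

# Score matrix: rows are group 1, columns are group 2
sim <- sapply(seq_len(nrow(g2)), function(j)
  sapply(seq_len(nrow(g1)), function(i) similarity(g1[i, ], g2[j, ])))
sim <- matrix(sim, nrow = nrow(g1), dimnames = list(g1$Name, g2$Name))

# All orderings of a vector (recursive; fine for small inputs only)
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Total score of each one-to-one assignment; keep the best one
perms <- permutations(seq_len(nrow(g2)))
totals <- vapply(perms, function(p) sum(sim[cbind(seq_len(nrow(g1)), p)]),
                 numeric(1))
best <- perms[[which.max(totals)]]

matches <- paste(g1$Name, g2$Name[best], sep = ":")
matches
```

On the example data this pairs Hannah with Titus and Zach with Chelsea, which agrees with the matches listed in the question. For larger groups, a dedicated assignment solver would be a better fit than enumerating permutations.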

Votes: 1

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/65210155
