Some data:
example_df <- data.frame(
  url = c('blog/blah', 'blog/?utm_medium=foo', 'blah', 'subscription/apples', 'UK/something'),
  numbs = 1:5
)
lookup_df <- data.frame(
  string = c('blog', 'subscription', 'UK'),
  group = c('blog', 'subs', 'UK')
)
library(dplyr)      # provides %>%
library(fuzzyjoin)
data_combined <- example_df %>%
  fuzzy_left_join(lookup_df, by = c("url" = "string"),
                  match_fun = `%in%`)
data_combined
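                   url numbs string group
1            blog/blah     1   <NA>  <NA>
2 blog/?utm_medium=foo     2   <NA>  <NA>
3                 blah     3   <NA>  <NA>
4  subscription/apples     4   <NA>  <NA>
5         UK/something     5   <NA>  <NA>

I expected data_combined to have values for string and group wherever there is a match based on match_fun. Instead, they are all NA.
For example, the first value of string in lookup_df is 'blog'. Since I expected this to be `%in%` the first value of url in example_df ('blog/blah'), that row should be a match, with the value 'blog' in both the string and group fields.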
Posted on 2021-03-04 03:21:30
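As background for why the attempt above returns only NA: `%in%` tests whether each element is exactly equal to some element of the other vector, so it never matches on a substring. A minimal base R sketch of the difference, using values from the question (the variable names `url`, `pattern`, `exact` and `partial` are only illustrative):

```r
# Values taken from the question's example data
url <- c("blog/blah", "blah", "subscription/apples")
pattern <- "blog"

# %in% compares whole elements, so "blog/blah" is not "in" "blog"
exact <- url %in% pattern

# grepl() with fixed = TRUE checks for a literal substring instead
partial <- grepl(pattern, url, fixed = TRUE)

print(exact)    # FALSE FALSE FALSE
print(partial)  # TRUE FALSE FALSE
```

A join keyed on `%in%` therefore finds no matches in this data, whereas a match function based on substring or regex detection can.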
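If we want a partial match between the word before the '/' in 'url' and the 'string' column in 'lookup_df', we can extract that substring as a new column and then do a regex_left_join: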
library(dplyr)
library(fuzzyjoin)
library(stringr)
example_df %>%
  mutate(string = str_remove(url, "\\/.*")) %>%
  regex_left_join(lookup_df, by = 'string') %>%
  select(url, numbs, group)

-output

#                   url numbs group
#1            blog/blah     1  blog
#2 blog/?utm_medium=foo     2  blog
#3                 blah     3  <NA>
#4  subscription/apples     4  subs
#5         UK/something     5    UK

https://stackoverflow.com/questions/66463306