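I am writing a spelling-correction function. I scraped the spelling-variants page from Wikipedia and converted it into a table. Now I want to use that table as a lookup (spellings) and replace the matching values in my document (skills.db). Note: the skills data frame below is only an example, and the second column can be ignored. I will run the spelling correction earlier in the resume-processing pipeline. The resumes themselves are large, which is why I am sharing this example instead.
--I can do this with the for loop below, but I would like to know whether there is a better solution.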
spellings = structure(list(preferred_spellings = c("organisation", "acknowledgement",
"cypher", "anaesthesia", "analyse"), other_spellings = c(" organization",
" acknowledgment", " cipher", " anesthesia", " analyze")), row.names = c(NA,
5L), class = "data.frame")
skills.db = structure(list(skills = c("variance analysis static", "analyze kpi",
"financial analysis", "variance analysis", "organizational",
"analysis", "organize", "result analysis", "analytic", "datum analysis",
"analytics", "business analysis", "organized", "quantitative analysis",
"train need analysis", "analytic think", "analysis trial preparation",
"analyze statue", "google analytics", "service analysis", "organize individual",
"account analysis", "analyze department work", "pareto analysis train",
"organization", "ratio analysis", "statistical analysis", "project organization",
"organize client's file", "with good analytic", "nielsen analytics",
"datum analytics", "textual analytics", "social analytics", "business intelligence analytics",
"market analysis", "analyse", "analytic skill", "superb analytic",
"financial statement analysis", "credit analysis", "quick analysis",
"organizational development", "outstanding financial analytic",
"organization design", "organize conference", "business analytics",
"industry analysis", "fs analysis", "analyze", "cash flow analysis",
"investment analysis", "technical analysis bloomberg", "community organize",
"monthly financial analysis", "expense variance analysis", "stock analysis"
), level1 = c("variance analysis static", "analyze kpi", "financial analysis",
"variance analysis", "organizational", "analysis", "organize",
"result analysis", "analytic", "datum analysis", "analytics",
"business analysis", "organized", "quantitative analysis", "train need analysis",
"analytic think", "analysis trial preparation", "analyze statue",
"google analytics", "service analysis", "organize individual",
"account analysis", "analyze department work", "pareto analysis train",
"organization", "ratio analysis", "statistical analysis", "project organization",
"organize client's file", "with good analytic", "nielsen analytics",
"datum analytics", "textual analytics", "social analytics", "business intelligence analytics",
"market analysis", "analyse", "analytic skill", "superb analytic",
"financial statement analysis", "credit analysis", "quick analysis",
"organizational development", "outstanding financial analytic",
"organization design", "organize conference", "business analytics",
"industry analysis", "fs analysis", "analyze", "cash flow analysis",
"investment analysis", "technical analysis bloomberg", "community organize",
"monthly financial analysis", "expense variance analysis", "stock analysis"
)), row.names = c(49L, 65L, 77L, 82L, 155L, 190L, 215L, 244L,
246L, 260L, 287L, 300L, 311L, 323L, 349L, 356L, 378L, 386L, 447L,
607L, 622L, 664L, 686L, 766L, 824L, 832L, 895L, 922L, 928L, 949L,
1020L, 1054L, 1079L, 1080L, 1081L, 1088L, 1146L, 1158L, 1228L,
1248L, 1319L, 1366L, 1385L, 1440L, 1468L, 1475L, 1509L, 1554L,
1584L, 1606L, 1635L, 1658L, 1660L, 1696L, 1760L, 1762L, 1798L
), class = "data.frame")
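library(dplyr)
# Start from the original column so each replacement builds on the previous one
skills.db = skills.db %>% mutate(TEST = skills)
for (i in seq_len(nrow(spellings))) {
  skills.db = skills.db %>% mutate(TEST = gsub(spellings$other_spellings[i], spellings$preferred_spellings[i], TEST))
}
Posted on 2021-01-04 20:09:23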
Here is one approach that uses Reduce (which could just as easily be purrr::reduce) to iterate over each spelling and correct it.
spellings_list <- asplit(spellings, 1)
skills.db %>%
  mutate(
    TEST = Reduce(function(txt, spl) gsub(spl[2], spl[1], txt), spellings_list, init = skills),
    changed = (skills != TEST)
  )
# skills level1 TEST changed
# 1 variance analysis static variance analysis static variance analysis static FALSE
# 2 analyze kpi analyze kpi analyse kpi TRUE
# 3 financial analysis financial analysis financial analysis FALSE
# 4 variance analysis variance analysis variance analysis FALSE
# 5 organizational organizational organisational TRUE
# 6 analysis analysis analysis FALSE
# 7 organize organize organize FALSE
# 8 result analysis result analysis result analysis FALSE
# 9 analytic analytic analytic FALSE
# 10 datum analysis datum analysis datum analysis FALSE
# 11 analytics analytics analytics FALSE
# 12 business analysis business analysis business analysis FALSE
# 13 organized organized organized FALSE
# 14 quantitative analysis quantitative analysis quantitative analysis FALSE
# 15 train need analysis train need analysis train need analysis FALSE
# 16 analytic think analytic think analytic think FALSE
# 17 analysis trial preparation analysis trial preparation analysis trial preparation FALSE
# 18 analyze statue analyze statue analyse statue TRUE
# 19 google analytics google analytics google analytics FALSE
# 20 service analysis service analysis service analysis FALSE
# 21 organize individual organize individual organize individual FALSE
# 22 account analysis account analysis account analysis FALSE
# 23 analyze department work analyze department work analyse department work TRUE
# 24 pareto analysis train pareto analysis train pareto analysis train FALSE
# 25 organization organization organisation TRUE
# 26 ratio analysis ratio analysis ratio analysis FALSE
# 27 statistical analysis statistical analysis statistical analysis FALSE
# 28 project organization project organization project organisation TRUE
# 29 organize client's file organize client's file organize client's file FALSE
# 30 with good analytic with good analytic with good analytic FALSE
# 31 nielsen analytics nielsen analytics nielsen analytics FALSE
# 32 datum analytics datum analytics datum analytics FALSE
# 33 textual analytics textual analytics textual analytics FALSE
# 34 social analytics social analytics social analytics FALSE
# 35 business intelligence analytics business intelligence analytics business intelligence analytics FALSE
# 36 market analysis market analysis market analysis FALSE
# 37 analyse analyse analyse FALSE
# 38 analytic skill analytic skill analytic skill FALSE
# 39 superb analytic superb analytic superb analytic FALSE
# 40 financial statement analysis financial statement analysis financial statement analysis FALSE
# 41 credit analysis credit analysis credit analysis FALSE
# 42 quick analysis quick analysis quick analysis FALSE
# 43 organizational development organizational development organisational development TRUE
# 44 outstanding financial analytic outstanding financial analytic outstanding financial analytic FALSE
# 45 organization design organization design organisation design TRUE
# 46 organize conference organize conference organize conference FALSE
# 47 business analytics business analytics business analytics FALSE
# 48 industry analysis industry analysis industry analysis FALSE
# 49 fs analysis fs analysis fs analysis FALSE
# 50 analyze analyze analyse TRUE
# 51 cash flow analysis cash flow analysis cash flow analysis FALSE
# 52 investment analysis investment analysis investment analysis FALSE
# 53 technical analysis bloomberg technical analysis bloomberg technical analysis bloomberg FALSE
# 54 community organize community organize community organize FALSE
# 55 monthly financial analysis monthly financial analysis monthly financial analysis FALSE
# 56 expense variance analysis expense variance analysis expense variance analysis FALSE
# 57 stock analysis stock analysis stock analysis FALSE
I added changed only as a litmus test, assuming you know which inputs should come out different.
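The purrr route mentioned at the top of this answer could look roughly like the sketch below. It is not the code used above: it assumes purrr is installed and uses purrr::reduce2() to walk the two spelling columns in parallel instead of splitting the frame first. The `lookup` and `skills` objects are made-up miniatures of the question's data.

```r
library(purrr)

# Hypothetical miniature versions of the question's data
lookup <- data.frame(
  preferred = c("organisation", "analyse"),
  other     = c("organization", "analyze"),
  stringsAsFactors = FALSE
)
skills <- c("analyze kpi", "project organization", "financial analysis")

# .f receives the running text vector plus one (other, preferred) pair per step;
# .init seeds the reduction with the uncorrected text
fixed <- reduce2(
  lookup$other, lookup$preferred,
  function(txt, from, to) gsub(from, to, txt),
  .init = skills
)
fixed
```

Because `.init` is supplied, the two vectors passed to reduce2() need to have the same length, which is always true for two columns of the same data frame.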
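Walk-through:
Reduce iterates over the entire skills column once for each spelling correction. The input to one iteration of its function is the output of the previous iteration, which is a necessary property here because it lets us retain the earlier changes. gsub is not vectorized over its pattern and replacement arguments, and Reduce generally prefers simple two-argument functions (which are not easily Map-able), so I broke the spellings frame into a list of length-2 vectors: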
spellings_list <- asplit(spellings, 1)
spellings_list
# $`1`
# preferred_spellings     other_spellings 
#      "organisation"      "organization" 
# 
# $`2`
# preferred_spellings     other_spellings 
#   "acknowledgement"    "acknowledgment" 
# 
# $`3`
# preferred_spellings     other_spellings 
#            "cypher"            "cipher" 
# 
# $`4`
# preferred_spellings     other_spellings 
#       "anaesthesia"        "anesthesia" 
# 
# $`5`
# preferred_spellings     other_spellings 
#           "analyse"           "analyze" 
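To see how each iteration feeds the next, one option is to ask Reduce to keep its intermediate results. The sketch below is only an illustration: `pairs` and `skills` are hypothetical stand-ins for spellings_list and the skills column.

```r
# Hypothetical miniature data, standing in for spellings_list and the skills column
pairs <- list(
  c(preferred = "organisation", other = "organization"),
  c(preferred = "analyse",      other = "analyze")
)
skills <- c("analyze organization", "variance analysis")

# accumulate = TRUE keeps every intermediate state:
# element 1 is init, element k + 1 is the text after the k-th correction
steps <- Reduce(
  function(txt, spl) gsub(spl[2], spl[1], txt),
  pairs, init = skills, accumulate = TRUE
)
steps
```

The first string still reads "analyze organisation" in the second element and only becomes "analyse organisation" in the third, because the second correction is applied to the output of the first.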
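Splitting the frame this way makes it easier to call gsub(spl[2], spl[1], ...), with the other spelling as the pattern and the preferred spelling as the replacement. The art of Reduce lies in knowing which argument goes where, and when to use init=. It really is an art. When I find myself unsure of what is being fed where, I insert a browser() at the start of the anonymous function and step through a couple of iterations of the reduction.
You might consider wrapping other_spellings with \\b on either side of the string to prevent replacements on partial matches. For example, your spellings will also change organizational, even though that word is not actually in the table. That may be what you want, but depending on your full list it can easily produce false positives (e.g., color/colour and Colorado).
(Edit: I originally had spl[1] and spl[2] swapped in gsub. Apparently there is "logic" in art, too :-)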
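A minimal sketch of the word-boundary idea, under a few assumptions: the `lookup` frame and `skills` vector are invented for illustration, the leading space seen in the question's dput() is assumed to be unintended and is trimmed, and the spellings are assumed to be plain words with no regex metacharacters that would need escaping.

```r
# Hypothetical lookup; note the leading space, as in the question's dput()
lookup <- data.frame(
  preferred = c("organisation", "colour"),
  other     = c(" organization", " color"),
  stringsAsFactors = FALSE
)
skills <- c("organization design", "organizational", "colorado office", "color theory")

# Trim stray whitespace, then anchor each pattern on word boundaries
patterns <- paste0("\\b", trimws(lookup$other), "\\b")

# Reduce over the row indices so pattern and replacement stay paired
fixed <- Reduce(
  function(txt, i) gsub(patterns[i], lookup$preferred[i], txt),
  seq_along(patterns), init = skills
)
fixed
```

With the boundaries in place, "organizational" and "colorado office" are left alone while the whole-word matches are corrected. Whether that is the desired behavior depends on whether derived forms such as organizational should follow the preferred spelling too.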
https://stackoverflow.com/questions/65568874