I have a dataset stored as a data.table DT that looks like this:
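print(DT)
             category industry
1:     administration    admin
2: nurse practitioner    truck
3:           trucking    truck
4:     administration    admin
5:        warehousing    nurse
6:        warehousing    admin
7:           trucking    truck
8: nurse practitioner    nurse
9: nurse practitioner    truck

I want to reduce the table to the rows where the industry matches the category. My general approach is to use grepl() to regex-match the string '^{{IND}}[a-z ]+$' against each row of DT$category, using infuse() to substitute the corresponding row of DT$industry for {{IND}} in the regex string. I struggled to find a slick data.table solution that iterates over the table correctly and makes the within-row comparison, so I resorted to a for loop to get the job done: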
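library(data.table)
library(infuser)  # provides infuse()

template <- "^{{IND}}[a-z ]+$"
DT[, match := FALSE]
for (i in seq_len(nrow(DT))) {
  ind <- DT$industry[i]
  categ <- DT$category[i]
  # fill this row's industry into the template, then test this row's category
  if (grepl(infuse(IND = ind, template), categ)) {
    DT[i, match := TRUE]
  }
}
DT <- DT[match == TRUE]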
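print(DT)
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse

However, I am sure this can be done better. Any suggestions on how to use the data.table package's functionality to achieve this result? My understanding is that, in this case, the package's own methods would probably be more efficient than a for loop.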

Posted on 2015-11-13 19:40:06
data.table is good at grouped operations; assuming you have many rows within the same industry, I think that is where it can help:
DT[ DT[, .I[grep(industry, category)], by = industry]$V1 ]
This uses the current idiom for subsetting by group, thanks to @eddi.
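To see what the idiom does, it can help to run the inner call on its own. The following is a minimal sketch, assuming data.table is installed; the table `toy` and its three rows are made up for illustration and are not the question's data.

```r
library(data.table)

# Hypothetical three-row table, for illustration only
toy <- data.table(
  category = c("administration", "warehousing", "trucking"),
  industry = c("admin", "admin", "truck")
)

# Inner step: within each industry group, .I holds that group's row numbers
# in the full table. grep() returns positions within the group, so
# .I[grep(...)] maps them back to row numbers of `toy` (returned as column V1).
idx <- toy[, .I[grep(industry, category)], by = industry]

# Outer step: subset the original table by those row numbers.
res <- toy[idx$V1]
print(res)
```

Because the row numbers are collected group by group, the result comes back in group order rather than in the original row order.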
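Notes. These might help further:
Group on both columns with by=.(industry,category).
Alternatives to grep (such as the options in Ken's and Richard's answers).

Posted on 2015-11-13 18:20:25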
You can use stringi::stri_detect_fixed(). It is vectorized over both str and pattern.
DT[stringi::stri_detect_fixed(category, industry)]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse
Alternatively, you can use stringr::str_detect(). It is also vectorized over both of its arguments.
library(stringr)
DT[str_detect(category, fixed(industry))]
Or, as a base R option, run grepl() through mapply():
DT[mapply(grepl, industry, category, fixed = TRUE)]
Or you can get the same result with Vectorize(grepl):
DT[Vectorize(grepl)(industry, category, fixed = TRUE)]
All of these produce the same result.
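One difference from the question's pattern may be worth keeping in mind: fixed matching finds the industry anywhere in the category, whereas the question's regex is anchored at the start of the string. The sketch below uses made-up strings (not the question's data) chosen so that the two approaches disagree, and uses base R's startsWith() (available from R 3.3.0) for the anchored case.

```r
# Hypothetical inputs; the second pair contains "truck" but does not start with it
category <- c("trucking", "food trucking", "warehousing")
industry <- c("truck", "truck", "nurse")

# Substring anywhere, compared pairwise over both vectors
anywhere <- unname(mapply(grepl, industry, category, fixed = TRUE))

# Prefix only; startsWith() is vectorized over both arguments
prefix <- startsWith(category, industry)

print(anywhere)
print(prefix)
```

On the question's sample data every matching industry sits at the start of its category, so both approaches select the same rows there.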
Data:
DT <- structure(list(category = c("administration", "nurse practitioner",
"trucking", "administration", "warehousing", "warehousing", "trucking",
"nurse practitioner", "nurse practitioner"), industry = c("admin",
"truck", "truck", "admin", "nurse", "admin", "truck", "nurse",
"truck")), .Names = c("category", "industry"), class = "data.frame", row.names = c(NA,
-9L))
setDT(DT)

Posted on 2015-11-13 18:31:08
As long as the match is always based on the start of the category string, this works well:
DT[substring(category, 1, nchar(industry)) == industry]
#              category industry
# 1:     administration    admin
# 2:           trucking    truck
# 3:     administration    admin
# 4:           trucking    truck
# 5: nurse practitioner    nurse

https://stackoverflow.com/questions/33699122