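I have a data.table with a column containing occupation names. I want to find duplicate occupations that are written with their words in reverse order (for example, "writer advertisings" and "advertisings writer"). Below is a simplified version of my data and the result I want to get.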
data = data.table(
ID = as.character(c("advertisings writer","writer advertisings","setter","drill setter","setter drill","agent claims","claims agent","engineer"))
)
data_result = data.table(
ID = as.character(c("advertisings writer","setter","drill setter","agent claims","engineer"))
)
This is the code I have been using.
data[,b:= strsplit(ID," ")]
data <- data[,.(b=unlist(b)),by = setdiff(names(data),'b')]
setorderv(data,cols=c("ID","b"))
data <- data[,bb:=list(list(unique(b))),by="ID"][,.SD[1],by=c("ID"),.SDcols=c("bb")]
data[,b:=lapply(bb,paste,collapse=' ')]
data[,b:=unlist(b)]
unique(data,by="b")
Since I am working with a fairly large dataset, this approach is very time-consuming.
Thanks
Posted on 2021-04-05 20:46:00
A possible data.table solution
library(data.table)
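# sort the words of each ID, then paste them back together;
# lapply is used for the inner step because sapply would simplify to a matrix when every ID has the same number of words
data[,ID:=sapply(lapply(stringr::str_split(ID,' '),sort),paste,collapse=' ')]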
unique(data)
ID
1: advertisings writer
2: setter
3: drill setter
4: agent claims
5: engineer
Posted on 2021-04-05 23:03:40
Here is an igraph option
library(dplyr)
library(igraph)
data[, TO := gsub("(\\w+)\\s(\\w+)", "\\2 \\1", ID)] %>%
graph_from_data_frame(directed = FALSE) %>%
get.data.frame() %>%
unique() %>%
subset(select = from)
This gives:
from
1 advertisings writer
3 setter
4 drill setter
6 agent claims
8 engineer
https://stackoverflow.com/questions/66959464
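The sorted-words idea from the data.table answer can also be applied without overwriting ID, by building a separate order-insensitive key and keeping the first row for each key. The following is only a minimal sketch in base R: `word_key` is a hypothetical helper name that does not appear in either answer, and it assumes words are separated by single spaces as in the sample data.

```r
# Hypothetical helper (not from the original answers): build an
# order-insensitive key by sorting the words of each string.
word_key <- function(x) {
  # split on single spaces; fixed = TRUE skips regex matching
  parts <- strsplit(x, " ", fixed = TRUE)
  # sort the words of each element and glue them back together
  vapply(parts, function(w) paste(sort(w), collapse = " "), character(1))
}

# sample IDs copied from the question
ids <- c("advertisings writer", "writer advertisings", "setter",
         "drill setter", "setter drill", "agent claims",
         "claims agent", "engineer")

# keep the first occurrence of each key, with its original spelling
result <- ids[!duplicated(word_key(ids))]
print(result)
```

On the question's table the same filter could presumably be written as `data[!duplicated(word_key(ID))]`, which leaves the ID column untouched. Whether this is fast enough on a large dataset has not been measured here.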