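I am working with HCUP data, which has a range of values in a single column that needs to be split into multiple rows. Here is the HCUP data frame for reference: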
code label
61000-61003 excision of CNS
0169T-0169T ventricular shunt
The expected output is:
code label
61000 excision of CNS
61001 excision of CNS
61002 excision of CNS
61003 excision of CNS
0169T ventricular shunt
My approach to this problem was to use the splitstackshape package with the following code:
library(data.table)
library(splitstackshape)
cSplit(hcup, "code", "-")[, list(code = code_1:code_2, by = label)]
This approach runs into memory problems. Is there a better way to solve this?
Some comments:
Posted on 2015-10-13 23:57:37
Here is a solution using dplyr and all.is.numeric from Hmisc:
library(dplyr)
library(Hmisc)
library(tidyr)
dat %>% separate(code, into=c("code1", "code2")) %>%
  rowwise %>%
  mutate(lists = ifelse(all.is.numeric(c(code1, code2)),
                        list(as.character(seq(from = as.numeric(code1), to = as.numeric(code2)))),
                        list(code1))) %>%
  unnest(lists) %>%
  select(code = lists, label)
Source: local data frame [5 x 2]
code label
(chr) (fctr)
1 61000 excision of CNS
2 61001 excision of CNS
3 61002 excision of CNS
4 61003 excision of CNS
5 0169T ventricular shunt
Edit to handle ranges that contain character values. This makes it a bit less simple:
library(stringr)  # provides str_pad(); extract_numeric() comes from tidyr
dff %>% mutate(row = row_number()) %>%
  separate(code, into=c("code1", "code2")) %>%
  group_by(row) %>%
  summarise(lists = if(all.is.numeric(c(code1, code2)))
              {list(str_pad(as.character(
                 seq(from = as.numeric(code1), to = as.numeric(code2))),
                 nchar(code1), pad="0"))}
            else if(grepl("^[0-9]", code1))
              {list(str_pad(paste0(as.character(
                 seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                 strsplit(code1, "[0-9]+")[[1]][2]),
                 nchar(code1), pad = "0"))}
            else
              {list(paste0(
                 strsplit(code1, "[0-9]+")[[1]],
                 str_pad(as.character(
                   seq(from = extract_numeric(code1), to = extract_numeric(code2))),
                   nchar(gsub("[^0-9]", "", code1)), pad="0")))},
            label = first(label)) %>%
  unnest(lists) %>%
  select(-row)
Source: local data frame [15 x 2]
label lists
(chr) (chr)
1 excision of CNS 61000
2 excision of CNS 61001
3 excision of CNS 61002
4 ventricular shunt 0169T
5 ventricular shunt 0170T
6 ventricular shunt 0171T
7 excision of CNS 01000
8 excision of CNS 01001
9 excision of CNS 01002
10 some procedure A2543
11 some procedure A2544
12 some procedure A2545
13 some procedure A0543
14 some procedure A0544
15 some procedure A0545
Data:
dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002",
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt",
"excision of CNS", "some procedure", "some procedure")), .Names = c("code",
"label"), row.names = c(NA, 5L), class = "data.frame")
Posted on 2015-10-14 01:27:25
Original answer: see the update below.
First, I made the example data a bit more challenging by adding the first row again at the bottom.
dff <- structure(list(code = c("61000-61003", "0169T-0169T", "61000-61003"
), label = c("excision of CNS", "ventricular shunt", "excision of CNS"
)), .Names = c("code", "label"), row.names = c(NA, 3L), class = "data.frame")
dff
# code label
# 1 61000-61003 excision of CNS
# 2 0169T-0169T ventricular shunt
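# 3 61000-61003 excision of CNS
We can use the sequence operator : to get the sequences for the code column, wrapping it in tryCatch() so that we avoid an error and keep the values that cannot be sequenced. First we split the values on the dash - and then run the result through lapply().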
xx <- lapply(
strsplit(dff$code, "-", fixed = TRUE),
function(x) tryCatch(x[1]:x[2], warning = function(w) x)
)
data.frame(code = unlist(xx), label = rep(dff$label, lengths(xx)))
# code label
# 1 61000 excision of CNS
# 2 61001 excision of CNS
# 3 61002 excision of CNS
# 4 61003 excision of CNS
# 5 0169T ventricular shunt
# 6 0169T ventricular shunt
# 7 61000 excision of CNS
# 8 61001 excision of CNS
# 9 61002 excision of CNS
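# 10 61003 excision of CNS
We try to apply the sequence operator : to each element returned by strsplit(). If x[1]:x[2] is not possible, we return the values of those elements unchanged; otherwise we proceed with the sequence x[1]:x[2]. Then we replicate the values of the label column according to the lengths of the elements of xx to get the new label column.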
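The two steps described above (coercion by : with a tryCatch() fallback, then replicating labels by lengths()) can be sketched on a tiny example. This is only an illustration under assumptions: the names codes, labels, and expand_range are hypothetical and do not appear in the answer.

```r
# Illustrative values only; `codes`, `labels` and `expand_range` are
# hypothetical names, not part of the original answer.
codes  <- c("61000-61003", "0169T-0169T")
labels <- c("excision of CNS", "ventricular shunt")

parts <- strsplit(codes, "-", fixed = TRUE)

expand_range <- function(x) {
  # `:` coerces character operands to numbers. A non-numeric code such as
  # "0169T" triggers a coercion warning, which the handler turns into
  # "return both endpoints unchanged".
  tryCatch(x[1]:x[2], warning = function(w) x)
}

expanded <- lapply(parts, expand_range)
n <- lengths(expanded)  # how many codes each input row produced

result <- data.frame(
  code  = unlist(expanded),  # mixing integers and strings yields character
  label = rep(labels, n)     # repeat each label once per produced code
)
print(n)
print(result)
```

Note that in this sketch a non-numeric row is kept as its two endpoints, so a range such as 0169T-0171T would not be expanded; that case needs the extra handling of prefixes and suffixes.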
Update: here is what I came up with in response to your edit. Replace xx above with
xx <- lapply(strsplit(dff$code, "-", TRUE), function(x) {
    s <- stringi::stri_locate_first_regex(x, "[A-Z]")
    nc <- nchar(x)[1L]
    fmt <- function(n) paste0("%0", n, "d")
    if(!all(is.na(s))) {
        ss <- s[1,1]
        fmt <- fmt(nc-1)
        if(ss == 1L) {
            xx <- substr(x, 2, nc)
            paste0(substr(x, 1, 1), sprintf(fmt, xx[1]:xx[2]))
        } else {
            xx <- substr(x, 1, ss-1)
            paste0(sprintf(fmt, xx[1]:xx[2]), substr(x, nc, nc))
        }
    } else {
        sprintf(fmt(nc), x[1]:x[2])
    }
})
Yes, it's complicated. Now, if we take the following data frame df2 as a test case
df2 <- structure(list(code = c("61000-61003", "0169T-0174T", "61000-61003",
"T0169-T0174"), label = c("excision of CNS", "ventricular shunt",
"excision of CNS", "ventricular shunt")), .Names = c("code",
"label"), row.names = c(NA, 4L), class = "data.frame")
and run the xx code from above (with df2 in place of dff), we get the following result.
data.frame(code = unlist(xx), label = rep(df2$label, lengths(xx)))
# code label
# 1 61000 excision of CNS
# 2 61001 excision of CNS
# 3 61002 excision of CNS
# 4 61003 excision of CNS
# 5 0169T ventricular shunt
# 6 0170T ventricular shunt
# 7 0171T ventricular shunt
# 8 0172T ventricular shunt
# 9 0173T ventricular shunt
# 10 0174T ventricular shunt
# 11 61000 excision of CNS
# 12 61001 excision of CNS
# 13 61002 excision of CNS
# 14 61003 excision of CNS
# 15 T0169 ventricular shunt
# 16 T0170 ventricular shunt
# 17 T0171 ventricular shunt
# 18 T0172 ventricular shunt
# 19 T0173 ventricular shunt
# 20 T0174 ventricular shunt
Posted on 2015-10-14 18:33:39
Create a sequencing rule for this kind of code:
seq_code <- function(from,to){
  # split a code into optional prefix (part 1), digits (part 2), optional suffix (part 3)
  ext = function(x, part) gsub("([^0-9]?)([0-9]*)([^0-9]?)", paste0("\\",part), x)
  pre = unique(sapply(list(from,to), ext, part = 1 ))
  suf = unique(sapply(list(from,to), ext, part = 3 ))
  # refuse ranges whose endpoints differ in prefix or suffix
  if (length(pre) > 1 || length(suf) > 1){
    return("NO!")
  }
  num = do.call(seq, lapply(list(from,to), function(x) as.integer(ext(x, part = 2))))
  # zero-pad the numeric part back to its original width
  len = nchar(from)-nchar(pre)-nchar(suf)
  paste0(pre, sprintf(paste0("%0",len,"d"), num), suf)
}
Using @jeremycg's example:
library(data.table)
setDT(dff)[,.(
  label = label[1],
  code = do.call(seq_code, tstrsplit(code,'-'))
), by=.(row=seq(nrow(dff)))]
which gives
row label code
1: 1 excision of CNS 61000
2: 1 excision of CNS 61001
3: 1 excision of CNS 61002
4: 2 ventricular shunt 0169T
5: 2 ventricular shunt 0170T
6: 2 ventricular shunt 0171T
7: 3 excision of CNS 01000
8: 3 excision of CNS 01001
9: 3 excision of CNS 01002
10: 4 some procedure A2543
11: 4 some procedure A2544
12: 4 some procedure A2545
13: 5 some procedure A0543
14: 5 some procedure A0544
15: 5 some procedure A0545
Data copied from @jeremycg's answer:
dff <- structure(list(code = c("61000-61002", "0169T-0171T", "01000-01002",
"A2543-A2545", "A0543-A0545"), label = c("excision of CNS", "ventricular shunt",
"excision of CNS", "some procedure", "some procedure")), .Names = c("code",
"label"), row.names = c(NA, 5L), class = "data.frame")
https://stackoverflow.com/questions/33113263