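I have a relatively complex table merging/expansion problem. Below I have included an example DATA table and the desired RESULT table. I have 4 factors (SITE, DATE, SAMPLE, TAXA) and three numeric columns (1, 2, 3). I need every SITE, DATE, and SAMPLE combination to have TAXA 1, 2, 100, and 150. In the process, I need to fill the empty factor cells with the appropriate information and fill the numeric columns with 0.
I apologize for the large "example" datasets, but they capture the complexity of my data. My full dataset is somewhat larger, with 4 SITEs, 15 DATEs, 12 SAMPLEs, and 167 TAXA. A dplyr solution is preferred, but I am certainly open to other options. Doing this in Excel would take a coon's age! Thanks in advance.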
DATA
SITE DATE SAMPLE TAXA 1 2 3
NSV 8-Jul-13 Pool 1 10 10 10
NSV 8-Jul-13 Pool 2 10 10 10
NSV 8-Jul-13 Riffle 1 10 10 10
NSV 8-Jul-13 Riffle 2 10 10 10
NSV 23-Oct-13 Pool 1 10 10 10
NSV 23-Oct-13 Pool 2 10 10 10
NSV 23-Oct-13 Riffle 1 10 10 10
NSV 23-Oct-13 Riffle 2 10 10 10
SFP 4-Jul-13 Pool 1 10 10 10
SFP 4-Jul-13 Pool 2 10 10 10
SFP 4-Jul-13 Riffle 1 10 10 10
SFP 4-Jul-13 Riffle 2 10 10 10
SFP 27-Oct-13 Pool 1 10 10 10
SFP 27-Oct-13 Pool 2 10 10 10
SFP 27-Oct-13 Pool 100 10 10 10
SFP 27-Oct-13 Pool 150 10 10 10
SFP 27-Oct-13 Riffle 1 10 10 10
SFP 27-Oct-13 Riffle 2 10 10 10
SFP 27-Oct-13 Riffle 100 10 10 10
SFP 27-Oct-13 Riffle 150 10 10 10
RESULT
SITE DATE SAMPLE TAXA 1 2 3
NSV 8-Jul-13 Pool 1 10 10 10
NSV 8-Jul-13 Pool 2 10 10 10
NSV 8-Jul-13 Pool 100 0 0 0
NSV 8-Jul-13 Pool 150 0 0 0
NSV 8-Jul-13 Riffle 1 10 10 10
NSV 8-Jul-13 Riffle 2 10 10 10
NSV 8-Jul-13 Riffle 100 0 0 0
NSV 8-Jul-13 Riffle 150 0 0 0
NSV 23-Oct-13 Pool 1 10 10 10
NSV 23-Oct-13 Pool 2 10 10 10
NSV 23-Oct-13 Pool 100 0 0 0
NSV 23-Oct-13 Pool 150 0 0 0
NSV 23-Oct-13 Riffle 1 10 10 10
NSV 23-Oct-13 Riffle 2 10 10 10
NSV 23-Oct-13 Riffle 100 0 0 0
NSV 23-Oct-13 Riffle 150 0 0 0
SFP 4-Jul-13 Pool 1 10 10 10
SFP 4-Jul-13 Pool 2 10 10 10
SFP 4-Jul-13 Pool 100 0 0 0
SFP 4-Jul-13 Pool 150 0 0 0
SFP 4-Jul-13 Riffle 1 10 10 10
SFP 4-Jul-13 Riffle 2 10 10 10
SFP 4-Jul-13 Riffle 100 0 0 0
SFP 4-Jul-13 Riffle 150 0 0 0
SFP 27-Oct-13 Pool 1 10 10 10
SFP 27-Oct-13 Pool 2 10 10 10
SFP 27-Oct-13 Pool 100 10 10 10
SFP 27-Oct-13 Pool 150 10 10 10
SFP 27-Oct-13 Riffle 1 10 10 10
SFP 27-Oct-13 Riffle 2 10 10 10
SFP 27-Oct-13 Riffle 100 10 10 10
SFP 27-Oct-13 Riffle 150 10 10 10
Answered 2015-04-23 03:35:07
This is a non-dplyr solution. I'm sure there are more elegant ways, but here is a base R approach. I'm calling your input data.frame d.
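The answer does not show how d was created. As a minimal sketch, assuming the question's DATA table is simply pasted in as text, it could be read with read.table. Because the last three column headers are numbers, R's default name checking turns them into X1, X2 and X3, which would explain the column names in the output further down.

```r
# Assumption: the DATA table from the question is pasted in as text.
# The answer itself does not say how d was built.
d <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
SITE DATE SAMPLE TAXA 1 2 3
NSV 8-Jul-13 Pool 1 10 10 10
NSV 8-Jul-13 Pool 2 10 10 10
NSV 8-Jul-13 Riffle 1 10 10 10
NSV 8-Jul-13 Riffle 2 10 10 10
NSV 23-Oct-13 Pool 1 10 10 10
NSV 23-Oct-13 Pool 2 10 10 10
NSV 23-Oct-13 Riffle 1 10 10 10
NSV 23-Oct-13 Riffle 2 10 10 10
SFP 4-Jul-13 Pool 1 10 10 10
SFP 4-Jul-13 Pool 2 10 10 10
SFP 4-Jul-13 Riffle 1 10 10 10
SFP 4-Jul-13 Riffle 2 10 10 10
SFP 27-Oct-13 Pool 1 10 10 10
SFP 27-Oct-13 Pool 2 10 10 10
SFP 27-Oct-13 Pool 100 10 10 10
SFP 27-Oct-13 Pool 150 10 10 10
SFP 27-Oct-13 Riffle 1 10 10 10
SFP 27-Oct-13 Riffle 2 10 10 10
SFP 27-Oct-13 Riffle 100 10 10 10
SFP 27-Oct-13 Riffle 150 10 10 10
")

# The numeric headers 1, 2, 3 are not valid R names, so they become X1, X2, X3
names(d)
```

With d in place, the steps are: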
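# Cross every observed SITE/DATE pair with every SAMPLE and every TAXA
d2 <- expand.grid(apply(unique(d[,c("SITE","DATE")]), 1, paste, collapse=" "),
unique(d$SAMPLE), unique(d$TAXA))
# Split the pasted "SITE DATE" key back into two separate columns
d2 <- cbind(matrix(unlist(strsplit(as.character(d2$Var1), " ")), ncol=2, byrow=TRUE),
d2[,2:3])
names(d2)<-names(d)[1:4]
# Keep every combination; rows that are absent from d get NA in the numeric columns
d2 <- merge(d2,d, all.x=TRUE)
# Replace those NAs with 0
d2[which(is.na(d2), arr.ind=TRUE)] <- 0
Output: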
SITE DATE SAMPLE TAXA X1 X2 X3
1 NSV 23-Oct-13 Pool 1 10 10 10
2 NSV 23-Oct-13 Pool 2 10 10 10
3 NSV 23-Oct-13 Pool 100 0 0 0
4 NSV 23-Oct-13 Pool 150 0 0 0
5 NSV 23-Oct-13 Riffle 1 10 10 10
6 NSV 23-Oct-13 Riffle 2 10 10 10
7 NSV 23-Oct-13 Riffle 100 0 0 0
8 NSV 23-Oct-13 Riffle 150 0 0 0
9 NSV 8-Jul-13 Pool 1 10 10 10
10 NSV 8-Jul-13 Pool 2 10 10 10
11 NSV 8-Jul-13 Pool 100 0 0 0
12 NSV 8-Jul-13 Pool 150 0 0 0
13 NSV 8-Jul-13 Riffle 1 10 10 10
14 NSV 8-Jul-13 Riffle 2 10 10 10
15 NSV 8-Jul-13 Riffle 100 0 0 0
16 NSV 8-Jul-13 Riffle 150 0 0 0
17 SFP 27-Oct-13 Pool 1 10 10 10
18 SFP 27-Oct-13 Pool 2 10 10 10
19 SFP 27-Oct-13 Pool 100 10 10 10
20 SFP 27-Oct-13 Pool 150 10 10 10
21 SFP 27-Oct-13 Riffle 1 10 10 10
22 SFP 27-Oct-13 Riffle 2 10 10 10
23 SFP 27-Oct-13 Riffle 100 10 10 10
24 SFP 27-Oct-13 Riffle 150 10 10 10
25 SFP 4-Jul-13 Pool 1 10 10 10
26 SFP 4-Jul-13 Pool 2 10 10 10
27 SFP 4-Jul-13 Pool 100 0 0 0
28 SFP 4-Jul-13 Pool 150 0 0 0
29 SFP 4-Jul-13 Riffle 1 10 10 10
30 SFP 4-Jul-13 Riffle 2 10 10 10
31 SFP 4-Jul-13 Riffle 100 0 0 0
32 SFP 4-Jul-13 Riffle 150 0 0 0
Answered 2015-04-23 03:26:51
Starting with your data:
dat <- structure(list(SITE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
.Label = c("NSV", "SFP"), class = "factor"),
DATE = structure(c(4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
.Label = c("23-Oct-13", "27-Oct-13", "4-Jul-13", "8-Jul-13"
), class = "factor"),
SAMPLE = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Pool", "Riffle"), class = "factor"),
TAXA = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 100L, 150L, 1L, 2L, 100L, 150L),
v1 = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L),
v2 = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L),
v3 = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L)),
.Names = c("SITE", "DATE", "SAMPLE", "TAXA", "v1", "v2", "v3"),
class = "data.frame", row.names = c(NA, -20L))
One technique, using dplyr:
library(dplyr)
eg <- do.call('expand.grid', lapply(dat[,1:4], unique))
result <- right_join(dat, eg, by=c('SITE', 'DATE', 'SAMPLE', 'TAXA')) %>%
mutate(v1 = ifelse(is.na(v1), 0, v1),
v2 = ifelse(is.na(v2), 0, v2),
v3 = ifelse(is.na(v3), 0, v3)) %>%
arrange(SITE, DATE, SAMPLE, TAXA)
head(result, n=8)
## SITE DATE SAMPLE TAXA v1 v2 v3
## 1 NSV 23-Oct-13 Pool 1 10 10 10
## 2 NSV 23-Oct-13 Pool 2 10 10 10
## 3 NSV 23-Oct-13 Pool 100 0 0 0
## 4 NSV 23-Oct-13 Pool 150 0 0 0
## 5 NSV 23-Oct-13 Riffle 1 10 10 10
## 6 NSV 23-Oct-13 Riffle 2 10 10 10
## 7 NSV 23-Oct-13 Riffle 100 0 0 0
## 8 NSV 23-Oct-13 Riffle 150 0 0 0
The arrange call is only there to order the rows the way your result does; the data is complete without it.
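The three near-identical ifelse lines can be condensed. The following is a sketch rather than part of the original answer: it assumes a dplyr version that provides across() (1.0.0 or later) and rebuilds the question's 20 rows with plain character columns instead of factors, so that it runs on its own.

```r
library(dplyr)

# The question's DATA table, rebuilt compactly (character columns, not factors)
dat <- data.frame(
  SITE   = rep(c("NSV", "SFP"), c(8, 12)),
  DATE   = rep(c("8-Jul-13", "23-Oct-13", "4-Jul-13", "27-Oct-13"), c(4, 4, 4, 8)),
  SAMPLE = c(rep(rep(c("Pool", "Riffle"), each = 2), 3),
             rep(c("Pool", "Riffle"), each = 4)),
  TAXA   = c(rep(c(1L, 2L), 6), rep(c(1L, 2L, 100L, 150L), 2)),
  v1 = 10L, v2 = 10L, v3 = 10L,
  stringsAsFactors = FALSE
)

# Every combination of the unique values of the four key columns
eg <- expand.grid(lapply(dat[, 1:4], unique), stringsAsFactors = FALSE)

result <- dat %>%
  right_join(eg, by = c("SITE", "DATE", "SAMPLE", "TAXA")) %>%
  # coalesce() swaps each NA for 0L, one call for all three value columns
  mutate(across(v1:v3, ~ coalesce(.x, 0L))) %>%
  arrange(SITE, DATE, SAMPLE, TAXA)
```

The join logic is unchanged; only the NA-to-zero step is written once instead of three times, which matters more as the number of value columns grows.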
Edit
I realized I had too many rows in the resulting data.frame. Based on @Frank's comment, this is more correct and more compact (arrange is still optional):
dat %>% select(SITE, DATE, SAMPLE) %>% unique() %>%
merge(y=list(TAXA=unique(dat$TAXA)), all.x=TRUE) %>%
arrange(SITE, DATE, SAMPLE, TAXA)
## SITE DATE SAMPLE TAXA
## 1 NSV 23-Oct-13 Pool 1
## 2 NSV 23-Oct-13 Pool 2
## 3 NSV 23-Oct-13 Pool 100
## 4 NSV 23-Oct-13 Pool 150
## 5 NSV 23-Oct-13 Riffle 1
## 6 NSV 23-Oct-13 Riffle 2
## 7 NSV 23-Oct-13 Riffle 100
## 8 NSV 23-Oct-13 Riffle 150
## ...snip...
https://stackoverflow.com/questions/29812860
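The snippet in the edit yields only the four key columns. One way to carry it through to the requested RESULT, sketched here under my own assumptions rather than taken from the answer, is to join the original values back on and zero-fill what is missing. The names keys and filled are illustrative, across() needs dplyr 1.0.0 or later, and the data is rebuilt with character columns so the block runs on its own.

```r
library(dplyr)

# The question's DATA table, rebuilt compactly (character columns, not factors)
dat <- data.frame(
  SITE   = rep(c("NSV", "SFP"), c(8, 12)),
  DATE   = rep(c("8-Jul-13", "23-Oct-13", "4-Jul-13", "27-Oct-13"), c(4, 4, 4, 8)),
  SAMPLE = c(rep(rep(c("Pool", "Riffle"), each = 2), 3),
             rep(c("Pool", "Riffle"), each = 4)),
  TAXA   = c(rep(c(1L, 2L), 6), rep(c(1L, 2L, 100L, 150L), 2)),
  v1 = 10L, v2 = 10L, v3 = 10L,
  stringsAsFactors = FALSE
)

# Observed SITE/DATE/SAMPLE combinations crossed with every TAXA.
# merge() with no shared column names returns the Cartesian product.
keys <- dat %>%
  distinct(SITE, DATE, SAMPLE) %>%
  merge(y = data.frame(TAXA = unique(dat$TAXA)))

# Bring the measured values back in and turn the gaps into zeros
filled <- keys %>%
  left_join(dat, by = c("SITE", "DATE", "SAMPLE", "TAXA")) %>%
  mutate(across(v1:v3, ~ coalesce(.x, 0L))) %>%
  arrange(SITE, DATE, SAMPLE, TAXA)
```

On the example data this gives 32 rows, the same count as the RESULT table in the question, because only SITE/DATE/SAMPLE combinations that actually occur are expanded.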