我试图在线性模型上调整对比度编码,在这个模型中,我想知道每个因子的水平是否与大平均值有很大的不同。
假设因子有"A“、"B”和"C“级别。最常见的控制-治疗对比明显设置为"A“作为参考水平,并将"B”和"C“与之进行比较。这不是我想要的,因为"A“级没有出现在模型摘要中。
偏差编码似乎也没有给我想要的东西,因为它将"C“级的对比度矩阵设置为[-1,-1,-1],而现在这个级别没有出现在模型摘要中。
set.seed(1)
y <- rnorm(6, 0, 1)
x <- factor(rep(LETTERS[1:3], each = 2))
fit <- lm(y ~ x, contrasts = list(x = contr.sum))
summary(fit)此外,报告的级别名称已从"A“、"B”改为"1“和"2”。
Call:
lm(formula = y ~ x, contrasts = list(x = contr.sum))
Residuals:
1 2 3 4 5 6
-0.405 0.405 -1.215 1.215 0.575 -0.575
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02902 0.46809 -0.062 0.954
x1 -0.19239 0.66198 -0.291 0.790
x2 0.40885 0.66198 0.618 0.581
Residual standard error: 1.147 on 3 degrees of freedom
Multiple R-squared: 0.1129, Adjusted R-squared: -0.4785
F-statistic: 0.1909 on 2 and 3 DF, p-value: 0.8355我是不是遗漏了什么?我是否应该添加一个等于大平均值的虚拟变量,以便我可以使用它作为引用级别?
去年,我看到了一个类似的问题(但可能要求更高),但还没有解决方案:models with 'differences from mean' for all coefficients on categorical variables; get 'contrast coding' to do it?。我已经悬赏了它,希望引起人们的注意。
发布于 2022-06-30 22:32:52
这只是一个初步的功能化版本@Zheyuan's惊人的答案。正如@浙原也提到的,这个答案也有局限性,包括复杂的模型和对比,等等。我认为如果这个因素是有序的,这个函数就不起作用。
乳头数据:
set.seed(1)
test.df <- data.frame(y = rnorm(10, 0, 1),
x = factor(rep(LETTERS[1:5], each = 2)))
test.fit <- lm(y ~ x, contrasts = list(x=contr.sum), data= test.df)返回级别统计数据的函数:
# Output deviation coded factor matrix, used within `DevContrStats()`
ContrSumMat <- function (fctr, sparse = FALSE) {
if (!is.factor(fctr)) stop("'fctr' is not a factor variable!")
N <- nlevels(fctr)
Cmat <- contr.sum(N, sparse = sparse)
dimnames(Cmat) <- list(levels(fctr), seq_len(N - 1))
Cmat
}
## `dat` is the original data
## `fit.model` is the linear model created from the data
## `fctr_col` is the colum you'd like to get all factor levels from
DevContrStats <- function(dat, fit.model, fctr_col) {
fctr.mtx <- ContrSumMat(factor(dat[, fctr_col]))
## coefficients After Contrasts
coef_a_contr <- coef(fit.model)[-1]
## coefficients Before Contrasts
## coef_bc tells how each factor level deviates from the grand mean.
coef_b_contr <- (fctr.mtx %*% coef_a_contr)[, 1]
## Covariance matrix after contrasts:
var_a_contr <- vcov(fit.model)[-1,-1]
## Transform covariance matrix (after contrasts) to the one before contrasts:
## The diagonal of var_bc gives the estimated variance of factor-level deviations.
var_b_contr <- fctr.mtx %*% var_a_contr %*% t(fctr.mtx)
## standard error of point estimate `coef_bc`
std.err_b_contr <- sqrt(diag(var_b_contr))
## t-statistics (Null Hypothesis: coef_bc = 0)
t.stats_b_contr <- coef_b_contr / std.err_b_contr
## p-values of the t-statistics
p.value_b_contr <- 2 * pt(abs(t.stats_b_contr),
df = fit.model$df.residual,
lower.tail = FALSE)
## construct a coefficient table that mimics `coef(summary(fit))`
stats.tab_b_contr <- cbind("Estimate" = coef_b_contr,
"Std. Error" = std.err_b_contr,
"t value" = t.stats_b_contr,
"Pr(>|t|)" = p.value_b_contr)
## extract statistics of the intercept, which = grand mean
intercept.stats <- coef(summary(fit.model))[1, , drop = FALSE]
## augment the coefficient table with the intercept stats
stats.tab <- rbind(intercept.stats, stats.tab_b_contr)
## print stats table with significance stars
printCoefmat(stats.tab)
}
17:20:20> DevContrStats(test.df, test.fit, 'x')
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.164183 0.245068 0.6699 0.53258
A 0.448694 0.490137 0.9154 0.40195
B -0.028986 0.490137 -0.0591 0.95513
C 0.786629 0.490137 1.6049 0.16942
D -1.582153 0.490137 -3.2280 0.02326 *
E 0.375816 0.490137 0.7668 0.47785
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1发布于 2022-06-30 18:15:27
不如做一系列的t检验,每个测试将子集的平均值与大平均值进行比较?
lapply(levels(x), \(i) t.test(y,y[x==i]))https://stackoverflow.com/questions/72820236
复制相似问题