
Including linearly dependent features in glm

Stack Overflow user
Asked 2016-09-25 11:38:43
2 answers, 674 views, 0 followers, 2 votes

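I have a training set in which x columns each represent a particular stadium where a match is played. Obviously, these columns are linearly dependent in the training set, since every match must be played in one of the stadiums (so the stadium columns always sum to 1, which duplicates the intercept).

My problem, however, is that the test data I pass in may include a stadium that was not seen in the training data. I would therefore like to include all x columns when training an R glm, in such a way that the stadium coefficients average to zero. Then, if a new stadium is seen, it is essentially given the average of all the stadium coefficients.

The trouble is that the glm function seems to detect that my training set has linearly dependent columns and sets one of the coefficients to NA, so that the remaining ones are linearly independent. How do I:

Stop R from inserting an NA coefficient for one of my columns in the glm function, and ensure that all stadium coefficients sum to 0?

Some example code: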

# Past observations
outcome   = c(1  ,0  ,0  ,1  ,0  ,1  ,0  ,0  ,1  ,0  ,1  )
skill     = c(0.1,0.5,0.6,0.3,0.1,0.3,0.9,0.6,0.5,0.1,0.4)
stadium_1 = c(1  ,1  ,0  ,0  ,0  ,0  ,0  ,0  ,0  ,0  ,0  )
stadium_2 = c(0  ,0  ,1  ,1  ,1  ,1  ,1  ,0  ,0  ,0  ,0  )
stadium_3 = c(0  ,0  ,0  ,0  ,0  ,0  ,0  ,1  ,1  ,1  ,1  )

train_glm_data = data.frame(outcome, skill, stadium_1, stadium_2, stadium_3)
LR = glm(outcome ~ . - outcome, data = train_glm_data, family = binomial(link = 'logit'))
print(predict(LR, type = 'response'))

# New observations (for a new stadium we have not seen before)
skill     = c(0.1)
stadium_1 = c(0  )
stadium_2 = c(0  )
stadium_3 = c(0  )

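# outcome is left out: it is unknown for new data, and passing the 11-element
# training vector here would recycle this single row 11 times
test_glm_data = data.frame(skill, stadium_1, stadium_2, stadium_3)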
print(predict(LR, test_glm_data, type = 'response'))

# Note that in this case, the observation is simply the same as if we had observed stadium_3
# Instead I would like it to be an average of all the known stadiums coefficients
# If they all sum to 0 this is essentially already done for me
# However if not then the stadium_3 coefficient is buried somewhere in the intercept term

2 Answers

Stack Overflow user

Posted 2016-09-25 12:29:21

train_glm_data$stadium <- NA
train_glm_data$stadium[train_glm_data$stadium_1==1] <- "Stadium 1"
train_glm_data$stadium[train_glm_data$stadium_2==1] <- "Stadium 2"
train_glm_data$stadium[train_glm_data$stadium_3==1] <- "Stadium 3"
train_glm_data$stadium_1 <- NULL
train_glm_data$stadium_2 <- NULL
train_glm_data$stadium_3 <- NULL

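# Convert to a factor and reserve an extra level for stadiums not seen in training
train_glm_data$stadium <- factor(train_glm_data$stadium,
                                 levels = c("Stadium 1", "Stadium 2", "Stadium 3", "Stadium 4"))
# Add one pseudo-observation for "Stadium 4" at the average outcome and skill so that
# the level appears in the fit. Building the row as a data.frame keeps outcome and
# skill numeric. With a single row, this level's coefficient may be poorly determined.
train_glm_data <- rbind(train_glm_data, data.frame(
                    outcome = round(mean(train_glm_data$outcome)),
                    skill   = mean(train_glm_data$skill),
                    stadium = "Stadium 4"))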
LR = glm(outcome ~ stadium + skill, data = train_glm_data,  family=binomial(link='logit'))
print(predict(LR, type = 'response'))

# New observations (for a new stadium we have not seen before)
skill     = c(0.1)
stadium   = "Stadium 4"

test_glm_data = data.frame(skill, stadium)
print(predict(LR, test_glm_data, type = 'response'))

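As for the question of how to include coefficients for all levels: don't do that. It is known as the dummy variable trap. If you do not exclude a reference level, the columns of your data matrix become linearly dependent and the matrix is singular.

The only exception is if you estimate a model without an intercept. The original answer linked to further reading on the dummy variable trap.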
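
The singularity described above can be checked directly. The following is a minimal sketch, not part of the original answer, that rebuilds the question's training data and compares the rank of the design matrix with and without an intercept; the variable names `X_with_int`, `X_no_int`, `rank_with_int` and `rank_no_int` are illustrative.

```r
# Training data copied from the question
outcome   <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1)
skill     <- c(0.1, 0.5, 0.6, 0.3, 0.1, 0.3, 0.9, 0.6, 0.5, 0.1, 0.4)
stadium_1 <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
stadium_2 <- c(0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0)
stadium_3 <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)

# Design matrix with an intercept and all three stadium dummies
X_with_int <- model.matrix(~ skill + stadium_1 + stadium_2 + stadium_3)
# Same predictors, but with the intercept removed
X_no_int   <- model.matrix(~ skill + stadium_1 + stadium_2 + stadium_3 - 1)

# Rank from a QR decomposition; a rank below the number of columns
# means the columns are linearly dependent
rank_with_int <- qr(X_with_int)$rank
rank_no_int   <- qr(X_no_int)$rank

cat("with intercept:", ncol(X_with_int), "columns, rank", rank_with_int, "\n")
cat("no intercept:  ", ncol(X_no_int), "columns, rank", rank_no_int, "\n")
```

With the intercept, the three dummies add up to the intercept column, so the matrix has one more column than its rank, and glm reports NA for one coefficient. Without the intercept, the rank matches the column count.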

Votes: 1

Stack Overflow user

Posted 2016-09-25 12:29:21

To estimate coefficients for all of the dummy variables, you can add "-1" to the formula, which removes the intercept:

LR = glm(outcome ~ . - outcome - 1, data = train_glm_data, family=binomial(link='logit'))

Coefficients:

coef(LR)
#      skill  stadium_1  stadium_2  stadium_3 
# -2.8080177  0.8424053  0.7541226  1.1313135 
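
Note that the no-intercept coefficients above do not sum to zero. If the goal is the question's constraint (an intercept plus stadium effects that sum to zero), one option that neither answer mentions is sum-to-zero contrasts via `contr.sum`. The sketch below assumes the stadiums are recoded as a single factor; the names `fit_sum`, `effects` and `p_new` are illustrative.

```r
# Training data from the question, with the stadium recoded as one factor
outcome <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1)
skill   <- c(0.1, 0.5, 0.6, 0.3, 0.1, 0.3, 0.9, 0.6, 0.5, 0.1, 0.4)
stadium <- factor(c(rep("Stadium 1", 2), rep("Stadium 2", 5), rep("Stadium 3", 4)))
train   <- data.frame(outcome, skill, stadium)

# Sum-to-zero coding: the model estimates k - 1 stadium effects, and the
# effect of the last level is minus the sum of the others
fit_sum <- glm(outcome ~ skill + stadium, data = train,
               family = binomial(link = "logit"),
               contrasts = list(stadium = "contr.sum"))

b <- coef(fit_sum)  # (Intercept), skill, stadium1, stadium2

# Recover all three stadium effects; they sum to zero by construction
effects <- c(b["stadium1"], b["stadium2"], -(b["stadium1"] + b["stadium2"]))
names(effects) <- levels(stadium)
print(effects)

# For an unseen stadium, use the average stadium effect (zero), so only
# the intercept and the skill term contribute to the linear predictor
p_new <- plogis(b["(Intercept)"] + b["skill"] * 0.1)
print(unname(p_new))
```

Because `predict()` raises an error for a factor level that was absent in training, the prediction for the new stadium is computed by hand from the coefficients here.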

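For the problem of levels not seen in training, @hack-r has offered some good ideas. Another idea is to set every dummy variable of the new observation to 1/n (where n is the number of stadiums observed in training).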
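
A minimal sketch of the 1/n idea, assuming the no-intercept model from this answer and the question's training data; `new_obs`, `p_new` and `p_manual` are illustrative names. Setting each dummy to 1/n makes the linear predictor use the plain average of the stadium coefficients.

```r
# Training data copied from the question
outcome   <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1)
skill     <- c(0.1, 0.5, 0.6, 0.3, 0.1, 0.3, 0.9, 0.6, 0.5, 0.1, 0.4)
stadium_1 <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
stadium_2 <- c(0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0)
stadium_3 <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
train <- data.frame(outcome, skill, stadium_1, stadium_2, stadium_3)

# No-intercept fit, so every stadium dummy gets its own coefficient
fit <- glm(outcome ~ . - 1, data = train, family = binomial(link = "logit"))

# New observation at an unseen stadium: each dummy is set to 1/n
n_stadiums <- 3
new_obs <- data.frame(skill     = 0.1,
                      stadium_1 = 1 / n_stadiums,
                      stadium_2 = 1 / n_stadiums,
                      stadium_3 = 1 / n_stadiums)
p_new <- predict(fit, new_obs, type = "response")

# The same prediction computed by hand from the averaged stadium coefficients
b <- coef(fit)
eta <- b["skill"] * 0.1 + mean(b[c("stadium_1", "stadium_2", "stadium_3")])
p_manual <- plogis(eta)

print(unname(p_new))
print(unname(p_manual))
```

This is an unweighted average over stadiums; weighting by the number of matches per stadium would be a different design choice.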

Votes: 1
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/39686379
