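I have a simple prediction problem with 12 possible features. After finding that most of the variance is captured by 7 of them (I used preProcess from the caret package), I want to build a linear model with lm using only those 7 variables.
I ran preProcess: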
pp <- preProcess(tr_1,thresh = 0.8,method = "pca")
The result was: PCA needed 7 components to capture 80 percent of the variance
The question is how to run the model/prediction using only these 7 features.
Thanks
Posted on 2015-09-23 09:12:39
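Here is a complete example of how to select a specific number of PCA components. You need to set pcaComp = 7 or use thresh = 0.8 in preProcess (if pcaComp is given, it overrides thresh), and then apply the processing to both the training and the test data, as shown below. ?preProcess has more details. If you want to use PCA together with the train method to tune a model, read my answer to a similar question in this post. Keep in mind that if you have categorical variables (factors), you need to convert them to dummy variables first, and only then apply your processing (centering, scaling, PCA, etc.). For more details on creating dummy variables, read this on the caret website.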
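Before the full example, here is a minimal sketch of the dummy-variable step just mentioned. It assumes caret's dummyVars is used for the conversion; the toy data frame df and its column names (colour, x1, x2) are hypothetical and only serve as an illustration, they are not from the question.

```r
library(caret)

# Hypothetical toy data (an assumption for illustration): one factor, two numeric predictors
set.seed(1)
df <- data.frame(
  colour = factor(rep(c("red", "green", "blue"), length.out = 50)),
  x1 = rnorm(50),
  x2 = rnorm(50)
)

# Expand the factor into 0/1 indicator columns;
# fullRank = TRUE drops one level to avoid perfectly collinear columns
dv <- dummyVars(~ ., data = df, fullRank = TRUE)
df_num <- as.data.frame(predict(dv, newdata = df))

# Every column is numeric now, so centering, scaling and PCA can be applied
pp <- preProcess(df_num, method = c("center", "scale", "pca"), thresh = 0.8)
df_pca <- predict(pp, df_num)
head(df_pca)
```

The same dv and pp objects would then be applied to any new data with predict, so that the test set goes through exactly the same transformation as the training set.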
library(caret)
library(MASS) # for the Boston dataset
data(Boston)
#number of samples and predictors (including the outcome)
dim(Boston)
#predictors names (medv is the response)
names(Boston)
#you can find more about the Boston Dataset
?Boston
#Let's split the data into train and test sets
set.seed(10457)
train_idx <- createDataPartition(Boston$medv, p = 0.75, list = FALSE)
train <- Boston[train_idx,]
test <- Boston[-train_idx,]
#Now use preProcess with either pcaComp = 7 or thresh = 0.8
#(pcaComp, if given, overrides thresh)
#center and scale before PCA; method = "pca" alone also centers and scales
#create the preProc object, remember to exclude the response (medv, column 14)
preProc <- preProcess(train[,-14],
                      method = c("center", "scale", "pca"),
                      pcaComp = 7) # or thresh = 0.8
#Apply the processing to the train and test data, and add the response
#to the dataframes
train_pca <- predict(preProc, train[,-14])
train_pca$medv <- train$medv
test_pca <- predict(preProc, test[,-14])
test_pca$medv <- test$medv
#you can verify the 7 components
> head(train_pca)
        PC1          PC2         PC3        PC4        PC5        PC6       PC7 medv
1 -2.063576  0.784975586  0.42188132 -0.4674029 -0.9208095 -0.1561148 0.2940533 24.0
2 -1.411319  0.605782852 -0.62260611  0.2258748 -0.4840448  0.3235172 0.5061220 21.6
3 -2.052144  0.514495591  0.18221545  0.9539644 -0.8148428  0.4832016 0.3699110 34.7
4 -2.596799 -0.068710981 -0.10115928  1.1308079 -0.4056899  0.6759937 0.4954385 33.4
5 -2.435048  0.032030728 -0.06201039  1.1046487 -0.5043492  0.6176695 0.5808873 36.2
6 -2.187428 -0.007289459 -0.63593163  0.6597568 -0.1828520  0.6043359 0.5659098 28.7
#Now fit your lm model, something like
fit <- lm(medv~., data = train_pca)
> fit$coefficients
(Intercept)         PC1         PC2         PC3         PC4         PC5         PC6         PC7
 22.3524934  -2.2357451   1.5531484   3.2346456   2.3612132  -1.7321590  -0.4438279  -0.2850688
By the way, next time you ask a question, try to post a reproducible example (code + data) so that people can understand the problem and help you.
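The question also asks about prediction, which the example stops short of. The sketch below repeats the same pipeline so that it runs on its own, then predicts on the held-out rows. Using postResample to summarise the error is my own choice for illustration, not something the original answer prescribes.

```r
library(caret)
library(MASS) # for the Boston dataset

data(Boston)
set.seed(10457)
train_idx <- createDataPartition(Boston$medv, p = 0.75, list = FALSE)
train <- Boston[train_idx, ]
test <- Boston[-train_idx, ]

# Same preprocessing as in the answer: estimated on the training predictors only
preProc <- preProcess(train[, -14],
                      method = c("center", "scale", "pca"),
                      pcaComp = 7)
train_pca <- predict(preProc, train[, -14])
train_pca$medv <- train$medv
test_pca <- predict(preProc, test[, -14])
test_pca$medv <- test$medv

# Linear model on the 7 principal components
fit <- lm(medv ~ ., data = train_pca)

# Predict medv for the test rows, which were projected with the training-set PCA
pred <- predict(fit, newdata = test_pca)

# Compare predictions with the observed values (RMSE, R squared, MAE)
postResample(pred = pred, obs = test_pca$medv)
```

The key point is that the test data is never used to estimate the PCA rotation: preProc is built from train and only applied to test.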
https://stackoverflow.com/questions/32714063