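I don't understand the difference between the importance() function for a random forest model (randomForest package) and the importance values stored in the model object.
I fitted a simple RF classification model and tried to get the variable importance with the following code: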
rf_model$importance
0 1 MeanDecreaseAccuracy MeanDecreaseGini
X1 0.096886458 0.032546101 0.055488009 2472.172207
X2 0.030985037 0.025615202 0.027530078 1338.378297
X3 0.124302743 0.012551971 0.052402188 3091.891586
importance(rf_model)
0 1 MeanDecreaseAccuracy MeanDecreaseGini
X1 159.9149603 175.6265625 242.424683 2472.172207
X2 104.8273654 97.09338154 129.5084398 1338.378297
X3 157.0207876 86.93847182 216.6374153 3091.891586
Why do the first three columns differ between the two outputs, while MeanDecreaseGini is the same?
Posted on 2018-03-12 12:51:55
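By default (scale = TRUE), importance(rf_model) divides the permutation-based measures by their "standard errors", whereas rf_model$importance holds the raw, unscaled values. MeanDecreaseGini is not a permutation-based measure and is not scaled, so it is identical in both outputs. Consider this example: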
library(randomForest)
set.seed(4543)
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data=mtcars, ntree=1000,
keep.forest=FALSE, importance=TRUE)
mtcars.rf$importance
#output
%IncMSE IncNodePurity
cyl 7.3939431 162.38777
disp 10.0468306 257.46627
hp 7.6801388 200.22729
drat 1.0921653 65.96165
wt 9.7998328 250.94940
qsec 0.6066792 38.52055
vs 0.7048540 24.75183
am 0.6201962 17.27180
gear 0.4110634 16.33811
carb 1.0549523 27.47096
This is the same as:
importance(mtcars.rf, scale = FALSE)
%IncMSE IncNodePurity
cyl 7.3939431 162.38777
disp 10.0468306 257.46627
hp 7.6801388 200.22729
drat 1.0921653 65.96165
wt 9.7998328 250.94940
qsec 0.6066792 38.52055
vs 0.7048540 24.75183
am 0.6201962 17.27180
gear 0.4110634 16.33811
carb 1.0549523 27.47096
The default (scale = TRUE):
importance(mtcars.rf)
%IncMSE IncNodePurity
cyl 15.767986 162.38777
disp 19.885128 257.46627
hp 18.177916 200.22729
drat 7.002942 65.96165
wt 18.479239 250.94940
qsec 5.022593 38.52055
vs 4.427525 24.75183
am 6.435329 17.27180
gear 3.968845 16.33811
carb 8.207903 27.47096
Finally, dividing the unscaled values by importanceSD:
importance(mtcars.rf, scale = FALSE)[,1]/mtcars.rf$importanceSD
cyl disp hp drat wt qsec vs am gear carb
15.767986 19.885128 18.177916 7.002942 18.479239 5.022593 4.427525 6.435329 3.968845 8.207903
which is the same as importance(mtcars.rf)[,1]:
all.equal(importance(mtcars.rf, scale = FALSE)[,1]/mtcars.rf$importanceSD,
importance(mtcars.rf)[,1])
#output
TRUE
https://stackoverflow.com/questions/49235585
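The example above is a regression forest, while the question concerns a classifier. For classification, importanceSD is documented as a matrix with one column per class plus MeanDecreaseAccuracy, so the same check can be done column by column. The following is a minimal sketch under stated assumptions: iris stands in for the asker's data (which is not shown), the object names are illustrative, and the zero-SD guard is my understanding of what importance() does internally rather than something confirmed in the thread.

```r
library(randomForest)

set.seed(4543)
# iris is a stand-in dataset (assumption: the asker's data is not available)
iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500,
                        importance = TRUE)

raw <- iris.rf$importance    # unscaled values, including MeanDecreaseGini
sds <- iris.rf$importanceSD  # one column per class + MeanDecreaseAccuracy

# Avoid dividing by (near) zero; assumed to mirror importance()'s own guard
sds[sds < .Machine$double.eps] <- 1

# Scale every column except the last one (MeanDecreaseGini)
manual <- raw
acc <- seq_len(ncol(raw) - 1)
manual[, acc] <- raw[, acc] / sds

# Compare the manual scaling with the default importance() output
same_scaled <- isTRUE(all.equal(manual, importance(iris.rf)))
gini_unchanged <- identical(raw[, "MeanDecreaseGini"],
                            importance(iris.rf)[, "MeanDecreaseGini"])
print(c(same_scaled = same_scaled, gini_unchanged = gini_unchanged))
```

If only the unscaled permutation importance is wanted, importance(iris.rf, type = 1, scale = FALSE) should return just the MeanDecreaseAccuracy column.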