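I am using mclust to look at the clusters in my data set, using different numbers of inputs (X, Y, Z, R and S in the script below):
For example:
elements <- cbind(X,Y,Z,R,S)
dataclust <- Mclust(elements)
I just discovered that the order of the input variables matters and affects the result; in other words, elements <- cbind(X,Y,Z,R,S) gives a different clustering from elements <- cbind(Y,Z,X,R,S). My understanding is that in cluster analysis all input variables carry the same weight and importance. Am I wrong, or is this a bug?
I have seen this in R 2.15.3 and in two other R versions.
Any comments or explanations on the above are welcome.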
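One way to make "a different clustering" precise is to compare the two label vectors up to a renaming of the cluster labels. The sketch below is only an illustration: same_partition is a hypothetical helper (not part of mclust), and iris stands in for the X, Y, Z, R, S data, which are not shown in the question. The mclust part runs only if the package is installed.

```r
# Hypothetical helper: TRUE if two label vectors describe the same partition,
# ignoring how the clusters happen to be numbered.
same_partition <- function(a, b) {
  tab <- table(a, b)
  # Each cluster of `a` must map onto exactly one cluster of `b`, and vice versa.
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

same_partition(c(1, 1, 2, 2), c(2, 2, 1, 1))  # same groups, labels swapped
same_partition(c(1, 1, 2, 2), c(1, 2, 1, 2))  # genuinely different groups

# Assumption: iris is used as stand-in data; the original X, Y, Z, R, S are unknown.
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)
  X <- iris[, 1:4]
  fit_a <- Mclust(X)                   # original column order
  fit_b <- Mclust(X[, c(3, 4, 1, 2)])  # same columns, permuted
  print(same_partition(fit_a$classification, fit_b$classification))
  # adjustedRandIndex() from mclust gives a graded measure of agreement.
  print(adjustedRandIndex(fit_a$classification, fit_b$classification))
}
```

If the two fits select a different model or a different number of components, the comparison reports a mismatch, which is the situation described in the question.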
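Posted on 2017-06-25 20:50:59
Unfortunately I cannot comment on or edit my earlier comment, so I am posting an answer. @m-dz put me on a path that I think has revealed a possible answer. Specifically: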
> library(mclust)
    __  ___________    __  _____________
   /  |/  / ____/ /   / / / / ___/_  __/
  / /|_/ / /   / /   / / / /\__ \ / /
 / /  / / /___/ /___/ /_/ /___/ // /
/_/  /_/\____/_____/\____//____//_/    version 5.2.2
Type 'citation("mclust")' for citing this R package in publications.
> testDataA <- read.table("http://fimi.ua.ac.be/data/chess.dat")
> summary(Mclust(subset(testDataA, select = c(V1, V3, V5, V7, V9, V11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EII (spherical, equal volume) model with 9 components:
log.likelihood n df BIC ICL
-3597.466 3196 63 -7703.32 -7735.137
Clustering table:
1 2 3 4 5 6 7 8 9
774 150 752 486 227 224 238 178 167
> summary(Mclust(subset(testDataA, select = c(V11, V9, V1, V3, V5, V7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EII (spherical, equal volume) model with 9 components:
log.likelihood n df BIC ICL
-3597.466 3196 63 -7703.32 -7735.137
Clustering table:
1 2 3 4 5 6 7 8 9
774 150 752 486 227 224 238 178 167
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mclust_5.2.2
loaded via a namespace (and not attached):
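[1] tools_3.3.2
As you can see, this produces two solutions that match @m-dz's! However, what I had done previously was load the psych package. I now see that this masks sim from mclust. My guess is that this leads to the incorrect solutions: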
> library(psych)
Attaching package: ‘psych’
The following object is masked from ‘package:mclust’:
sim
> testDataB <- read.file(f = "http://fimi.ua.ac.be/data/chess.dat")
Data from the .data file http://fimi.ua.ac.be/data/chess.dat has been loaded.
> summary(Mclust(subset(testDataB, select = c(X1, X3, X5, X7, X9, X11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EEV (ellipsoidal, equal volume and shape) model with 2 components:
log.likelihood n df BIC ICL
3547.068 3195 49 6698.738 6692.126
Clustering table:
1 2
2759 436
> summary(Mclust(subset(testDataB, select = c(X11, X9, X1, X3, X5, X7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EEV (ellipsoidal, equal volume and shape) model with 6 components:
log.likelihood n df BIC ICL
18473.94 3195 137 35842.37 35834.51
Clustering table:
1 2 3 4 5 6
431 932 210 881 524 217
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] psych_1.6.9 mclust_5.2.2
loaded via a namespace (and not attached):
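[1] parallel_3.3.2 tools_3.3.2 foreign_0.8-67 mnormt_1.5-5
Posted on 2013-12-06 17:26:51
Gaussian mixture model clustering is usually initialized randomly, because it only finds a local maximum.
Do not expect it to always return the same result.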
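To illustrate the general point about starting values, here is a minimal base-R sketch of EM for a two-component, one-dimensional Gaussian mixture. It is not mclust's algorithm; em_gmm_1d and the simulated data are hypothetical and exist only for this example. The fit is run from two different starting means so their log-likelihoods can be compared.

```r
# Minimal EM for a two-component 1-D Gaussian mixture (illustrative only).
em_gmm_1d <- function(x, mu, iters = 100) {
  sigma <- rep(sd(x), 2)   # start both components with the overall spread
  w <- c(0.5, 0.5)         # equal starting mixing weights
  for (i in seq_len(iters)) {
    # E-step: responsibility of each component for each point
    d1 <- w[1] * dnorm(x, mu[1], sigma[1])
    d2 <- w[2] * dnorm(x, mu[2], sigma[2])
    r1 <- d1 / (d1 + d2)
    r2 <- 1 - r1
    # M-step: re-estimate weights, means and standard deviations
    w <- c(mean(r1), mean(r2))
    mu <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
    sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
               sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
  }
  loglik <- sum(log(w[1] * dnorm(x, mu[1], sigma[1]) +
                    w[2] * dnorm(x, mu[2], sigma[2])))
  list(mu = mu, sigma = sigma, weights = w, loglik = loglik)
}

set.seed(42)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 5))  # simulated data
fit_a <- em_gmm_1d(x, mu = c(-1, 6))  # starting means near the two groups
fit_b <- em_gmm_1d(x, mu = c(2, 3))   # starting means close together
c(fit_a$loglik, fit_b$loglik)         # compare the two solutions
```

On well-separated data like this the two starts may well agree; on harder data they can end in different local maxima. Note that the seeded mclust runs elsewhere in this thread returned identical results for both seeds, so random initialization does not appear to explain the behaviour reported in the question.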
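Posted on 2017-06-24 20:20:28
Edit:
A correction to my previous edit. I had assumed that read.file was right to treat the first row as a header, but that is not the case. Apparently columns 1 to 6 give different results whether they are referred to as V1, V2, V3, V4, V5, V6 or as X1, X3, X5, X7, X9, X11. I will investigate further later.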
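The effect of header handling can be seen without downloading anything. The sketch below uses base read.table on a tiny inline table; the inline data are made up, and it assumes that read.file with its defaults behaves like header = TRUE here, which is consistent with n = 3195 versus n = 3196 in the summaries in this thread.

```r
# Made-up three-line file whose first line is data that looks like a header.
raw <- "1 3 5 7\n1 3 5 7\n2 4 6 8\n"

no_header   <- read.table(text = raw, header = FALSE)
with_header <- read.table(text = raw, header = TRUE)

names(no_header)    # default names V1..V4, all three rows kept as data
names(with_header)  # first row consumed as names and prefixed with "X"
nrow(no_header)
nrow(with_header)   # one row fewer
```

If the first row of chess.dat starts 1 3 5 7 9 11, as the names X1, X3, X5, X7, X9, X11 suggest, then those names refer to the first six columns of the file, not to columns 1, 3, 5, 7, 9 and 11. In that case the V1, V3, ... runs and the X1, X3, ... runs are fitted to different data, so their results are not directly comparable.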
library(mclust)
library(psych)
library(magrittr)
# sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
#
# Matrix products: default
#
# locale:
# [1] LC_COLLATE=English_United Kingdom.1252
# [2] LC_CTYPE=English_United Kingdom.1252
# [3] LC_MONETARY=English_United Kingdom.1252
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United Kingdom.1252
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods
# [7] base
#
# other attached packages:
# [1] magrittr_1.5 psych_1.7.5 mclust_5.3
#
# loaded via a namespace (and not attached):
# [1] compiler_3.4.0 parallel_3.4.0 tools_3.4.0
# [4] foreign_0.8-68 rstudioapi_0.6 mdaddins_0.0.0001
# [7] nlme_3.1-131 mnormt_1.5-5 grid_3.4.0
# [10] lattice_0.20-35
testData_rt <- read.table("http://fimi.ua.ac.be/data/chess.dat")
testData_rf <- read.file("http://fimi.ua.ac.be/data/chess.dat", header = FALSE) # Without this read.file is skipping first row
testData_rf_head <- read.file("http://fimi.ua.ac.be/data/chess.dat")
testData_rf_head %<>% set_names(names(testData_rf))
testData_rf_head_V2 <- read.file("http://fimi.ua.ac.be/data/chess.dat")
testData_rt %>% str()
testData_rf %>% str()
testData_rf_head %>% str()
# Same res.:
summary(Mclust(subset(testData_rt, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rt, select = c(V11, V9, V1, V3, V5, V7))))
# Same res.:
summary(Mclust(subset(testData_rf, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf, select = c(V11, V9, V1, V3, V5, V7))))
# Same res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf_head, select = c(V11, V9, V1, V3, V5, V7))))
# Different res.:
summary(Mclust(subset(testData_rf_head_V2, select = c(X1, X3, X5, X7, X9, X11))))
summary(Mclust(subset(testData_rf_head_V2, select = c(X11, X9, X1, X3, X5, X7))))
# Different res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V2, V3, V4, V5, V6))))
summary(Mclust(subset(testData_rf_head, select = c(V6, V5, V1, V2, V3, V4))))
Old answer:
I have done my best to investigate this. In code:
library(mclust)
sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
#
# other attached packages:
# [1] mclust_5.3
testData <- read.table("http://fimi.ua.ac.be/data/chess.dat")
## Seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust EII (spherical, equal volume) model with 9 components:
#
# log.likelihood n df BIC ICL
# -3597.466 3196 63 -7703.32 -7735.137
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9
# 774 150 752 486 227 224 238 178 167
set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust EII (spherical, equal volume) model with 9 components:
#
# log.likelihood n df BIC ICL
# -3597.466 3196 63 -7703.32 -7735.137
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9
# 774 150 752 486 227 224 238 178 167
## Question asked Dec 5 '13
## mclust 4.2 modified on 2013-07-19, 4.3 introduced on 2014-03-31
devtools::install_version(package = 'mclust', version = 4.2)
## Fix mclust:::unchol
# mclust:::unchol
unchol <- function(x, upper = NULL)
{
  if(is.null(upper)) {
    upper <- any(x[row(x) < col(x)])
    lower <- any(x[row(x) > col(x)])
    if(upper && lower)
      stop("not a triangular matrix")
    if(!(upper || lower)) {
      x <- diag(x)
      return(diag(x * x))
    }
  }
  dimx <- dim(x)
  storage.mode(x) <- "double"
  .Fortran("uncholf",
           as.logical(upper),
           x,
           as.integer(nrow(x)),
           as.integer(ncol(x)),
           integer(1),
           PACKAGE = "mclust")[[2]]
}
assignInNamespace("unchol", unchol, ns = "mclust")
# fixInNamespace(unchol, pos = "package:mclust")
mclust:::unchol
## Again, seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust EII (spherical, equal volume) model with 9 components:
#
# log.likelihood n df BIC ICL
# -3597.466 3196 63 -7703.32 -7735.137
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9
# 774 150 752 486 227 224 238 178 167
#
# Warning messages:
# 1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
# best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))) :
# optimal number of clusters occurs at max choice
set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust EII (spherical, equal volume) model with 9 components:
#
# log.likelihood n df BIC ICL
# -3597.466 3196 63 -7703.32 -7735.137
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9
# 774 150 752 486 227 224 238 178 167
#
# Warning messages:
# 1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
# best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))) :
# optimal number of clusters occurs at max choice
## Check R 2.15.3 from https://cran.r-project.org/bin/windows/base/old/2.15.3/
## Tried with fixing con <- gzcon(url("http://cran.rstudio.com/src/contrib/Meta/archive.rds", 'rb')), but compile...
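devtools::install_version(package = 'mclust', version = 4.2)
Edit:
The Fortran functions unchol (mclust 4.2) and uncholf (mclust 5.3) do not differ: uncholf 5.3, unchol 4.3
Mclust 4.3 does differ, but it gives the same results, so I guess the changes are just bug fixes and the like: Mclust 5.3, Mclust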
https://stackoverflow.com/questions/20392452