Question: Mclust: order of input variables affects the clustering result

Stack Overflow user

Asked 2013-12-05 13:50:48

Answers: 4 | Views: 1.2K | Followers: 0 | Votes: 5

I am using mclust to look at the clusters in my dataset, using varying numbers of input variables (X, Y, Z, R and S in the script below):

For example:

elements<-cbind(X,Y,Z,R,S)
dataclust<-Mclust(elements)

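I have just found that the order of the input variables matters and affects the result; in other words, elements <- cbind(X,Y,Z,R,S) gives a different clustering than elements <- cbind(Y,Z,X,R,S). My understanding is that in cluster analysis all input variables carry the same weight and importance. Am I wrong, or is this a bug?

I have seen this in R 2.15.3 and in two other R versions.

Any comments or explanations would be welcome.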
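
The expectation that column order should not matter can be checked on the quantities a Gaussian density is built from. The sketch below uses made-up random data (not the asker's X, Y, Z, R, S) and base R only. It shows that the Mahalanobis distances and the covariance determinant, which together determine a multivariate normal density, are unchanged when the columns are permuted. It says nothing about how a particular package initializes or fits the model.

```r
set.seed(42)  # made-up data, for illustration only
X <- matrix(rnorm(200 * 5), ncol = 5,
            dimnames = list(NULL, c("X", "Y", "Z", "R", "S")))

# Same data with the columns reordered as in the question
Xp <- X[, c("Y", "Z", "X", "R", "S")]

# Squared Mahalanobis distance of every row to the sample mean
d_orig <- mahalanobis(X,  colMeans(X),  cov(X))
d_perm <- mahalanobis(Xp, colMeans(Xp), cov(Xp))

# Both comparisons hold up to floating-point tolerance
same_distances   <- isTRUE(all.equal(d_orig, d_perm))
same_determinant <- isTRUE(all.equal(det(cov(X)), det(cov(Xp))))
```

If a fit does change with column order, the likelier suspects are therefore the data that was actually passed in, the initialization, or numerical details, rather than the model itself.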

cluster-analysis

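4 Answers

Stack Overflow user

Posted 2017-06-25 20:50:59

Unfortunately I cannot comment on or edit my earlier comment, so I am posting an answer. @m-dz set me on a path that I think has revealed a possible answer. Specifically: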

> library(mclust)
    __  ___________    __  _____________
   /  |/  / ____/ /   / / / / ___/_  __/
  / /|_/ / /   / /   / / / /\__ \ / /   
 / /  / / /___/ /___/ /_/ /___/ // /    
/_/  /_/\____/_____/\____//____//_/    version 5.2.2
Type 'citation("mclust")' for citing this R package in publications.

> testDataA <- read.table("http://fimi.ua.ac.be/data/chess.dat")

> summary(Mclust(subset(testDataA, select = c(V1, V3, V5, V7, V9, V11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EII (spherical, equal volume) model with 9 components:

 log.likelihood    n df      BIC       ICL
      -3597.466 3196 63 -7703.32 -7735.137

Clustering table:
  1   2   3   4   5   6   7   8   9 
774 150 752 486 227 224 238 178 167 

> summary(Mclust(subset(testDataA, select = c(V11, V9, V1, V3, V5, V7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EII (spherical, equal volume) model with 9 components:

 log.likelihood    n df      BIC       ICL
      -3597.466 3196 63 -7703.32 -7735.137

Clustering table:
  1   2   3   4   5   6   7   8   9 
774 150 752 486 227 224 238 178 167 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mclust_5.2.2

loaded via a namespace (and not attached):
[1] tools_3.3.2

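As you can see, this yields two solutions that match @m-dz's! However, what I had done before was load the psych package. I now see that this masks sim from mclust. I am guessing that this is what leads to the incorrect solutions: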

> library(psych)

Attaching package: ‘psych’

The following object is masked from ‘package:mclust’:

    sim

> testDataB <- read.file(f = "http://fimi.ua.ac.be/data/chess.dat")
Data from the .data file http://fimi.ua.ac.be/data/chess.dat has been loaded.

> summary(Mclust(subset(testDataB, select = c(X1, X3, X5, X7, X9, X11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EEV (ellipsoidal, equal volume and shape) model with 2 components:

 log.likelihood    n df      BIC      ICL
       3547.068 3195 49 6698.738 6692.126

Clustering table:
   1    2 
2759  436 

> summary(Mclust(subset(testDataB, select = c(X11, X9, X1, X3, X5, X7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EEV (ellipsoidal, equal volume and shape) model with 6 components:

 log.likelihood    n  df      BIC      ICL
       18473.94 3195 137 35842.37 35834.51

Clustering table:
  1   2   3   4   5   6 
431 932 210 881 524 217 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] psych_1.6.9  mclust_5.2.2

loaded via a namespace (and not attached):
[1] parallel_3.3.2 tools_3.3.2    foreign_0.8-67 mnormt_1.5-5
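
One way to probe the masking guess is to look at how R resolves a masked name. The sketch below uses base R and the stats package as a stand-in (it does not touch mclust or psych): it masks median in the global environment and then calls stats::mad, which uses median internally. Because functions inside a package look names up in their own namespace first, the internal call is not affected by the mask. Whether Mclust() calls sim at all is not something this example establishes.

```r
# Mask a stats function in the global environment, similar to how attaching
# a package can mask another package's export on the search path.
median <- function(x, ...) 0

x <- c(1, 2, 3, 4, 100)

masked    <- median(x)         # unqualified call finds the masking function
qualified <- stats::median(x)  # namespace-qualified call is unaffected
internal  <- stats::mad(x)     # mad() calls median() internally, also unaffected

rm(median)                     # remove the mask again
```

In an interactive session, conflicts(detail = TRUE) lists which names are masked, and writing mclust::Mclust(...) makes the intended function explicit.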

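Votes: 2

Stack Overflow user

Posted 2013-12-06 17:26:51

Gaussian mixture model clustering is usually initialized randomly, because it only finds a local maximum.

Do not expect it to always return the same result.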
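
To illustrate the general point about starting values, here is a minimal hand-written EM for a two-component, one-dimensional Gaussian mixture. It is a sketch, not mclust's code: the function name em_gmm_1d, the data, and the starting values are all made up for this example. As far as I know, mclust by default starts EM from a hierarchical clustering rather than from random draws, which would be consistent with the seed having no effect in the other answer.

```r
# Minimal EM for a two-component 1-D Gaussian mixture (illustrative only).
em_gmm_1d <- function(x, mu, n_iter = 200) {
  sigma <- rep(sd(x), 2)   # both components start with the overall spread
  prop  <- c(0.5, 0.5)     # equal starting mixing proportions
  for (i in seq_len(n_iter)) {
    # E-step: responsibility of each component for each point
    d1 <- prop[1] * dnorm(x, mu[1], sigma[1])
    d2 <- prop[2] * dnorm(x, mu[2], sigma[2])
    r1 <- d1 / (d1 + d2)
    r2 <- 1 - r1
    # M-step: weighted proportions, means and standard deviations
    prop  <- c(mean(r1), mean(r2))
    mu    <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
    sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
               sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
  }
  loglik <- sum(log(prop[1] * dnorm(x, mu[1], sigma[1]) +
                    prop[2] * dnorm(x, mu[2], sigma[2])))
  list(mu = mu, sigma = sigma, prop = prop, loglik = loglik)
}

set.seed(1)                                # made-up data: two separated groups
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 6))

# Poor start: identical means, so EM never separates the components
fit_same  <- em_gmm_1d(x, mu = c(mean(x), mean(x)))
# Better start: means at the extremes of the data
fit_apart <- em_gmm_1d(x, mu = c(min(x), max(x)))
```

Comparing fit_same$loglik with fit_apart$loglik shows that the two starts end at different solutions with different likelihoods, which is why the initialization scheme determines whether repeated runs agree.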

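Votes: 1

Stack Overflow user

Posted 2017-06-24 20:20:28

Edit:

A follow-up to my previous edit: it is correct that read.file treats the first row as a header, but that is not the cause. Apparently columns 1 to 6 give different results regardless of whether they are called V1, V2, V3, V4, V5, V6 or X1, X3, X5, X7, X9, X11. I will investigate further later.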

library(mclust)
library(psych)
library(magrittr)
# sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# Matrix products: default
# 
# locale:
#   [1] LC_COLLATE=English_United Kingdom.1252 
# [2] LC_CTYPE=English_United Kingdom.1252   
# [3] LC_MONETARY=English_United Kingdom.1252
# [4] LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252    
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods  
# [7] base     
# 
# other attached packages:
#   [1] magrittr_1.5 psych_1.7.5  mclust_5.3  
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.4.0    parallel_3.4.0    tools_3.4.0      
# [4] foreign_0.8-68    rstudioapi_0.6    mdaddins_0.0.0001
# [7] nlme_3.1-131      mnormt_1.5-5      grid_3.4.0       
# [10] lattice_0.20-35  

testData_rt <- read.table("http://fimi.ua.ac.be/data/chess.dat")
testData_rf <- read.file("http://fimi.ua.ac.be/data/chess.dat", header = FALSE)  # Without this read.file is skipping first row
testData_rf_head <- read.file("http://fimi.ua.ac.be/data/chess.dat")
testData_rf_head %<>%set_names(names(testData_rf))
testData_rf_head_V2 <- read.file("http://fimi.ua.ac.be/data/chess.dat")

testData_rt %>% str()
testData_rf %>% str()
testData_rf_head %>% str()

# Same res.:
summary(Mclust(subset(testData_rt, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rt, select = c(V11, V9, V1, V3, V5, V7))))

# Same res.:
summary(Mclust(subset(testData_rf, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf, select = c(V11, V9, V1, V3, V5, V7))))

# Same res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf_head, select = c(V11, V9, V1, V3, V5, V7))))

# Different res.:
summary(Mclust(subset(testData_rf_head_V2, select = c(X1, X3, X5, X7, X9, X11))))
summary(Mclust(subset(testData_rf_head_V2, select = c(X11, X9, X1, X3, X5, X7))))

# Different res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V2, V3, V4, V5, V6))))
summary(Mclust(subset(testData_rf_head, select = c(V6, V5, V1, V2, V3, V4))))
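
The header effect in the comparison above can be reproduced with base R alone. The sketch below uses three made-up rows instead of chess.dat and read.table instead of psych's read.file, so it only illustrates the mechanism: when a file without a header line is read with header = TRUE, the first data row is consumed as column names (X1, X3, X5 after R makes them syntactic) and the data frame ends up one row short.

```r
raw_lines <- c("1 3 5", "2 4 6", "1 4 5")  # made-up rows, no header line

with_header    <- read.table(text = raw_lines, header = TRUE)
without_header <- read.table(text = raw_lines, header = FALSE)

names(with_header)     # built from the first data row
names(without_header)  # default V1, V2, V3
nrow(with_header)      # one row fewer than nrow(without_header)
```

This matches n = 3195 versus n = 3196 in the summaries in the other answer. It also suggests that select = c(X1, X3, X5, X7, X9, X11) picks the first six columns of the file, which is a different subset of the data than c(V1, V3, V5, V7, V9, V11).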

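Original answer:

I have done my best to investigate this issue:

Current R (3.4.0) and mclust (5.3) tested: order and seed have no effect;
mclust 4.2 (the version current when the question was asked on Dec 5 '13): same, no effect;
R 2.15.3, mentioned by @user3068797: could not compile mclust 4.2, gave up because debugging was taking too long;
@Cody did not provide sessionInfo(), so I do not know where to dig further.

On to the code: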

library(mclust)
sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# other attached packages:
# [1] mclust_5.3

testData <- read.table("http://fimi.ua.ac.be/data/chess.dat")

## Seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
#   Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 

set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
#   Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 



## Question asked Dec 5 '13
## mclust 4.2 modified on 2013-07-19, 4.3 introduced on 2014-03-31
devtools::install_version(package = 'mclust', version = 4.2)

## Fix mclust:::unchol
# mclust:::unchol
unchol <- function(x, upper = NULL)
{
  if(is.null(upper)) {
    upper <- any(x[row(x) < col(x)])
    lower <- any(x[row(x) > col(x)])
    if(upper && lower)
      stop("not a triangular matrix")
    if(!(upper || lower)) {
      x <- diag(x)
      return(diag(x * x))
    }
  }
  dimx <- dim(x)
  storage.mode(x) <- "double"
  .Fortran("uncholf",
           as.logical(upper),
           x,
           as.integer(nrow(x)),
           as.integer(ncol(x)),
           integer(1),
           PACKAGE = "mclust")[[2]]
}
assignInNamespace("unchol", unchol, ns = "mclust")
# fixInNamespace(unchol, pos = "package:mclust")
mclust:::unchol

## Again, seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167
# 
# Warning messages:
#   1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
#   best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))) :
#   optimal number of clusters occurs at max choice

set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
# log.likelihood    n df      BIC       ICL
#      -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 
# 
# Warning messages:
#   1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
#   best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))) :
#   optimal number of clusters occurs at max choice



## Check R 2.15.3 from https://cran.r-project.org/bin/windows/base/old/2.15.3/
## Tried fixing con <- gzcon(url("http://cran.rstudio.com/src/contrib/Meta/archive.rds", 'rb')), but compile...
devtools::install_version(package = 'mclust', version = 4.2)

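Edit:

There is no difference between the Fortran functions unchol (mclust 4.2) and uncholf (mclust 5.3): uncholf 5.3, unchol 4.3

Mclust 4.3 does differ, but gives the same results, so I guess the changes were just bug fixes and the like: Mclust 5.3, Mclust

Votes: 1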

Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's IT-domain translation engine.

Original link:

https://stackoverflow.com/questions/20392452
