Mclust: order of the input parameters affects the clustering results

Stack Overflow user
Asked 2013-12-05 13:50:48
4 answers · 1.2K views · 0 followers · score 5

I am using mclust to look at the various clusters in my dataset, using different numbers of inputs (X, Y, Z, R and S in the script below).

For example:

elements<-cbind(X,Y,Z,R,S)
dataclust<-Mclust(elements)

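I have just found that the order of the input parameters matters and affects the result; in other words, elements <- cbind(X,Y,Z,R,S) gives a different clustering from elements <- cbind(Y,Z,X,R,S). My understanding is that in cluster analysis all input parameters carry the same weight and importance. Am I wrong, or is this a bug?

I have seen this in R 2.15.3 and in two other R versions.

Any comments or explanations on the above are welcome.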
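
As background to the expectation stated above: the multivariate Gaussian log-density itself does not change when the variables are reordered, provided the mean and covariance are reordered in the same way. The base-R sketch below checks this numerically for one made-up point; `log_dmvn`, the covariance matrix and the permutation are illustrative assumptions, not part of mclust. It says nothing about initialization or floating-point effects, which can still differ in practice.

```r
# Log-density of a multivariate normal, written out by hand (illustrative helper).
log_dmvn <- function(x, mu, Sigma) {
  d <- length(mu)
  dev <- x - mu
  log_det <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
  -0.5 * (d * log(2 * pi) + log_det + sum(dev * solve(Sigma, dev)))
}

set.seed(1)
A <- matrix(rnorm(9), 3)
Sigma <- crossprod(A) + diag(3)   # an arbitrary positive-definite covariance
mu <- c(1, -2, 0.5)
x <- c(0.3, 0.1, -1)
p <- c(3, 1, 2)                   # an arbitrary reordering of the variables

a <- log_dmvn(x, mu, Sigma)
b <- log_dmvn(x[p], mu[p], Sigma[p, p])
print(all.equal(a, b))
```

If the two values agree, any order dependence seen in a fitted model has to come from somewhere other than the likelihood, for example the starting values, numerical ties, or the data actually passed in.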

4 Answers

Stack Overflow user

Answered 2017-06-25 20:50:59

Unfortunately I cannot comment on or edit my previous comment, so I am posting an answer. @m-dz put me on a path that I think has revealed a likely answer. Specifically:

> library(mclust)
    __  ___________    __  _____________
   /  |/  / ____/ /   / / / / ___/_  __/
  / /|_/ / /   / /   / / / /\__ \ / /   
 / /  / / /___/ /___/ /_/ /___/ // /    
/_/  /_/\____/_____/\____//____//_/    version 5.2.2
Type 'citation("mclust")' for citing this R package in publications.

> testDataA <- read.table("http://fimi.ua.ac.be/data/chess.dat")

> summary(Mclust(subset(testDataA, select = c(V1, V3, V5, V7, V9, V11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EII (spherical, equal volume) model with 9 components:

 log.likelihood    n df      BIC       ICL
      -3597.466 3196 63 -7703.32 -7735.137

Clustering table:
  1   2   3   4   5   6   7   8   9 
774 150 752 486 227 224 238 178 167 

> summary(Mclust(subset(testDataA, select = c(V11, V9, V1, V3, V5, V7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EII (spherical, equal volume) model with 9 components:

 log.likelihood    n df      BIC       ICL
      -3597.466 3196 63 -7703.32 -7735.137

Clustering table:
  1   2   3   4   5   6   7   8   9 
774 150 752 486 227 224 238 178 167 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mclust_5.2.2

loaded via a namespace (and not attached):
[1] tools_3.3.2

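As you can see, this produces two solutions that match @m-dz's! What I had done previously, however, was load the psych package first. I now see that this masks sim from mclust. I am guessing that this is what leads to the incorrect solutions:
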
> library(psych)

Attaching package: ‘psych’

The following object is masked from ‘package:mclust’:

    sim

> testDataB <- read.file(f = "http://fimi.ua.ac.be/data/chess.dat")
Data from the .data file http://fimi.ua.ac.be/data/chess.dat has been loaded.

> summary(Mclust(subset(testDataB, select = c(X1, X3, X5, X7, X9, X11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EEV (ellipsoidal, equal volume and shape) model with 2 components:

 log.likelihood    n df      BIC      ICL
       3547.068 3195 49 6698.738 6692.126

Clustering table:
   1    2 
2759  436 

> summary(Mclust(subset(testDataB, select = c(X11, X9, X1, X3, X5, X7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EEV (ellipsoidal, equal volume and shape) model with 6 components:

 log.likelihood    n  df      BIC      ICL
       18473.94 3195 137 35842.37 35834.51

Clustering table:
  1   2   3   4   5   6 
431 932 210 881 524 217 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] psych_1.6.9  mclust_5.2.2

loaded via a namespace (and not attached):
[1] parallel_3.3.2 tools_3.3.2    foreign_0.8-67 mnormt_1.5-5  
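
Whether the masked sim is really responsible is only a guess in this answer, but masking itself is easy to inspect. The base-R sketch below simulates it with a made-up attached environment named "demo_pkg" (an assumption for illustration, not a real package) and shows that a namespace-qualified call is unaffected. Note that code inside a package resolves names in its own namespace first, so masking on the search path mainly changes what top-level calls see.

```r
# Simulate a package that defines its own `mean`, attached in front of base.
# "demo_pkg" is a hypothetical name used only for this illustration.
attach(list(mean = function(x, ...) "masked version"), name = "demo_pkg")

where_found <- find("mean")      # search-path entries that define `mean`, in lookup order
plain_call  <- mean(1:3)         # resolves to the first definition on the search path
qualified   <- base::mean(1:3)   # explicit namespace: not affected by masking

print(where_found)
print(plain_call)
print(qualified)

detach("demo_pkg", character.only = TRUE)
```

In a real session, `conflicts(detail = TRUE)` lists every masked name per search-path entry.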
Score: 2

Stack Overflow user

Answered 2013-12-06 17:26:51

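Gaussian mixture model clustering is usually initialized randomly, and it only finds a local maximum.

Do not expect it to always return the same result.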
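
A minimal base-R sketch of that behaviour, assuming a hand-written one-dimensional EM with randomly chosen starting means (`em_gmm_1d` is a hypothetical helper, not mclust's algorithm). As far as I know, Mclust() by default starts EM from a model-based hierarchical clustering rather than from random draws, so this illustrates the general point rather than mclust specifically.

```r
# EM for a k-component 1-D Gaussian mixture, started from randomly chosen means.
em_gmm_1d <- function(x, k, iters = 200) {
  n <- length(x)
  mu <- sample(x, k)             # random initialization: k data points as means
  sigma <- rep(sd(x), k)
  w <- rep(1 / k, k)             # mixing proportions
  for (i in seq_len(iters)) {
    # E-step: weighted component densities, one column per component
    dens <- sapply(seq_len(k), function(j) w[j] * dnorm(x, mu[j], sigma[j]))
    resp <- dens / rowSums(dens)
    # M-step: re-estimate proportions, means and standard deviations
    nk <- colSums(resp)
    w <- nk / n
    mu <- colSums(resp * x) / nk
    sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
    sigma <- pmax(sigma, 1e-3)   # guard against a component collapsing onto a point
  }
  dens <- sapply(seq_len(k), function(j) w[j] * dnorm(x, mu[j], sigma[j]))
  list(means = sort(mu), loglik = sum(log(rowSums(dens))))
}

set.seed(42)
x <- c(rnorm(100, 0), rnorm(100, 4), rnorm(100, 8))   # synthetic data

# Same data, five different random starts.
logliks <- sapply(1:5, function(seed) {
  set.seed(seed)
  em_gmm_1d(x, k = 3)$loglik
})
print(round(logliks, 2))
```

Runs that end at different log-likelihoods have converged to different local maxima; runs that agree found the same one.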

Score: 1

Stack Overflow user

Answered 2017-06-24 20:20:28

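EDIT:

This is the outcome of my previous edit. read.file treats the first row as a header, which would be correct for a file that has one, but this file does not. Apparently columns 1 to 6 give different results regardless of whether they are called V1, V2, V3, V4, V5, V6 or X1, X3, X5, X7, X9, X11. I will investigate further later.
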
library(mclust)
library(psych)
library(magrittr)
# sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# Matrix products: default
# 
# locale:
#   [1] LC_COLLATE=English_United Kingdom.1252 
# [2] LC_CTYPE=English_United Kingdom.1252   
# [3] LC_MONETARY=English_United Kingdom.1252
# [4] LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252    
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods  
# [7] base     
# 
# other attached packages:
#   [1] magrittr_1.5 psych_1.7.5  mclust_5.3  
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.4.0    parallel_3.4.0    tools_3.4.0      
# [4] foreign_0.8-68    rstudioapi_0.6    mdaddins_0.0.0001
# [7] nlme_3.1-131      mnormt_1.5-5      grid_3.4.0       
# [10] lattice_0.20-35  

testData_rt <- read.table("http://fimi.ua.ac.be/data/chess.dat")
testData_rf <- read.file("http://fimi.ua.ac.be/data/chess.dat", header = FALSE)  # Without this read.file is skipping first row
testData_rf_head <- read.file("http://fimi.ua.ac.be/data/chess.dat")
testData_rf_head %<>% set_names(names(testData_rf))
testData_rf_head_V2 <- read.file("http://fimi.ua.ac.be/data/chess.dat")

testData_rt %>% str()
testData_rf %>% str()
testData_rf_head %>% str()

# Same res.:
summary(Mclust(subset(testData_rt, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rt, select = c(V11, V9, V1, V3, V5, V7))))

# Same res.:
summary(Mclust(subset(testData_rf, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf, select = c(V11, V9, V1, V3, V5, V7))))

# Same res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf_head, select = c(V11, V9, V1, V3, V5, V7))))

# Different res.:
summary(Mclust(subset(testData_rf_head_V2, select = c(X1, X3, X5, X7, X9, X11))))
summary(Mclust(subset(testData_rf_head_V2, select = c(X11, X9, X1, X3, X5, X7))))

# Different res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V2, V3, V4, V5, V6))))
summary(Mclust(subset(testData_rf_head, select = c(V6, V5, V1, V2, V3, V4))))
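
The header effect described in the edit can be reproduced on a tiny made-up table with base `read.table` (used here as a stand-in; the answer itself uses `psych::read.file`). When the first row is consumed as a header, numeric values become column names such as X1, X3, X5, and one observation is lost, which is consistent with n = 3195 versus n = 3196 in the first answer's output.

```r
# Four rows of made-up data; the first row is data, not a header.
raw <- "1 3 5 7\n1 3 5 7\n1 3 5 8\n2 4 6 8"

with_header <- read.table(text = raw, header = TRUE)   # first row becomes the names
no_header   <- read.table(text = raw, header = FALSE)  # default names V1, V2, ...

print(names(with_header))   # names built from the first row's values
print(names(no_header))
print(c(nrow(with_header), nrow(no_header)))

# X3 is the SECOND column here, whereas V3 is the third one:
same_column <- identical(with_header$X3, no_header$V2[-1])
print(same_column)
```

Under this reading, selecting X1, X3, X5, X7, X9, X11 picks columns 1 to 6, while selecting V1, V3, V5, V7, V9, V11 picks every other column, so the two sets of summaries are fitted to different variables.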

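Original answer:

I did my best to investigate this:

  • Current R (3.4.0) and mclust (5.3): order and seed have no effect;
  • mclust 4.2 (the question was asked on Dec 5 '13): same, no effect;
  • R 2.15.3, mentioned by @user3068797: could not compile mclust 4.2, and gave up because debugging was taking too long;
  • @Cody did not provide sessionInfo(), so I do not know where to dig further.

On to the code:
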
library(mclust)
sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# other attached packages:
# [1] mclust_5.3

testData <- read.table("http://fimi.ua.ac.be/data/chess.dat")

## Seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
#   Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 

set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
#   Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 



## Question asked Dec 5 '13
## mclust 4.2 modified on 2013-07-19, 4.3 introduced on 2014-03-31
devtools::install_version(package = 'mclust', version = '4.2')

## Fix mclust:::unchol
# mclust:::unchol
unchol <- function(x, upper = NULL)
{
  if(is.null(upper)) {
    upper <- any(x[row(x) < col(x)])
    lower <- any(x[row(x) > col(x)])
    if(upper && lower)
      stop("not a triangular matrix")
    if(!(upper || lower)) {
      x <- diag(x)
      return(diag(x * x))
    }
  }
  dimx <- dim(x)
  storage.mode(x) <- "double"
  .Fortran("uncholf",
           as.logical(upper),
           x,
           as.integer(nrow(x)),
           as.integer(ncol(x)),
           integer(1),
           PACKAGE = "mclust")[[2]]
}
assignInNamespace("unchol", unchol, ns = "mclust")
# fixInNamespace(unchol, pos = "package:mclust")
mclust:::unchol

## Again, seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167
# 
# Warning messages:
#   1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
#   best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))) :
#   optimal number of clusters occurs at max choice

set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
# log.likelihood    n df      BIC       ICL
#      -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 
# 
# Warning messages:
#   1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
#   best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))) :
#   optimal number of clusters occurs at max choice



## Check R 2.15.3 from https://cran.r-project.org/bin/windows/base/old/2.15.3/
## Tried with fixing con <- gzcon(url("http://cran.rstudio.com/src/contrib/Meta/archive.rds", 'rb')), but compile...
devtools::install_version(package = 'mclust', version = '4.2')
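
The runs above compare the printed summaries by eye. A more direct check, sketched here under assumptions, compares the two classification vectors with mclust's `adjustedRandIndex`. The built-in `iris` measurements and the column permutation are stand-ins chosen for illustration, not the chess data, and the block does nothing if mclust is not installed.

```r
# Compare clusterings of the same data with the columns in two different orders.
if (suppressPackageStartupMessages(require(mclust, quietly = TRUE))) {
  X <- as.matrix(iris[, 1:4])   # stand-in data set, not the chess data
  perm <- c(4, 2, 1, 3)         # an arbitrary column reordering

  fit_a <- Mclust(X, G = 1:4)
  fit_b <- Mclust(X[, perm], G = 1:4)

  # 1 means the two partitions are identical up to relabelling of the clusters.
  ari <- adjustedRandIndex(fit_a$classification, fit_b$classification)
  print(c(G_a = fit_a$G, G_b = fit_b$G, ARI = ari))
} else {
  ari <- NA_real_               # mclust not available; nothing to compare
}
```

Comparing partitions this way avoids being misled by cluster labels that are merely numbered differently between two fits.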

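EDIT:

There is no difference between the Fortran functions unchol (mclust 4.2) and uncholf (mclust 5.3): uncholf 5.3, unchol 4.3

Mclust 4.3 does differ, but it gives the same results, so I guess the changes are only bug fixes and the like: Mclust 5.3, Mclust

Score: 1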
Original page content provided by Stack Overflow; translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/20392452
