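I am trying to understand Canopy clustering using the synthetic control data example from the Apache Mahout Cookbook. However, instead of getting 6 clusters, I get 600: one for each sample in the dataset.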
C-0{n=1 c=0:28.781,1:34.463,2:31.338,3:31.283,4:28.921,5:33.760,6:25.397,7:27.785,8:35.248,9:27.116,10:32.872,11:29.217,12:36.025,13:32.337,15:34.525,16:32.872,17:34.117,18:26.524,19:27.662,20:26.369,21:25.774,22:29.270,25:30.733,26:29.505,27:33.029,28:25.040,31:28.917,32:24.344,33:26.120,34:34.942,35:25.029,36:26.631,37:35.654,38:28.435,39:29.150,40:28.158,41:26.193,42:33.318,43:30.977,44:27.044,45:35.534,46:26.235,47:28.996,48:32.004,49:31.056,50:34.255,51:28.072,52:28.940,53:35.497,54:29.747,56:31.433,57:24.556,58:33.743,59:25.047,60:34.932 r=[]}
C-1{n=1 c=0:24.892,1:25.741,3:27.553,4:32.822,5:27.879,6:31.593,7:31.486,8:35.547,9:27.952,10:31.660,11:27.542,12:31.189,13:27.487,14:31.391,16:27.811,18:24.488,20:27.592,21:35.627,22:35.410,23:31.417,24:30.745,25:24.131,26:35.142,27:30.472,28:31.987,29:33.662,30:25.551,31:30.469,32:33.647,33:25.070,34:34.077,35:32.598,36:28.304,37:26.147,38:26.941,39:31.520,40:33.109,41:24.149,42:28.516,43:25.791,44:35.952,45:26.530,46:24.858,47:25.956,48:32.836,49:28.532,50:26.346,51:30.621,52:28.986,53:29.405,54:32.558,55:31.021,56:26.642,57:28.433,58:33.656,59:26.424,60:28.466 r=[]}
C-2{n=1 c=0:31.399,1:30.632,2:26.398,3:24.291,4:27.861,5:28.549,6:24.972,7:32.436,8:25.224,9:27.307,10:31.839,11:27.259,12:28.257,13:26.582,14:24.046,15:35.063,16:31.572,17:32.561,18:31.031,19:34.120,20:26.934,21:31.478,22:35.017,23:32.385,24:24.332,25:30.200,26:31.245,27:26.681,28:31.514,29:28.878,30:27.309,31:24.246,33:26.963,34:25.292,35:31.611,36:24.713,37:27.481,38:24.208,39:26.806,40:35.125,41:32.629,42:31.056,43:26.358,44:28.086,45:31.439,46:27.306,47:29.608,48:35.973,49:34.144,50:27.172,51:33.632,52:26.597,53:25.539,54:32.543,55:25.577,56:29.990,57:31.351,59:33.900,60:29.545 r=[]}
C-3{n=1 c=0:25.774,2:30.526,3:35.421,4:25.603,5:27.970,8:25.270,9:28.132,11:29.427,12:31.455,13:27.320,16:28.956,17:28.992,18:29.958,19:30.277,20:30.445,21:24.304,22:24.314,24:35.097,25:25.368,26:32.097,27:33.330,28:25.010,29:35.316,30:31.626,31:29.281,32:34.202,33:26.508,34:32.228,35:25.527,36:24.824,38:27.559,39:28.371,40:32.367,41:26.975,42:35.935,43:35.115,44:24.375,45:27.608,46:27.843,47:29.856,48:32.419,49:26.891,50:31.321,51:29.385,52:34.334,53:24.738,54:35.769,56:31.873,57:34.205,58:31.156,60:34.629 r=[]}
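And so on, up to C-600.
Can anyone think of a reason for this?
I am running:
mahout canopy -i $WORK_DIR/sequencefile/synthetic_control.seq -o $WORK_DIR/output/canopy.output -t1 80 -t2 55
I am using Mahout 0.9 running on Hadoop 1.2.1. The book's example is for Mahout 0.9; has the way the command is invoked changed?
I even tried different values for t1 and t2, but the result is the same.
Thanks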
Answered on 2014-08-22 21:59:41
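Canopy is used to produce a guess for the parameter "k" in k-means. It is so sensitive to the choice of t1 and t2 that, in my opinion, it is useless. For that reason it is being deprecated.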
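To make the sensitivity to t1 and t2 concrete, here is a minimal sketch of a textbook canopy pass in plain Python. It is not Mahout's implementation; the function names and the toy points are assumptions made up for illustration. It shows that when t2 is smaller than every pairwise distance (under whatever distance measure is in use), each point ends up in its own canopy with n=1, which is the same symptom as in the question. Whether that is the actual cause there cannot be told from the output alone.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy(points, t1, t2, distance=euclidean):
    """Simplified canopy clustering (illustrative, not Mahout's code).

    t1 is the loose threshold (canopy membership), t2 the tight one
    (points closer than t2 to a center can no longer start a canopy).
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # next unclaimed point starts a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:                     # loosely close: belongs to this canopy
                members.append(p)
            if d >= t2:                    # not tightly close: may start its own canopy
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

# Hypothetical toy data: two tight groups that are far apart.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]

print(len(canopy(points, 3.0, 2.0)))     # 2: thresholds match the group scale
print(len(canopy(points, 0.6, 0.5)))     # 6: t2 below every pairwise distance
print(len(canopy(points, 100.0, 50.0)))  # 1: t2 above every pairwise distance
```

The same six points give one, two, or six canopies depending only on the thresholds, so the canopy count is a weak estimate of k unless t1 and t2 are tuned to the scale of the real distances.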
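There is no great alternative in Mahout, but you can look at streaming k-means, or run clusterdump on the k-means results for several values of k and pick the k that best fits your actual data, looking for the highest cohesion and the greatest separation.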
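The "highest cohesion, greatest separation" idea can be sketched as follows. This is a hedged illustration in plain Python, not a Mahout API: with Mahout the centroids would come from the kmeans job and clusterdump, whereas here a tiny stand-in k-means is used. The score (smallest distance between centroids divided by the mean point-to-centroid distance) and the toy data are assumptions chosen for simplicity.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(pts):
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(len(pts[0])))

def kmeans(points, k, iterations=20):
    """Tiny stand-in for a real k-means run; returns the final centroids."""
    # Deterministic farthest-point initialisation (an assumption, for repeatability).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

def separation_over_cohesion(points, centers):
    """Higher is better: centroids far apart, points close to their centroid.

    Needs at least two centers, and fewer centers than distinct points
    (otherwise the mean distance can be zero).
    """
    cohesion = sum(min(dist(p, c) for c in centers) for p in points) / len(points)
    separation = min(dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:])
    return separation / cohesion

# Hypothetical toy data: three well-separated groups of four points each.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 0), (11, 0), (10, 1), (11, 1),
          (5, 9), (6, 9), (5, 10), (6, 10)]

scores = {k: separation_over_cohesion(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3 for this toy data
```

On real data the score will rarely have such a clear peak, so it is worth inspecting the per-k values rather than relying only on the maximum.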
https://stackoverflow.com/questions/25428894