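Given a dataset, I want to split it into 4 bins using both the equal-frequency and equal-width methods described here, but I want to do it in R.
Dataset:
0, 4, 12, 16, 16, 18, 24, 26, 28
I tried writing a small piece of code for equal-width binning, but it only produces a histogram.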
bins<-4;
minimumVal<-min(dataset)
maximumVal<-max(dataset)
width=(maximumVal-minimumVal)/bins;
edges = minimumVal:width:maximumVal;
hist(dataset, breaks = "Sturges", freq = TRUE, xlim = range(edges))
I am new to R.
Answered on 2017-02-04 21:01:39
For equal-width binning, I suggest using the classInt package:
dataset <- c(0, 4, 12, 16, 16, 18, 24, 26, 28)
library(classInt)
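classIntervals(dataset, 4)  # note: without a style argument, classIntervals defaults to style = 'quantile'
x <- classIntervals(dataset, 4, style = 'equal')
To use the breaks, you can inspect x$brks.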
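If you would rather not depend on a package, the same equal-width idea can be sketched in base R. This is only a minimal sketch: the names edges and binned are mine, and it assumes the bins should span exactly min(dataset) to max(dataset).

```r
# Sketch: equal-width binning in base R (no extra packages)
dataset <- c(0, 4, 12, 16, 16, 18, 24, 26, 28)
bins <- 4

# bins + 1 equally spaced edges from the minimum to the maximum
edges <- seq(min(dataset), max(dataset), length.out = bins + 1)

# include.lowest = TRUE keeps the minimum (0) in the first bin
# instead of turning it into NA
binned <- cut(dataset, breaks = edges, include.lowest = TRUE)

print(edges)          # bin edges
print(table(binned))  # number of values per bin
```

For this dataset the edges come out as 0, 7, 14, 21, 28 and the bins hold 2, 1, 3 and 3 values: the widths are equal, the counts are not.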
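As for equal-frequency binning, you can use the same package with the option style = 'quantile':
classIntervals(dataset, 4, style = 'quantile')
The data cannot be separated into bins of exactly equal size, both because dataset contains a repeated value (16) and because its 9 elements cannot be divided evenly among 4 bins. I don't know whether this is a problem, since the reference you provided says
"...each group contains approximately the same number of values."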
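To see what "approximately the same number" looks like for this dataset, here is a minimal base R sketch that takes the quartiles as bin edges. The names q_edges and binned_q are mine, and it relies on quantile()'s default method (type 7); other quantile definitions may give slightly different edges.

```r
# Sketch: equal-frequency binning via quantiles in base R
dataset <- c(0, 4, 12, 16, 16, 18, 24, 26, 28)
bins <- 4

# Quantiles at 0%, 25%, 50%, 75%, 100% serve as the bin edges
q_edges <- quantile(dataset, probs = seq(0, 1, length.out = bins + 1))

# include.lowest = TRUE keeps the minimum in the first bin
binned_q <- cut(dataset, breaks = q_edges, include.lowest = TRUE)

print(q_edges)
print(table(binned_q))  # counts are close to equal, but not identical
```

Here the edges are 0, 12, 16, 24, 28 and the counts are 3, 2, 2, 2. Note that with heavier ties two quantiles can coincide, and cut() will then stop with an error because the breaks are not unique.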
Since you did not say exactly which method you are looking for, I suggest this post as another approach, which for your example would be:
library(Hmisc)
table(cut2(dataset, m = length(dataset)/4))
In addition, the other posts at the link suggested above offer further alternatives and some relevant discussion of these methods.
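If the bin sizes matter more to you than keeping equal values together, one further option is to bin by rank. The sketch below is not what cut2 or classIntervals do; it is just an illustration, and the names r and bin_id are mine.

```r
# Sketch: rank-based equal-frequency binning (an illustrative alternative,
# not the method used by cut2 or classIntervals)
dataset <- c(0, 4, 12, 16, 16, 18, 24, 26, 28)
bins <- 4

# Rank the values; ties.method = "first" gives tied values distinct ranks
r <- rank(dataset, ties.method = "first")

# Map ranks 1..n onto bin numbers 1..bins as evenly as possible
bin_id <- ceiling(r * bins / length(dataset))

print(table(bin_id))           # bin sizes
print(split(dataset, bin_id))  # which values landed in which bin
```

For this dataset the bin sizes are 2, 2, 2, 3, which is as even as 9 values in 4 bins can get. The price is that the two 16s end up in different bins (2 and 3), so the bins no longer correspond to clean value intervals.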
Answered on 2017-02-04 15:49:41
You can try the following for equal-width binning:
set.seed(1)
dataset <- runif(100, 0, 10) # some random data
bins<-4
minimumVal<-min(dataset)
maximumVal<-max(dataset)
width <- (maximumVal - minimumVal) / bins
# include.lowest = TRUE keeps the minimum in the first bin; without it, cut() returns NA for that value
cut(dataset, breaks = seq(minimumVal, maximumVal, by = width), include.lowest = TRUE)
#[1] (2.58,5.03] (2.58,5.03] (5.03,7.47] (7.47,9.92] [0.134,2.58] (7.47,9.92] (7.47,9.92] (5.03,7.47] (5.03,7.47] [0.134,2.58] [0.134,2.58] [0.134,2.58]
#[13] (5.03,7.47] (2.58,5.03] (7.47,9.92] (2.58,5.03] (5.03,7.47] (7.47,9.92] (2.58,5.03] (7.47,9.92] (7.47,9.92] [0.134,2.58] (5.03,7.47] [0.134,2.58]
#[25] (2.58,5.03] (2.58,5.03] [0.134,2.58] (2.58,5.03] (7.47,9.92] (2.58,5.03] (2.58,5.03] (5.03,7.47] (2.58,5.03] [0.134,2.58] (7.47,9.92] (5.03,7.47]
#[37] (7.47,9.92] [0.134,2.58] (5.03,7.47] (2.58,5.03] (7.47,9.92] (5.03,7.47] (7.47,9.92] (5.03,7.47] (5.03,7.47] (7.47,9.92] [0.134,2.58] (2.58,5.03]
#[49] (5.03,7.47] (5.03,7.47] (2.58,5.03] (7.47,9.92] (2.58,5.03] [0.134,2.58] [0.134,2.58] [0.134,2.58] (2.58,5.03] (5.03,7.47] (5.03,7.47] (2.58,5.03]
#[61] (7.47,9.92] (2.58,5.03] (2.58,5.03] (2.58,5.03] (5.03,7.47] [0.134,2.58] (2.58,5.03] (7.47,9.92] [0.134,2.58] (7.47,9.92] (2.58,5.03] (7.47,9.92]
#[73] (2.58,5.03] (2.58,5.03] (2.58,5.03] (7.47,9.92] (7.47,9.92] (2.58,5.03] (7.47,9.92] (7.47,9.92] (2.58,5.03] (5.03,7.47] (2.58,5.03] (2.58,5.03]
#[85] (7.47,9.92] [0.134,2.58] (5.03,7.47] [0.134,2.58] [0.134,2.58] [0.134,2.58] [0.134,2.58] [0.134,2.58] (5.03,7.47] (7.47,9.92] (7.47,9.92] (7.47,9.92]
#[97] (2.58,5.03] (2.58,5.03] (7.47,9.92] (5.03,7.47]
#Levels: [0.134,2.58] (2.58,5.03] (5.03,7.47] (7.47,9.92]
# plot frequencies in the bins
barplot(table(cut(dataset, breaks = seq(minimumVal, maximumVal, by = width), include.lowest = TRUE)))

https://stackoverflow.com/questions/42037740