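I sometimes need to extract specific rows from a data.frame based on the values of one of its variables. R has built-in functions for the maximum (which.max()) and the minimum (which.min()) that make it easy to extract those rows.
Is there an equivalent for the median? Or is my best bet to write my own function?
Here is a sample data.frame, and how I would use which.max() and which.min():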
set.seed(1) # so you can reproduce this example
dat = data.frame(V1 = 1:10, V2 = rnorm(10), V3 = rnorm(10),
                 V4 = sample(1:20, 10, replace = TRUE))
# To return the first row, which contains the max value in V4
dat[which.max(dat$V4), ]
# To return the seventh row, which contains the min value in V4
dat[which.min(dat$V4), ]
For this particular example, since there is an even number of observations, I would need two rows returned, in this case rows 2 and 10.
Update
It seems there is no built-in function for this. So, using the reply from Sacha as a starting point, I wrote the following function:
which.median = function(x) {
  if (length(x) %% 2 != 0) {
    which(x == median(x))
  } else if (length(x) %% 2 == 0) {
    a = sort(x)[c(length(x)/2, length(x)/2+1)]
    c(which(x == a[1]), which(x == a[2]))
  }
}
I can use it as follows:
# make one data.frame with an odd number of rows
dat2 = dat[-10, ]
# Median rows from 'dat' (even number of rows) and 'dat2' (odd number of rows)
dat[which.median(dat$V4), ]
dat2[which.median(dat2$V4), ]
Are there any suggestions for improving this?
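Answered 2012-04-21 18:14:09
While Sacha's solution is quite general, the median (like any other quantile) is an order statistic, so you can compute the corresponding index from order(x) (rather than from sort(x), which quantile uses).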
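As a minimal sketch of that idea: order(x) lists the positions of the elements in ascending order, so the k-th order statistic sits at position order(x)[k]. The helper name which_middle is hypothetical, and returning both middle positions for an even length is an assumption based on what the question asks for. It also assumes x is non-empty and has no NAs.

```r
# Hypothetical helper: position(s) of the middle order statistic(s).
# Assumes x is non-empty and contains no NAs.
which_middle <- function(x) {
  n <- length(x)
  o <- order(x)                          # positions of x in ascending order
  k <- unique(c(floor((n + 1) / 2),      # lower middle rank
                ceiling((n + 1) / 2)))   # upper middle rank (same when n is odd)
  o[k]
}

x_even <- c(17, 13, 3, 20, 1, 14)  # illustrative values, not the question's data
which_middle(x_even)               # 2 6 (the values 13 and 14)
which_middle(x_even[-6])           # 2   (the value 13)
```

Because the positions come from order(), tied values are reported once per rank, not once per occurrence.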
Looking at quantile, types 1 or 3 could be used; all other types lead to a (weighted) average of two values in certain cases.
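To illustrate the difference, here is a small sketch with made-up numbers: for an even-length vector, types 1 and 3 return a value that is present in the data, while the default type 7 interpolates between the two middle values.

```r
x <- c(10, 20, 30, 40)  # illustrative values, even length

q1 <- quantile(x, 0.5, type = 1, names = FALSE)  # 20
q3 <- quantile(x, 0.5, type = 3, names = FALSE)  # 20
q7 <- quantile(x, 0.5, type = 7, names = FALSE)  # 25, same as median(x)

# Types 1 and 3 pick an element of x; type 7 does not here
q1 %in% x  # TRUE
q3 %in% x  # TRUE
q7 %in% x  # FALSE
```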
I chose type 3, and a bit of copy and paste from quantile leads to:
which.quantile <- function (x, probs, na.rm = FALSE){
  if (! na.rm & any (is.na (x)))
    return (rep (NA_integer_, length (probs)))

  o <- order (x)           # NAs are sorted to the end
  n <- sum (! is.na (x))
  o <- o [seq_len (n)]     # keep only the positions of non-NA values

  nppm <- n * probs - 0.5
  j <- floor(nppm)
  h <- ifelse((nppm == j) & ((j%%2L) == 0L), 0, 1)
  j <- j + h

  j [j == 0] <- 1
  o[j]
}
A small test:
> x <-c (2.34, 5.83, NA, 9.34, 8.53, 6.42, NA, 8.07, NA, 0.77)
> probs <- c (0, .23, .5, .6, 1)
> which.quantile (x, probs, na.rm = TRUE)
[1] 10  1  6  6  4
> x [which.quantile (x, probs, na.rm = TRUE)] == quantile (x, probs, na.rm = TRUE, type = 3)
  0%  23%  50%  60% 100%
TRUE TRUE TRUE TRUE TRUE
And here is your example:
> dat [which.quantile (dat$V4, c (0, .5, 1)),]
  V1         V2          V3 V4
7  7  0.4874291 -0.01619026  1
2  2  0.1836433  0.38984324 13
1  1 -0.6264538  1.51178117 17
Answered 2012-04-21 13:48:21
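I think it is just:
which(dat$V4 == median(dat$V4))
But be careful: when there is no single middle value, the median is the mean of the two middle values. For example, median(1:4) gives 2.5, which does not match any element.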
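A minimal sketch of that caveat, using made-up vectors instead of the question's data: with an odd number of values the comparison finds the median's position, while with an even number it can come back empty.

```r
# Illustrative vectors (not from the question's data)
odd  <- c(5, 1, 9, 3, 7)   # median is 5, an actual element
even <- c(5, 1, 9, 3)      # median is 4, the mean of 3 and 5

which(odd == median(odd))    # 1
which(even == median(even))  # integer(0): no element equals 4

# If the two middle values happen to be equal, a match still exists
tied <- c(2, 4, 4, 8)
which(tied == median(tied))  # 2 3
```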
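Edit
Here is a function that returns the index of the median element or, when the median is the mean of two values, the first element closest to that mean. This is similar to how which.min() returns only the first element equal to the minimum:
whichmedian <- function(x) which.min(abs(x - median(x)))
For example: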
> whichmedian(1:4)
[1] 2
Answered 2012-04-21 17:23:25
I wrote a more comprehensive function that serves my needs:
row.extractor = function(data, extract.by, what) {
  # data = your data.frame
  # extract.by = the variable that you are extracting by, either
  #              as its index number or by name
  # what = either "min", "max", "median", or "all", with quotes
  # Convert a column name to its index; numeric indices are used as-is
  if (!is.numeric(extract.by)) {
    extract.by = which(colnames(data) %in% extract.by)
  }
  which.median = function(data, extract.by) {
    a = data[, extract.by]
    if (length(a) %% 2 != 0) {
      which(a == median(a))
    } else if (length(a) %% 2 == 0) {
      b = sort(a)[c(length(a)/2, length(a)/2+1)]
      c(max(which(a == b[1])), min(which(a == b[2])))
    }
  }
  X1 = data[which(data[extract.by] == min(data[extract.by])), ]
  X2 = data[which(data[extract.by] == max(data[extract.by])), ]
  X3 = data[which.median(data, extract.by), ]
  if (what == "min") {
    X1
  } else if (what == "max") {
    X2
  } else if (what == "median") {
    X3
  } else if (what == "all") {
    rbind(X1, X3, X2)
  }
}
Here are some usage examples:
> row.extractor(dat, "V4", "max")
  V1         V2       V3 V4
1  1 -0.6264538 1.511781 17
> row.extractor(dat, 4, "min")
  V1        V2          V3 V4
7  7 0.4874291 -0.01619026  1
> row.extractor(dat, "V4", "all")
   V1         V2          V3 V4
7   7  0.4874291 -0.01619026  1
2   2  0.1836433  0.38984324 13
10 10 -0.3053884  0.59390132 14
1   1 -0.6264538  1.51178117 17
https://stackoverflow.com/questions/10256503