Below is a snippet of data from a CSV file. It lists the cities John visited and the number of hours he stayed in each.
sno City hours stayed
1 London 5
2 London 4
3 Dubai 2
4 Mumbai 8
5 Sydney 16
6 Sydney 16
7 Dubai 2
8 London 8
9 London 9
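10 Paris 17
I need help computing the following:
I. Name of the city John visited most (by number of visits)
II. Name of the city where he stayed the longest (cumulative stay, in hours)
III. Name of the city where he stayed longest in a single visit (which city, and how many hours)
IV. Average number of hours in each city (cumulative hours)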
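For anyone who wants to reproduce the answers, the ten rows above can be typed in as a data frame. This is only a minimal sketch: the column name `hours_stayed` is an assumption taken from the dplyr answers (the base R answer uses `hoursstayed`), and the object name `df` is chosen to match the first answer.

```r
# Rebuild the ten rows shown above by hand.
# Assumption: the hours column is called hours_stayed.
df <- data.frame(
  sno = 1:10,
  City = c("London", "London", "Dubai", "Mumbai", "Sydney",
           "Sydney", "Dubai", "London", "London", "Paris"),
  hours_stayed = c(5, 4, 2, 8, 16, 16, 2, 8, 9, 17),
  stringsAsFactors = FALSE
)

# Inspect the structure: 10 observations of 3 variables
str(df)
```

Renaming the object to `df1`, or the column to `hoursstayed`, makes the same data usable with the other answers.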
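Posted 2016-07-20 10:22:10
library(dplyr)
df <- tbl_df(df)  # tbl_df() is deprecated in current dplyr; as_tibble(df) is the replacement
I. Name of the city John visited most (by number of visits)
df %>%
  select(City) %>%
  table() %>%
  sort(decreasing = TRUE)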
# London Dubai Sydney Mumbai Paris
# 4 2 2 1 1
# Second alternative
df %>%
  group_by(City) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
# Source: local data frame [5 x 2]
# City n
# (fctr) (int)
# 1 London 4
# 2 Dubai 2
# 3 Sydney 2
# 4 Mumbai 1
# 5 Paris 1
II. Name of the city where he stayed the longest (cumulative stay, in hours)
# Running total of hours within each city
df %>%
  group_by(City) %>%
  mutate(cumsum(hours_stayed)) %>%
  arrange(City)
# Source: local data frame [10 x 4]
# Groups: City [5]
# sno City hours_stayed cumsum(hours_stayed)
# (int) (fctr) (int) (int)
# 1 3 Dubai 2 2
# 2 7 Dubai 2 4
# 3 1 London 5 5
# 4 2 London 4 9
# 5 8 London 8 17
# 6 9 London 9 26
# 7 4 Mumbai 8 8
# 8 10 Paris 17 17
# 9 5 Sydney 16 16
# 10 6 Sydney 16 32
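As a quick check on the running totals above, the last `cumsum()` value within each city should equal that city's plain total. The sketch below assumes the data frame is rebuilt by hand with a column named `hours_stayed`; the column names `last_running` and `total` are made up for illustration.

```r
library(dplyr)

# Assumption: same ten rows as in the question, column named hours_stayed
df <- data.frame(
  City = c("London", "London", "Dubai", "Mumbai", "Sydney",
           "Sydney", "Dubai", "London", "London", "Paris"),
  hours_stayed = c(5, 4, 2, 8, 16, 16, 2, 8, 9, 17),
  stringsAsFactors = FALSE
)

# Compare the final running total with the plain sum, per city
check <- df %>%
  group_by(City) %>%
  summarise(last_running = last(cumsum(hours_stayed)),
            total = sum(hours_stayed))

# TRUE when the two columns agree for every city
all(check$last_running == check$total)
```

So the per-city total is the final running total, not the sum of all the running totals.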
# Total hours per city
df %>%
  group_by(City) %>%
  summarise(total_hours = sum(hours_stayed))
# Source: local data frame [5 x 2]
# City total_hours
# (fctr) (int)
# 1 Dubai 4
# 2 London 26
# 3 Mumbai 8
# 4 Paris 17
# 5 Sydney 32
III. Name of the city where he stayed longest in a single visit (which city, and how many hours)
df %>%
  group_by(City) %>%
  summarise(max(hours_stayed))
# Source: local data frame [5 x 2]
# City max(hours_stayed)
# (fctr) (int)
# 1 Dubai 2
# 2 London 9
# 3 Mumbai 8
# 4 Paris 17
# 5 Sydney 16
IV. Average number of hours in each city (cumulative hours)
df %>%
  group_by(City) %>%
  summarise(mean_hours = mean(hours_stayed))
# Source: local data frame [5 x 2]
# City mean_hours
# (fctr) (dbl)
# 1 Dubai 2.0
# 2 London 6.5
# 3 Mumbai 8.0
# 4 Paris 17.0
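# 5 Sydney 16.0
Posted 2016-07-20 09:47:50
We can use dplyr to get the summarised output grouped by 'City', and then pull out the 'City' with the maximum value for each summary column. There are many ways to do this, but dplyr is among the easiest to read and understand. For large datasets, dplyr and data.table are also very efficient.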
library(dplyr)
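# One row per city with visit count, total, maximum and mean hours
res <- df1 %>%
  group_by(City) %>%
  summarise(n = n(),
            totalHours = sum(hours_stayed),
            maxHours = max(hours_stayed),
            meanHours = mean(hours_stayed))
# For each summary column, return the City holding the maximum.
# summarise_each() and funs() are deprecated in current dplyr; across() is the replacement.
res %>%
  summarise_each(funs(City[which.max(.)]), -City)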
# n totalHours maxHours meanHours
# <chr> <chr> <chr> <chr>
#1 London Sydney Paris Paris
The average number of hours in each city can be read from 'res' itself.
res %>%
  select(City, meanHours)
# City meanHours
# <chr> <dbl>
#1 Dubai 2.0
#2 London 6.5
#3 Mumbai 8.0
#4 Paris 17.0
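#5 Sydney 16.0
NOTE:
(1) If there are ties for the maximum number of visits (or for any of the other measures), the first maximum is selected.
(2) We could also do all of this in a single pipeline instead of calling the functions again and again.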
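Both notes can be illustrated together. The sketch below is one possible single pipeline, not the answer's own code: it uses `filter()` in place of `which.max()`, so every city tied for the top total would be kept. It assumes the data is rebuilt by hand as `df1` with a column named `hours_stayed`; `top_total` is a made-up name.

```r
library(dplyr)

# Assumption: same ten rows as in the question, column named hours_stayed
df1 <- data.frame(
  City = c("London", "London", "Dubai", "Mumbai", "Sydney",
           "Sydney", "Dubai", "London", "London", "Paris"),
  hours_stayed = c(5, 4, 2, 8, 16, 16, 2, 8, 9, 17),
  stringsAsFactors = FALSE
)

# One pipeline: summarise per city, then keep every row that reaches the
# maximum total. which.max() would return only the first such row.
top_total <- df1 %>%
  group_by(City) %>%
  summarise(totalHours = sum(hours_stayed)) %>%
  filter(totalHours == max(totalHours))

top_total
```

On this data there is no tie, so the result is the single row for Sydney with 32 hours. With tied cities the result would have one row per tied city.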
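Another efficient option is data.table:
library(data.table)
# setDT() converts df1 to a data.table by reference; .N is the row count per group
res2 <- setDT(df1)[, .(n = .N,
                       totalHours = sum(hours_stayed),
                       maxHours = max(hours_stayed),
                       meanHours = mean(hours_stayed)),
                   by = City]
Posted 2016-07-20 09:51:17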
That is a lot of questions at once, but here is an easy solution in base R.
# Name of the city John visited most (by number of visits)
which.max(table(df$City))
#London
# 2
# Name of the city where he stayed the longest (cumulative hours)
aggdata = aggregate(hoursstayed ~ City, df, sum)
aggdata[which.max(aggdata$hoursstayed), ]
# City hoursstayed
#5 Sydney 32
# Name of the city where he stayed longest in a single visit:
# which city, and how many hours
df[which.max(df$hoursstayed), ]
# sno City hoursstayed
#10 10 Paris 17
# Average number of hours in each city
aggregate(hoursstayed ~ City, df, mean)
# City hoursstayed
#1 Dubai 2.0
#2 London 6.5
#3 Mumbai 8.0
#4 Paris 17.0
#5 Sydney 16.0
https://stackoverflow.com/questions/38477730