我经常使用R中的体育数据,在试图计算汇总统计数据时,dplyr::group_by()也遇到了同样的问题。在世界杯小组赛的每一场比赛中,我都有以下预测分数的数据:
dput(worldcup.df)
structure(list(teamA_name = c("Russia", "Egypt", "Morocco", "Portugal",
"France", "Argentina", "Peru", "Croatia", "Costa Rica", "Germany",
"Brazil", "Sweden", "Belgium", "Tunisia", "Colombia", "Poland",
"Russia", "Portugal", "Uruguay", "Iran", "Denmark", "France",
"Argentina", "Brazil", "Nigeria", "Serbia", "Belgium", "Korea Republic",
"Germany", "England", "Japan", "Poland", "Uruguay", "Saudi Arabia",
"Iran", "Spain", "Denmark", "Australia", "Nigeria", "Iceland",
"Mexico", "Korea Republic", "Serbia", "Switzerland", "Japan",
"Senegal", "Panama", "England"), teamB_name = c("Saudi Arabia",
"Uruguay", "Iran", "Spain", "Australia", "Iceland", "Denmark",
"Nigeria", "Serbia", "Mexico", "Switzerland", "Korea Republic",
"Panama", "England", "Japan", "Senegal", "Egypt", "Morocco",
"Saudi Arabia", "Spain", "Australia", "Peru", "Croatia", "Costa Rica",
"Iceland", "Switzerland", "Tunisia", "Mexico", "Sweden", "Panama",
"Senegal", "Colombia", "Russia", "Egypt", "Portugal", "Morocco",
"France", "Peru", "Argentina", "Croatia", "Sweden", "Germany",
"Brazil", "Costa Rica", "Poland", "Colombia", "Tunisia", "Belgium"
), epA = c(1.64, 0.7051, 1.1294, 1.1116, 2.1962, 1.984, 1.5765,
1.865, 1.2845, 2.0889, 2.1384, 1.5034, 2.1706, 0.5859, 2.1741,
1.6272, 1.4941, 2.1482, 2.2089, 0.635, 1.7694, 1.6016, 1.7816,
2.4745, 1.0762, 1.0326, 2.198, 1.0414, 2.2583, 2.198, 1.1264,
1.0471, 1.9565, 1.2201, 0.8364, 2.3633, 0.9337, 0.7922, 0.5665,
1.1593, 1.5544, 0.4698, 0.4331, 1.7843, 0.8872, 0.8157, 1.3932,
1.3932), epB = c(1.094, 2.0809, 1.6016, 1.6204, 0.6098, 0.787,
1.1535, 0.89, 1.4405, 0.6981, 0.6576, 1.2226, 0.6304, 2.2251,
0.6279, 1.1058, 1.2319, 0.6488, 0.5991, 2.165, 0.9756, 1.1294,
0.9644, 0.3895, 1.6588, 1.7064, 0.608, 1.6966, 0.5597, 0.608,
1.6046, 1.6909, 0.8105, 1.5069, 1.9266, 0.4757, 1.8163, 1.9778,
2.2495, 1.5697, 1.1746, 2.3712, 2.4179, 0.9617, 1.8688, 1.9503,
1.3308, 1.3308)), .Names = c("teamA_name", "teamB_name", "epA",
"epB"), class = "data.frame", row.names = c(NA, -48L))
head(worldcup.df)
teamA_name teamB_name epA epB
1 Russia Saudi Arabia 1.6400 1.0940
2 Egypt Uruguay 0.7051 2.0809
3 Morocco Iran 1.1294 1.6016
4 Portugal Spain 1.1116 1.6204
5 France Australia 2.1962 0.6098
6 Argentina Iceland 1.9840 0.7870我计算了epA和epB作为A队和B队在每场比赛中的预期点数,现在我想做一个group_by()来计算32支球队的预期积分。我在历史上所做的就是这样的:
asAgroupby = worldcup.df %>%
dplyr::group_by(teamA_name) %>%
dplyr::summarise(totPts = sum(epA))
asBgroupby = worldcup.df %>%
dplyr::group_by(teamB_name) %>%
dplyr::summarise(totPts = sum(epB))
outputdf = asAgroupby %>%
dplyr::left_join(asBgroupby, by = c('teamA_name'='teamB_name')) %>%
dplyr::mutate(totPts = totPts.x + totPts.y) %>%
dplyr::select(-one_of(c('totPts.x', 'totPts.y')))两个单独的group_by()调用,对于每个teamA和teamB列,后面跟着一个left_join,然后对这些列进行求和并删除多余的列.真恶心。这是一个简单的情况,因为这个问题也出现了:准确的4列(2个标识列,2个stat列)。由于大量的体育数据包含了主队/客场队的专栏,这是一个常见的问题。
我觉得我需要一个数据,有2x的行数和1/2的列数,这样我就可以做一个组了。任何帮助都是感谢的,谢谢!
编辑: worldcup.df是从dplyr函数的长%>%构建的--如果这可以在不创建新变量的情况下完成,那么可以加分,但是只是:
worldcup.df %>%
... 发布于 2018-06-05 19:24:29
下面是一个tidyverse工作流,它通过将数据重新格式化为长格式来工作。它确实跟踪谁是在同一场比赛(game_id),以及他们是A或B队-如果这是有用的。(平心而论,这是@Emil的基本理念,只是实现它的一个不同的工作流。)
worldcup.long <- worldcup.df %>%
as_data_frame() %>%
mutate(game_id = 1:n()) %>%
gather(key, value, - game_id) %>%
mutate(
AB = str_extract(key, "A|B"),
key = str_extract(key, "team|ep")
) %>%
spread(key, value,convert = TRUE)
outputdf <- worldcup.long %>%
group_by(team) %>%
summarize(totPts = sum(ep))发布于 2018-06-05 19:18:27
下面是一个较少行且不需要联接的解决方案:
df2 <- df[,c(2,1,4,3)]
names(df2) <- names(df)
rbind(df, df2) %>% group_by(teamA_name) %>% summarise(sum(epA))
# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr> <dbl>
1 Argentina 6.02
2 Australia 2.38
3 Belgium 5.70
4 Brazil 7.03
5 Colombia 5.82
6 Costa Rica 2.64
7 Croatia 4.40
8 Denmark 3.86
9 Egypt 3.44
10 England 5.82与“任择议定书”相同:
outputdf
# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr> <dbl>
1 Argentina 6.02
2 Australia 2.38
3 Belgium 5.70
4 Brazil 7.03
5 Colombia 5.82
6 Costa Rica 2.64
7 Croatia 4.40
8 Denmark 3.86
9 Egypt 3.44
10 England 5.82发布于 2018-06-05 19:34:30
我也遇到过这个问题,还有一些关于足球的幻想。我通常是这样处理的:
df %>% select(team = teamA_name, ep = epA) %>%
bind_rows(df %>% select(team = teamB_name, ep = epB)) %>%
group_by(team) %>%
summarize(ep = sum(ep))https://stackoverflow.com/questions/50707247
复制相似问题