文章/答案/技术大牛

发布

社区首页 >问答首页 >在R，有客场和主场队的体育数据组--一种常见的挫折感

问在R，有客场和主场队的体育数据组--一种常见的挫折感
EN

Stack Overflow用户

提问于 2018-06-05 19:04:10

回答 4查看 374关注 0票数 5

我经常使用R中的体育数据，在试图计算汇总统计数据时，dplyr::group_by()也遇到了同样的问题。在世界杯小组赛的每一场比赛中，我都有以下预测分数的数据：

dput(worldcup.df)
structure(list(teamA_name = c("Russia", "Egypt", "Morocco", "Portugal", 
"France", "Argentina", "Peru", "Croatia", "Costa Rica", "Germany", 
"Brazil", "Sweden", "Belgium", "Tunisia", "Colombia", "Poland", 
"Russia", "Portugal", "Uruguay", "Iran", "Denmark", "France", 
"Argentina", "Brazil", "Nigeria", "Serbia", "Belgium", "Korea Republic", 
"Germany", "England", "Japan", "Poland", "Uruguay", "Saudi Arabia", 
"Iran", "Spain", "Denmark", "Australia", "Nigeria", "Iceland", 
"Mexico", "Korea Republic", "Serbia", "Switzerland", "Japan", 
"Senegal", "Panama", "England"), teamB_name = c("Saudi Arabia", 
"Uruguay", "Iran", "Spain", "Australia", "Iceland", "Denmark", 
"Nigeria", "Serbia", "Mexico", "Switzerland", "Korea Republic", 
"Panama", "England", "Japan", "Senegal", "Egypt", "Morocco", 
"Saudi Arabia", "Spain", "Australia", "Peru", "Croatia", "Costa Rica", 
"Iceland", "Switzerland", "Tunisia", "Mexico", "Sweden", "Panama", 
"Senegal", "Colombia", "Russia", "Egypt", "Portugal", "Morocco", 
"France", "Peru", "Argentina", "Croatia", "Sweden", "Germany", 
"Brazil", "Costa Rica", "Poland", "Colombia", "Tunisia", "Belgium"
), epA = c(1.64, 0.7051, 1.1294, 1.1116, 2.1962, 1.984, 1.5765, 
1.865, 1.2845, 2.0889, 2.1384, 1.5034, 2.1706, 0.5859, 2.1741, 
1.6272, 1.4941, 2.1482, 2.2089, 0.635, 1.7694, 1.6016, 1.7816, 
2.4745, 1.0762, 1.0326, 2.198, 1.0414, 2.2583, 2.198, 1.1264, 
1.0471, 1.9565, 1.2201, 0.8364, 2.3633, 0.9337, 0.7922, 0.5665, 
1.1593, 1.5544, 0.4698, 0.4331, 1.7843, 0.8872, 0.8157, 1.3932, 
1.3932), epB = c(1.094, 2.0809, 1.6016, 1.6204, 0.6098, 0.787, 
1.1535, 0.89, 1.4405, 0.6981, 0.6576, 1.2226, 0.6304, 2.2251, 
0.6279, 1.1058, 1.2319, 0.6488, 0.5991, 2.165, 0.9756, 1.1294, 
0.9644, 0.3895, 1.6588, 1.7064, 0.608, 1.6966, 0.5597, 0.608, 
1.6046, 1.6909, 0.8105, 1.5069, 1.9266, 0.4757, 1.8163, 1.9778, 
2.2495, 1.5697, 1.1746, 2.3712, 2.4179, 0.9617, 1.8688, 1.9503, 
1.3308, 1.3308)), .Names = c("teamA_name", "teamB_name", "epA", 
"epB"), class = "data.frame", row.names = c(NA, -48L))

head(worldcup.df)
  teamA_name   teamB_name    epA    epB
1     Russia Saudi Arabia 1.6400 1.0940
2      Egypt      Uruguay 0.7051 2.0809
3    Morocco         Iran 1.1294 1.6016
4   Portugal        Spain 1.1116 1.6204
5     France    Australia 2.1962 0.6098
6  Argentina      Iceland 1.9840 0.7870

我计算了epA和epB作为A队和B队在每场比赛中的预期点数，现在我想做一个group_by()来计算32支球队的预期积分。我在历史上所做的就是这样的：

asAgroupby = worldcup.df %>% 
  dplyr::group_by(teamA_name) %>%
  dplyr::summarise(totPts = sum(epA))

asBgroupby = worldcup.df %>% 
  dplyr::group_by(teamB_name) %>%
  dplyr::summarise(totPts = sum(epB))

outputdf = asAgroupby %>%
  dplyr::left_join(asBgroupby, by = c('teamA_name'='teamB_name')) %>%
  dplyr::mutate(totPts = totPts.x + totPts.y) %>%
  dplyr::select(-one_of(c('totPts.x', 'totPts.y')))

两个单独的group_by()调用，对于每个teamA和teamB列，后面跟着一个left_join，然后对这些列进行求和并删除多余的列.真恶心。这是一个简单的情况，因为这个问题也出现了:准确的4列(2个标识列，2个stat列)。由于大量的体育数据包含了主队/客场队的专栏，这是一个常见的问题。

我觉得我需要一个数据，有2x的行数和1/2的列数，这样我就可以做一个组了。任何帮助都是感谢的，谢谢！

编辑: worldcup.df是从dplyr函数的长%>%构建的--如果这可以在不创建新变量的情况下完成，那么可以加分，但是只是：

worldcup.df %>%
  ...

dplyr

回答 4

Stack Overflow用户

回答已采纳

发布于 2018-06-05 19:24:29

下面是一个tidyverse工作流，它通过将数据重新格式化为长格式来工作。它确实跟踪谁是在同一场比赛(game_id)，以及他们是A或B队-如果这是有用的。(平心而论，这是@Emil的基本理念，只是实现它的一个不同的工作流。)

worldcup.long <- worldcup.df %>% 
  as_data_frame() %>%
  mutate(game_id = 1:n()) %>%
  gather(key, value, - game_id) %>%
  mutate(
    AB = str_extract(key, "A|B"),
    key = str_extract(key, "team|ep")
  ) %>%
  spread(key, value,convert = TRUE) 

outputdf <- worldcup.long %>%
  group_by(team) %>%
  summarize(totPts = sum(ep))

票数 3

Stack Overflow用户

发布于 2018-06-05 19:18:27

下面是一个较少行且不需要联接的解决方案：

df2 <- df[,c(2,1,4,3)]
names(df2) <- names(df)
rbind(df, df2) %>% group_by(teamA_name) %>% summarise(sum(epA))

# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr>           <dbl>
 1 Argentina        6.02
 2 Australia        2.38
 3 Belgium          5.70
 4 Brazil           7.03
 5 Colombia         5.82
 6 Costa Rica       2.64
 7 Croatia          4.40
 8 Denmark          3.86
 9 Egypt            3.44
10 England          5.82

与“任择议定书”相同：

outputdf
# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr>           <dbl>
 1 Argentina        6.02
 2 Australia        2.38
 3 Belgium          5.70
 4 Brazil           7.03
 5 Colombia         5.82
 6 Costa Rica       2.64
 7 Croatia          4.40
 8 Denmark          3.86
 9 Egypt            3.44
10 England          5.82

票数 2

Stack Overflow用户

发布于 2018-06-05 19:34:30

我也遇到过这个问题，还有一些关于足球的幻想。我通常是这样处理的：

df %>% select(team = teamA_name, ep = epA) %>% 
     bind_rows(df %>% select(team = teamB_name, ep = epB)) %>% 
     group_by(team) %>% 
     summarize(ep = sum(ep))

票数 2

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/50707247

复制

相似问题

问在R，有客场和主场队的体育数据组--一种常见的挫折感
EN

回答 4

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在R，有客场和主场队的体育数据组--一种常见的挫折感EN

回答 4

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问在R，有客场和主场队的体育数据组--一种常见的挫折感
EN