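Please help me solve this problem. I have looked through various other posts, but I cannot piece them together. I have data with records for about 100,000 athletes and the training events they took part in. I have simplified the data here, but the approach needs to fit the whole data set.
Code for the data.frame: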
# Fictitious data
days <- seq(as.Date("2016/01/01"), as.Date("2016/01/28"), "days")
events <- c("Run","Swim","Swim","Cycle","Rest","Gym","Swim","Run",
"Cycle","Run","Swim","Swim","Run","Swim","Cycle","Rest","Gym",
"Swim","Swim","Swim","Run","Swim","Run","Gym","Rest","Gym",
"Cycle","Swim")
my.data <- data.frame(athlete = 1, days,events)
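# Note - This data repeats for many participants, but I did not include more than 1
I need to flag athletes who completed at least 3 swims per week for at least two consecutive weeks.
Edit: I did not think this through properly. Let's make things a little more complicated. Suppose we use running weeks, i.e. blocks of 7 days, rather than calendar weeks, starting from each athlete's first swim event.
Update: I have another challenge. Suppose I only want to find a pattern of at least 3 swim events in every 5 days, for at least 10 consecutive days, anywhere in the data.
Thanks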
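Posted on 2017-01-19 23:51:42
You can do a two-step summary: first count the number of swims per athlete per week, then check whether there are consecutive weeks in which the athlete swam at least three times: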
library(dplyr)
library(lubridate)
my.data %>%
arrange(days) %>%
group_by(athlete, w = week(days)) %>%
summarise(n_swim = sum(events == "Swim")) %>%
group_by(athlete) %>%
summarise(flag = any(diff(w[n_swim >= 3]) == 1))
# A tibble: 1 x 2
# athlete flag
# <dbl> <lgl>
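#1 1 TRUE
Update: To make the weeks start from the first swim, use which.max() to find the index of the first Swim, subtract that day from all days to get the day difference, and then integer-divide the difference by 7 (%/% 7), so that week numbers start from that day.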
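The week arithmetic can be checked on its own before running it per athlete. The following is a minimal base R sketch on a hypothetical 10-day vector (the toy values are an assumption, not the question's data), showing that days before the first swim get a negative week index and that each block of 7 days from the first swim shares one index:

```r
# Hypothetical toy data: 10 days, first swim on the second day
days <- seq(as.Date("2016-01-01"), by = "day", length.out = 10)
events <- c("Run", "Swim", "Swim", "Cycle", "Rest",
            "Gym", "Swim", "Run", "Cycle", "Run")

swim <- events == "Swim"
# which.max() on a logical vector returns the index of the first TRUE
first_swim <- days[which.max(swim)]
# whole days elapsed since the first swim (negative before it)
day_diff <- as.integer(days - first_swim)
# integer division by 7 gives the running-week index, 0 for the first 7 days
w <- day_diff %/% 7
print(data.frame(days, events, day_diff, w))
```

One caveat worth keeping in mind: which.max() returns 1 when there is no TRUE at all, so an athlete with no swims would have weeks anchored on their first recorded day. That is harmless here because such an athlete never reaches 3 swims in any week. The dplyr pipeline below applies the same computation within each athlete: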
my.data %>%
arrange(days) %>% group_by(athlete) %>%
mutate(Swim = events == "Swim",
w = as.integer(days - days[which.max(Swim)]) %/% 7) %>%
# the first swim day is set as zero; integer division by 7 gives the week number
# starting from this day
group_by(w, .add = TRUE) %>% # .add replaces the deprecated add argument in dplyr >= 1.0.0
summarise(n_swim = sum(Swim)) %>%
group_by(athlete) %>%
summarise(flag = any(diff(w[n_swim >= 3]) == 1))
# A tibble: 1 x 2
# athlete flag
# <dbl> <lgl>
#1 1 TRUE
Posted on 2017-01-20 00:34:07
Quick and dirty code, but please check whether it works for your data set:
library(tidyverse)
library(lubridate)
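my.data %>%
mutate(weeknum = week(days)) %>%
group_by(athlete, weeknum) %>%
# count swims per week, keeping weeks without swims so lag() compares adjacent weeks
summarise(n = sum(events == "Swim")) %>%
mutate(gt_3 = as.numeric(n >= 3),
# flag a week when it and the immediately preceding week both have at least 3 swims
flag = gt_3 == 1 & lag(gt_3, 1, default = 0) == 1 &
weeknum - lag(weeknum, 1) == 1) %>%
filter(flag) %>%
select(athlete) %>%
distinct()
https://stackoverflow.com/questions/41753710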