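I am trying to get each person's adds and drops by year in R (example below). I want to write a function that returns what each person added and dropped, plus the number of adds and drops by year. In this example, for Mark: Add = 0, Add_act = N/A, Drop = 2, Drop_act = c("Basketball", "Volleyball"). A for loop is my first instinct; are there any suggestions on how to design this algorithm?
Thanks, Annie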
Year Name Activity
2010 Mark Tennis
2010 Mark Swim
2010 Mark Basketball
2010 Mark Volleyball
2010 Tom Swim
2010 Rachale Tennis
2010 Rachale Waterball
2010 Rachale Yoga
2010 Mary Volleyball
2010 Mary Yoga
2010 Kim Waterball
2011 Mark Tennis
2011 Mark Swim
2011 Tom Volleyball
2011 Tom Waterball
2011 Tom Swim
2011 Rachale Tennis
2011 Rachale Waterball
2011 Rachale Yoga
2011 Rachale Swim
2011 Mary Volleyball
2011 Jerry Basketball
The result I expect is as follows:
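Year Name Add Add_act Drop Drop_act
2010 Mark 4 "Tennis", "Swim", "Basketball", "Volleyball"
2010 Tom 1 "Swim" 0 NA
2010 Rachale 3 "Tennis", "Waterball", "Yoga"
2010 Mary 2 "Volleyball", "Yoga" 0 NA
2010 Kim 1 "Waterball" 0 NA
2011 Mark 0 NA 1 "Basketball"
2011 Tom 1 "Waterball" 0 NA
2011 Rachale 1 "Swim" 0 NA
2011 Mary 0 NA 1 "Yoga"
2011 Jerry 1 "Basketball" 0 NA
2011 Kim 0 NA 1 "Waterball"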
Answered 2017-03-28 15:57:48
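Edit: OK, now that I understand you want to aggregate across the whole data set, you do need some form of looping. However, you can do this with R's *apply functions, which also put the output into a nice list.
We can use the simple function I originally wrote, modified slightly to add the name and year (just to make the output easier to interpret).
The function takes the input data frame, the person you want to check, and the year you want to evaluate. It then builds two vectors: one with the activities for that year and one with the activities for the previous year. We then subset each vector with the %in% operator to get the additions and the drops, and use length to get the totals.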
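To make the %in% step concrete, here is a minimal sketch using Mark's activities from the question's sample data. The variable names prev_acts and curr_acts are placeholders chosen for this illustration and are not part of the answer's code.

```r
# Mark's activities, copied from the sample data in the question
prev_acts <- c("Tennis", "Swim", "Basketball", "Volleyball")  # 2010
curr_acts <- c("Tennis", "Swim")                              # 2011

# Added: present in the current year but not in the previous one
added <- curr_acts[!(curr_acts %in% prev_acts)]

# Dropped: present in the previous year but not in the current one
dropped <- prev_acts[!(prev_acts %in% curr_acts)]

length(added)    # 0
dropped          # "Basketball" "Volleyball"
length(dropped)  # 2
```

For vectors without duplicates, setdiff(prev_acts, curr_acts) should give the same dropped activities.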
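Using expand.grid, we get every possible combination of year and individual in the sample data. Then, using mapply, we can build the output for each of those combinations. The result is a list of lists (I used this because a data frame is not sensible here, since the added or dropped activities have different lengths).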
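As a rough illustration of how expand.grid and mapply fit together, the sketch below uses toy inputs. The helper describe_pair is a made-up stand-in for the real per-person function, and the two-person, two-year input is only an assumption for the example.

```r
# Hypothetical toy inputs, not the full data set
years  <- c(2010, 2011)
people <- c("Mark", "Tom")

# One row per (year, person) pair; unnamed inputs become columns Var1 and Var2
combos <- expand.grid(years, people, stringsAsFactors = FALSE)
nrow(combos)  # 4

# describe_pair is a made-up placeholder for the real per-person function
describe_pair <- function(year, person) {
  list(Person = person, Year = year)
}

# mapply() walks both columns in parallel; SIMPLIFY = FALSE keeps a list of lists
results <- mapply(FUN = describe_pair,
                  year = combos$Var1,
                  person = combos$Var2,
                  SIMPLIFY = FALSE)

length(results)      # 4
results[[2]]$Year    # 2011
results[[2]]$Person  # "Mark"
```

Because expand.grid varies its first argument fastest, the second combination is Mark in 2011.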
I put your data into a text file, which I read with read.csv.
options(stringsAsFactors = FALSE)
df_example <- read.csv(file = "C:/Users/trehman/Desktop/input.txt", header = FALSE)
names(df_example) <- c("Year", "Name", "Activity")
func_find_changes <- function(data, person, year) {
  # Activities for this person in the given year and in the year before
  curryr_acts <- data[data$Name == person
                      & data$Year == year, "Activity"]
  prevyr_acts <- data[data$Name == person
                      & data$Year == year - 1, "Activity"]
  # Added = in current year only; dropped = in previous year only
  added_acts <- curryr_acts[!(curryr_acts %in% prevyr_acts)]
  dropped_acts <- prevyr_acts[!(prevyr_acts %in% curryr_acts)]
  n_add <- length(added_acts)
  n_drop <- length(dropped_acts)
  return(list(Person = person,
              Year = year,
              Add = n_add,
              Add_act = added_acts,
              Drop = n_drop,
              Drop_act = dropped_acts))
}
# Create all combinations to check
df_nameyears <- expand.grid(unique(df_example$Year),
                            unique(df_example$Name),
                            stringsAsFactors = FALSE)
# Use mapply() to get them
lst_changes <- mapply(FUN = func_find_changes,
                      year = df_nameyears$Var1,
                      person = df_nameyears$Var2,
                      MoreArgs = list(data = df_example),
                      SIMPLIFY = FALSE)
Answered 2017-03-28 16:01:42
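I have not had time to work through to the final result, but I can see that this might help you or someone else get started. The idea is to split the data set by individual, then index by year and find the co-occurrence frequencies.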
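The split-then-cross-tabulate idea can be sketched in base R as follows. This is only an illustration on a subset of the question's data (Mark and Tom), and it assumes every person appears in both 2010 and 2011; the names acts, by_person, and presence are placeholders for this example.

```r
# A small subset of the question's data (Mark and Tom only)
acts <- data.frame(
  Year = c(2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011),
  Name = c("Mark", "Mark", "Mark", "Mark", "Tom",
           "Mark", "Mark", "Tom", "Tom", "Tom"),
  Activity = c("Tennis", "Swim", "Basketball", "Volleyball", "Swim",
               "Tennis", "Swim", "Volleyball", "Waterball", "Swim"),
  stringsAsFactors = FALSE
)

# One data frame per person
by_person <- split(acts, acts$Name)

# For each person, an Activity x Year table of TRUE/FALSE presence flags
presence <- lapply(by_person, function(d) table(d$Activity, d$Year) > 0)

# Dropped: present in 2010 but absent in 2011
m <- presence$Mark
dropped_mark <- rownames(m)[m[, "2010"] & !m[, "2011"]]
dropped_mark  # "Basketball" "Volleyball"

# Added: absent in 2010 but present in 2011
tm <- presence$Tom
added_tom <- rownames(tm)[!tm[, "2010"] & tm[, "2011"]]
added_tom     # "Volleyball" "Waterball"
```

A person who appears in only one year would lack one of the two columns, so that case would need separate handling.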
Also, please use dput so that we can easily recreate the data. What I did was:
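# Packages assumed here; the original answer did not show its library() calls.
# plyr is loaded before dplyr so that count() refers to dplyr::count().
library(plyr)      # adply()
library(dplyr)     # %>% and count()
library(reshape2)  # acast()
a <- textConnection('Year Name Activity
2010 Mark Tennis
2010 Mark Swim
2010 Mark Basketball
2010 Mark Volleyball
2010 Tom Swim
2010 Rachale Tennis
2010 Rachale Waterball
2010 Rachale Yoga
2010 Mary Volleyball
2010 Mary Yoga
2011 Mark Tennis
2011 Mark Swim
2011 Tom Volleyball
2011 Tom Waterball
2011 Tom Swim
2011 Rachale Tennis
2011 Rachale Waterball
2011 Rachale Yoga
2011 Rachale Swim
2011 Mary Volleyball') %>% read.table %>% {
  # Promote the first row to column names, then drop it
  colnames(.) <- as.character(.[1, ])
  .[-1, ]
}
lapply(split(a, a$Name), function(i) {
  # Net change in the number of activities between the first and last year
  counts <- count(i, Year)
  n_change <- as.numeric(counts[nrow(counts), 2] - counts[1, 2])
  if (n_change < 0) {
    add <- 0
    drop <- n_change * -1
  } else {
    add <- n_change
    drop <- 0
  }
  # Activity x Year matrix; NA marks an activity missing in that year
  check_act <- acast(i, Activity ~ Year, value.var = "Year")
  list(add = add, drop = drop, adply(check_act, 2, is.na))
})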
# $Mark
# $Mark$add
# [1] 0
#
# $Mark$drop
# [1] 2
#
# $Mark[[3]]
# X1 Basketball Swim Tennis Volleyball
# 1 2010 FALSE FALSE FALSE FALSE
# 2 2011 TRUE FALSE FALSE TRUE
#
#
# $Mary
# $Mary$add
# [1] 0
#
# $Mary$drop
# [1] 1
#
# $Mary[[3]]
# X1 Volleyball Yoga
# 1 2010 FALSE FALSE
# 2 2011 FALSE TRUE
#
#
# $Rachale
# $Rachale$add
# [1] 1
#
# $Rachale$drop
# [1] 0
#
# $Rachale[[3]]
# X1 Swim Tennis Waterball Yoga
# 1 2010 TRUE FALSE FALSE FALSE
# 2 2011 FALSE FALSE FALSE FALSE
#
#
# $Tom
# $Tom$add
# [1] 2
#
# $Tom$drop
# [1] 0
#
# $Tom[[3]]
# X1 Swim Volleyball Waterball
# 1 2010 FALSE TRUE TRUE
# 2 2011 FALSE FALSE FALSE
#
#
Answered 2017-03-28 16:07:08
You can easily use data.table to find the change in the number of activities over the years, grouped by individual:
DF <- structure(
list(Year = c(2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,
2011, 2011),
Name = c("Mark", "Mark", "Mark", "Mark", "Tom",
"Rachale", "Rachale", "Rachale", "Mary", "Mary", "Mark", "Mark",
"Tom", "Tom", "Tom", "Rachale", "Rachale", "Rachale", "Rachale",
"Mary"),
Activity = c("Tennis", "Swim", "Basketball", "Volleyball",
"Swim", "Tennis", "Waterball", "Yoga", "Volleyball", "Yoga",
"Tennis", "Swim", "Volleyball", "Waterball", "Swim", "Tennis",
"Waterball", "Yoga", "Swim", "Volleyball")),
.Names = c("Year", "Name", "Activity"),
row.names = c(NA, 20L), class = "data.frame")
library(data.table)
DT <- data.table(DF)
yearly_count <- DT[, .N, by = c('Name', 'Year')]
print(yearly_count)
change <- yearly_count[, list(change = diff(N)), by = Name]
print(change)
This produces the following result:
> print(yearly_count)
      Name Year N
1:    Mark 2010 4
2:     Tom 2010 1
3: Rachale 2010 3
4:    Mary 2010 2
5:    Mark 2011 2
6:     Tom 2011 3
7: Rachale 2011 4
8:    Mary 2011 1
> print(change)
      Name change
1:    Mark     -2
2:     Tom      2
3: Rachale      1
4:    Mary     -1
Your data only covers 2 years, so there is a single value per person, representing the change from 2010 to 2011. Mark dropped 2 activities, Tom added 2, and so on.
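A net count does not say which activities changed. As a possible complement, here is a base R sketch (not part of the answer above, and not using data.table) that computes the same net change and also lists the activity names with setdiff. It uses only Mark's and Mary's rows from the question, and the names acts, counts, net, and changes are placeholders for this example.

```r
# Mark's and Mary's rows from the question's data
acts <- data.frame(
  Year = c(2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011),
  Name = c("Mark", "Mark", "Mark", "Mark", "Mary", "Mary",
           "Mark", "Mark", "Mary"),
  Activity = c("Tennis", "Swim", "Basketball", "Volleyball",
               "Volleyball", "Yoga", "Tennis", "Swim", "Volleyball"),
  stringsAsFactors = FALSE
)

# Net change in the number of activities per person, 2010 to 2011
counts <- table(acts$Name, acts$Year)
net <- counts[, "2011"] - counts[, "2010"]
net[["Mark"]]  # -2
net[["Mary"]]  # -1

# Names of the activities added and dropped, per person
changes <- lapply(split(acts, acts$Name), function(d) {
  prev <- d$Activity[d$Year == 2010]
  curr <- d$Activity[d$Year == 2011]
  list(added = setdiff(curr, prev), dropped = setdiff(prev, curr))
})

changes$Mark$dropped  # "Basketball" "Volleyball"
changes$Mary$dropped  # "Yoga"
```

Listing names this way also catches a person who swaps one activity for another, which a net change of 0 would hide.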
https://stackoverflow.com/questions/43073811