I'm trying to calculate the average number of unique fruits per person (my usual practice data). Both of these snippets work fine:
with(df, tapply(fruit, names, FUN = function(x) length(unique(x))))->uniques
sum(uniques)/length(unique(df$names))
aggregate(df[,"fruit"], by=list(id=names), FUN = function(x) length(unique(x)))->d1
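sum(d1$x)/length(unique(df$names))
My problem is that the code does not work when I run it on my real data. My real data is prescription data, and what I want is the average number of unique drugs per person. With the tapply code, it seems to create entirely new patient ids that do not exist in the original df, and it returns 1000 NA values. There are no missing values in my id column or in my drug_code column.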
with(dt3, tapply(drug_code, id, FUN = function(x) length(unique(x))))->uniques
head(uniques)
uniques
Patient HAI0000001 NA
Patient HAI0000003 NA
Patient HAI0000008 NA
Patient HAI0000010 NA
Patient HAI0000014 NA
Patient HAI0000020 NA
table(dt3$id=="Patient HAI0000001") ##checking to see if HA10000001 occurs in original df. the dim of df are 228954 rows and 5 cols
FALSE
228954
With the aggregate code, I get an error:
aggregate(dt3[,"drug_code"], by=list(id=id), FUN = function(x) length(unique(x)))->d1
Error in aggregate.data.frame(as.data.frame(x), ...) :
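arguments must have same length
I don't understand what is going on. My real data resembles my practice data in that it has an id column and a drug/fruit column. Neither df has any missing data. I know lapply is better suited to data frames, but I don't necessarily need a df back. In any case, the tapply code works on the practice data, which is a df. Does anyone know what is happening here?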
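Both symptoms described above can be reproduced on toy data. This is only a sketch of one possible cause, not something confirmed in the thread. It assumes that id in dt3 is a factor still carrying levels from a larger data set, and that a separate vector of a different length is being picked up as the grouping variable. The names toy, stale_id, count_unique, counts, and counts_clean are made up for illustration.

```r
# Hypothetical toy data: the factor keeps a level ("P1") that has no rows.
toy <- data.frame(
  id = factor(c("P2", "P2", "P3"), levels = c("P1", "P2", "P3")),
  drug_code = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

count_unique <- function(x) length(unique(x))

# tapply() reports every factor level, so the empty level "P1" appears as NA.
counts <- with(toy, tapply(drug_code, id, count_unique))
print(counts)

# Dropping unused levels removes the phantom id.
counts_clean <- with(toy, tapply(drug_code, droplevels(id), count_unique))
print(counts_clean)

# aggregate() with a grouping vector of the wrong length (here a stale
# workspace vector, assumed for illustration) stops with an error.
stale_id <- c("P2", "P3")
msg <- tryCatch(
  aggregate(toy[, "drug_code"], by = list(id = stale_id), FUN = count_unique),
  error = function(e) conditionMessage(e)
)
print(msg)

# Referring to the column explicitly avoids picking up the wrong vector.
d1 <- aggregate(toy["drug_code"], by = list(id = toy$id), FUN = count_unique)
print(d1)
```

If this is what is happening in the real data, levels(dt3$id) would list ids that never appear in the rows, and length(id) would differ from nrow(dt3).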
Practice df:
names<-as.character(c("john", "john", "john", "john", "john", "mary", "mary","mary","mary","mary", "jim", "sylvia","ted","ted","mary", "sylvia", "jim", "ted", "john", "ted"))
dates<-as.Date(c("2010-07-01", "2010-09-01", "2010-11-01", "2010-12-01", "2011-01-01", "2010-08-12", "2010-11-11", "2010-05-12", "2010-12-03", "2010-07-12", "2010-12-21", "2010-02-18", "2010-10-29", "2010-08-13", "2010-11-11", "2010-05-12", "2010-04-01", "2010-05-06", "2010-09-28", "2010-11-28" ))
fruit<-as.character(c("kiwi","apple","banana","orange","apple","orange","apple","orange", "apple", "apple", "pineapple", "peach", "nectarine", "grape", "melon", "apricot", "plum", "lychee", "watermelon", "apple" ))
df<-data.frame(names,dates,fruit)
Sample of the real data:
head(dt3)
id quantity date_of_claim drug_code index
1 Patient HAI0000560 1 2009-10-15 R03AC02 2010-04-06
2 Patient HAI0000560 1 2009-10-15 R03AK06 2010-04-06
3 Patient HAI0000560 30 2009-10-15 R03BB04 2010-04-06
4 Patient HAI0000560 30 2009-10-15 A02BC01 2010-04-06
5 Patient HAI0000560 50 2009-10-15 M02AA15 2010-04-06
6 Patient HAI0000560 30 2009-10-15 N02BE51 2010-04-06
Answered 2013-07-19 00:40:56
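In your example you are asking for a single number: the mean of all the length(unique(fruit)) values within the groups defined by the id vector. The following shows the unique counts for each individual first, and then the result of the mean function: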
> with(df, tapply(fruit, names, function(x) length(unique(x)) ))
jim john mary sylvia ted
2 5 3 2 4
> mean ( with(df, tapply(fruit, names, function(x) length(unique(x)) )) )
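[1] 3.2
I would point out that the test for a specific value in your code above has a trailing space, which could cause problems. "string " is not equal to "string". I keep a copy of the trim function from the gdata package in my .Rprofile file so that I can handle this possibility more easily.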
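The trailing-space pitfall can be illustrated with a minimal sketch. It uses base R's trimws() in place of the gdata trim function the answer mentions, purely so that it runs without extra packages. The ids vector is hypothetical.

```r
# Hypothetical ids: the first one carries a trailing space.
ids <- c("Patient HAI0000560 ", "Patient HAI0000560")

# Exact comparison treats the padded value as a different string.
raw_match <- ids == "Patient HAI0000560"

# trimws() (base R, 3.2.0 and later) strips leading and trailing whitespace.
trimmed_match <- trimws(ids) == "Patient HAI0000560"

print(raw_match)
print(trimmed_match)
```

Trimming both the id column and the value being tested before comparing would rule this out as a cause.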
Answered 2013-07-19 00:40:58
I may be missing something, but can't a simple tapply be used here? The following line computes the number of distinct fruits each person ate:
x=tapply(df$fruit,df$names,function(x){length(unique(x))})
Then mean(x) would give you the average across all people?
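As a quick check on the practice data from the question, the sketch below compares mean(x) with the sum/length calculation the question used. The vectors are copied from the practice df. The name manual is introduced here only for illustration.

```r
# Practice data copied from the question.
names <- c("john", "john", "john", "john", "john", "mary", "mary", "mary",
           "mary", "mary", "jim", "sylvia", "ted", "ted", "mary", "sylvia",
           "jim", "ted", "john", "ted")
fruit <- c("kiwi", "apple", "banana", "orange", "apple", "orange", "apple",
           "orange", "apple", "apple", "pineapple", "peach", "nectarine",
           "grape", "melon", "apricot", "plum", "lychee", "watermelon", "apple")
df <- data.frame(names, fruit, stringsAsFactors = FALSE)

# Number of distinct fruits per person.
x <- tapply(df$fruit, df$names, function(v) length(unique(v)))

# The question's approach: total of the counts divided by the number of people.
manual <- sum(x) / length(unique(df$names))

print(x)
print(mean(x))
print(manual)
```

Because tapply returns one count per person, mean(x) and the sum/length form agree whenever every group is non-empty.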
https://stackoverflow.com/questions/17729060