
Calculating the mean of differences by grouping variable

Stack Overflow user
Asked on 2021-11-07 13:34:36

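I have a data set in which a response variable was measured under 5 treatment levels.

Suppose I measured soil nitrogen content at 5 levels of soil water content: optimal, 40%, 30%, 20% and 10%. I have 5 replicates per level.

Now I want to calculate, for each replicate, the non-standardized differences (optimal - 40%, optimal - 30%, optimal - 20%, optimal - 10%) and the standardized differences (optimal - 40% / optimal, optimal - 30% / optimal, etc.).

Is there a way to do this in R with the tidyverse? I am having trouble writing a loop for it. Each treatment level has 5 replicates.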

df <- data.frame(
    # 5 soil water levels x 5 replicates, repeated once per Diversity level
    Soilwater = rep(rep(c("optimal", "40", "30", "20", "10"), each = 5), times = 2),
    Diversity = rep(c("High", "Low"), each = 25),
    Soil_N = c(50, 45, 49, 48, 49, 69, 68, 69, 70, 67, 79, 78, 79, 78, 77,
               89, 89, 87, 88, 89, 99, 98, 97, 98, 98,
               120, 121, 121, 120, 122, 134, 131, 132, 134, 131,
               145, 148, 149, 147, 148, 159, 159, 157, 156, 157,
               169, 167, 167, 168, 164))

I used the code below, suggested by @JonSpring, which was very helpful.

df %>%
    # First, we can add a `Replicate` number based on position within 
    # each Soilwater/Diversity cohort.
    group_by(Soilwater, Diversity) %>%
    mutate(Replicate = row_number()) %>%

    # Calc diff vs. experiment with same Diversity & Replicate, optimal Soilwater 
    group_by(Diversity, Replicate) %>%
    mutate(Difference = Soil_N - Soil_N[Soilwater == "optimal"]) %>%

    # Summarize avg diffs
    group_by(Soilwater, Diversity) %>%
    summarize(Mean_Diff = mean(Difference), .groups = "drop")

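However, I realized that I first need to compute the mean of the optimal soil water level and then calculate the difference between that mean and the other soil water levels. I tried the code below (using mean() to average the optimal soil water level before taking the difference), but it does not work.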

df %>%
    group_by(Soilwater, Diversity) %>%
    mutate(Replicate = row_number()) %>%
    group_by(Diversity, Replicate) %>%
    mutate(Difference = mean(Soil_N[Soilwater == "optimal"]) - Soil_N)

1 Answer

Stack Overflow user

Accepted answer

Answered on 2021-11-07 14:57:07

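Your question is hard to follow, so I will start from what already works for you.

As you are new to R and the tidyverse, note that %>% (the pipe) chains your operations onto a (starting) object.

You can assign any state/stage of the operations to a new object (aka variable).

I also recommend that, while working through a problem, you create a few "interim" objects to store the steps of your problem/algorithm. This gives you a better feel for what you have. Over time you will gain enough experience to chain operations and skip some of the interim stages/objects.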
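To illustrate the point about the pipe and interim objects, here is a minimal sketch. The data frame `toy` and its columns `group` and `value` are made up for this example and are not part of the question's data; the sketch only shows that a nested call, a pipe chain, and a chain split through an interim object should give the same summary.

```r
library(dplyr)

# Hypothetical toy data, not the data set from the question
toy <- data.frame(group = c("a", "a", "b", "b"),
                  value = c(1, 3, 10, 20))

# Nested call: has to be read from the inside out
nested <- summarise(group_by(toy, group), mean_value = mean(value))

# The same operations written as a pipe chain: read top to bottom
piped <- toy %>%
  group_by(group) %>%
  summarise(mean_value = mean(value))

# The same chain split in two, storing an interim object that can be inspected
grouped_toy <- toy %>% group_by(group)
stepwise <- grouped_toy %>% summarise(mean_value = mean(value))

piped$mean_value   # group means: 2 for "a", 15 for "b"
```

Printing `grouped_toy` before the last step is a cheap way to check that the grouping is what you expect.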

To that end, I introduce an "interim" result/object, following your description: everything up to that point is assigned to interim_df <- ...

library(dplyr)

interim_df <- df %>%
    group_by(Soilwater, Diversity)  %>%
    mutate(Replicate = row_number()) %>%
    group_by(Diversity, Replicate)

This creates an object interim_df. Let's take a look:

interim_df
# A tibble: 50 x 4
# Groups:   Diversity, Replicate [10]
   Soilwater Diversity Soil_N Replicate
   <chr>     <chr>      <dbl>     <int>
 1 optimal   High          50         1
 2 optimal   High          45         2
 3 optimal   High          49         3
 4 optimal   High          48         4
 5 optimal   High          49         5
 6 40        High          69         1
 7 40        High          68         2
 8 40        High          69         3
 9 40        High          70         4
10 40        High          67         5

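OK. We have 50 rows with 4 variables. This seems to be the data structure you are happy with. You also have "grouped data". When you want to operate on the whole data (or on other parts of the data frame), make sure to use ungroup().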

interim_df <- interim_df %>% ungroup()
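Why ungroup() matters can be seen in a small sketch. The four-row data frame `toy` below is a hypothetical subset built only for this illustration; the point is that mean() inside mutate() is evaluated per group while the data is grouped, and over all rows once it is ungrouped.

```r
library(dplyr)

# Hypothetical four-row example, not the full data set
toy <- data.frame(Diversity = c("High", "High", "Low", "Low"),
                  Soil_N    = c(50, 45, 120, 121))

# While grouped, mean() is computed separately within each Diversity level
grouped <- toy %>%
  group_by(Diversity) %>%
  mutate(Mean_N = mean(Soil_N))    # 47.5 for High, 120.5 for Low

# After ungroup(), mean() is computed over all rows
ungrouped <- grouped %>%
  ungroup() %>%
  mutate(Mean_N = mean(Soil_N))    # 84 in every row
```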

You can "extract" your "optimal" measurements and calculate the mean over this "new" df/tibble.

mean_optimal <- interim_df %>%
    filter(Soilwater == "optimal") %>%
    summarise(MeanOptimal = mean(Soil_N))   # we calculate/summarise the mean over the part we want

This gives you

# A tibble: 1 x 1
  MeanOptimal
        <dbl>
1        84.5

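To be clear, we now have another tibble with a single variable/column. It can be used together with your interim_df. However, make sure you understand how to "extract" a column from a tibble (which also gives you a reusable vector). The base-R notation $ gives you direct access to a column (as a vector); the tidyverse offers the pull() function.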

final <- interim_df %>% mutate(Difference = mean_optimal$MeanOptimal - Soil_N)
final
# A tibble: 50 x 5
   Soilwater Diversity Soil_N Replicate Difference
   <chr>     <chr>      <dbl>     <int>      <dbl>
 1 optimal   High          50         1       34.5
 2 optimal   High          45         2       39.5
 3 optimal   High          49         3       35.5
 4 optimal   High          48         4       36.5
 5 optimal   High          49         5       35.5
 6 40        High          69         1       15.5
 7 40        High          68         2       16.5
 8 40        High          69         3       15.5
 9 40        High          70         4       14.5
10 40        High          67         5       17.5
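The two ways of extracting a column mentioned above can be compared in a short sketch. The one-row tibble `mean_tbl` is rebuilt by hand here (using the 84.5 shown earlier) purely so that the example is self-contained; both $ and pull() should return the same plain numeric vector.

```r
library(dplyr)

# Stand-in for the one-row summary tibble (value copied from the output above)
mean_tbl <- tibble(MeanOptimal = 84.5)

via_dollar <- mean_tbl$MeanOptimal            # base-R extraction
via_pull   <- mean_tbl %>% pull(MeanOptimal)  # tidyverse extraction

# Both are plain numeric vectors and can be used in ordinary arithmetic
via_pull - c(50, 69)   # 34.5 and 15.5, matching rows 1 and 6 above
```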

You could also "add" mean_optimal$MeanOptimal to interim_df as a new column with interim_df %>% mutate(MeanOptimal = mean_optimal$MeanOptimal), and then compute the difference.
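The steps above use one overall mean across both Diversity levels. If the optimal mean should instead be taken separately within each Diversity level (an assumption, since the question does not say so explicitly), a sketch could look like the following. The column names Mean_Optimal, Std_Difference, Mean_Diff and Mean_Std_Diff are made up for this example, and "standardized" is read here as the difference divided by the optimal mean, which is one possible interpretation of the question.

```r
library(dplyr)

df <- data.frame(
  Soilwater = rep(rep(c("optimal", "40", "30", "20", "10"), each = 5), times = 2),
  Diversity = rep(c("High", "Low"), each = 25),
  Soil_N = c(50, 45, 49, 48, 49, 69, 68, 69, 70, 67, 79, 78, 79, 78, 77,
             89, 89, 87, 88, 89, 99, 98, 97, 98, 98,
             120, 121, 121, 120, 122, 134, 131, 132, 134, 131,
             145, 148, 149, 147, 148, 159, 159, 157, 156, 157,
             169, 167, 167, 168, 164))

result <- df %>%
  # Assumption: one optimal mean per Diversity level
  group_by(Diversity) %>%
  mutate(
    Mean_Optimal   = mean(Soil_N[Soilwater == "optimal"]),
    Difference     = Mean_Optimal - Soil_N,        # non-standardized
    Std_Difference = Difference / Mean_Optimal     # assumed standardization
  ) %>%
  # Average the per-replicate differences within each treatment cell
  group_by(Soilwater, Diversity) %>%
  summarise(Mean_Diff     = mean(Difference),
            Mean_Std_Diff = mean(Std_Difference),
            .groups = "drop")

result
```

Dropping the final group_by()/summarise() step keeps one row per replicate, which is what the question asks for when it says "for each replicate".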

Votes: 3

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/69872913
