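I have a data set with a response variable measured at 5 treatment levels.
Say I measured soil nitrogen content at 5 levels of soil water content: optimal, 40%, 30%, 20%, and 10%. I have 5 replicates per level.
Now, for each replicate, I want to calculate the non-standardized difference (optimal - 40%, optimal - 30%, optimal - 20%, optimal - 10%) and the standardized difference ((optimal - 40%) / optimal, (optimal - 30%) / optimal, and so on).
Is there a way to do this in R with the tidyverse? I am having trouble writing a loop for it.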
df <- data.frame(
  # 5 soil water levels x 5 replicates, repeated for each Diversity level
  Soilwater = rep(rep(c("optimal", "40", "30", "20", "10"), each = 5), times = 2),
  # 25 rows of "High" followed by 25 rows of "Low"
  Diversity = rep(c("High", "Low"), each = 25),
  Soil_N = c(50, 45, 49, 48, 49, 69, 68, 69, 70, 67,
             79, 78, 79, 78, 77, 89, 89, 87, 88, 89,
             99, 98, 97, 98, 98,
             120, 121, 121, 120, 122, 134, 131, 132, 134, 131,
             145, 148, 149, 147, 148, 159, 159, 157, 156, 157,
             169, 167, 167, 168, 164)
)

I used the code below, suggested by @JonSpring, which was very helpful.
library(dplyr)

df %>%
  # First, we can add a `Replicate` number based on position within
  # each Soilwater/Diversity cohort.
  group_by(Soilwater, Diversity) %>%
  mutate(Replicate = row_number()) %>%
  # Calc diff vs. experiment with same Diversity & Replicate, optimal Soilwater
  group_by(Diversity, Replicate) %>%
  mutate(Difference = Soil_N - Soil_N[Soilwater == "optimal"]) %>%
  # Summarize avg diffs
  group_by(Soilwater, Diversity) %>%
  summarize(Mean_Diff = mean(Difference), .groups = "drop")

However, I realized that I first need to average the optimal soil water level and then calculate the difference between that mean and the other soil water levels. I tried the code below (taking the mean of the optimal values with mean() before computing the difference), but it does not work.
df %>%
  group_by(Soilwater, Diversity) %>%
  mutate(Replicate = row_number()) %>%
  group_by(Diversity, Replicate) %>%
  mutate(Difference = mean(Soil_N[Soilwater == "optimal"]) - Soil_N)

Answered 2021-11-07 14:57:07
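Your question is a bit hard to follow, so I will start from what already works for you.
As you are new to R and the tidyverse, note that %>% (the pipe) chains your operations onto the (starting) object.
You can assign any state/stage of that chain to a new object (aka variable).
I also recommend creating a few "interim" objects while you work on a problem, to store the individual steps of your problem/algorithm. This gives you a better feel for what you have at each point. Over time you will gain enough experience to chain operations and skip some of the interim stages/objects.
To that end, I introduce an "interim" result/object as your description suggests: everything up to that point is assigned to interim_df <- ....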
library(dplyr)

interim_df <- df %>%
  group_by(Soilwater, Diversity) %>%
  mutate(Replicate = row_number()) %>%
  group_by(Diversity, Replicate)

This yields an object interim_df. Let's have a look:
interim_df
# A tibble: 50 x 4
# Groups:   Diversity, Replicate [10]
   Soilwater Diversity Soil_N Replicate
   <chr>     <chr>      <dbl>     <int>
 1 optimal   High          50         1
 2 optimal   High          45         2
 3 optimal   High          49         3
 4 optimal   High          48         4
 5 optimal   High          49         5
 6 40        High          69         1
 7 40        High          68         2
 8 40        High          69         3
 9 40        High          70         4
10 40        High          67         5

OK. We have a tibble with 50 rows and 4 variables. This seems to be the data structure you are happy with. Note that it is still a grouped data frame. When you want to operate on the whole data (or on other parts of the data frame), make sure to call ungroup() first.
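To see why the grouping matters, here is a minimal sketch on hypothetical toy data (the names toy, group, and value are illustrative and not part of the question). While a data frame is grouped, mean() inside mutate() is evaluated once per group; after ungroup() it is evaluated over all rows. In the attempt from the question, each Diversity/Replicate group contains exactly one optimal row, so the "mean" there is just that single value.

```r
library(dplyr)

# Hypothetical toy data, not the question's data set
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# Grouped: mean() is computed separately within each group
grouped <- toy %>%
  group_by(group) %>%
  mutate(m = mean(value))

# Ungrouped: mean() is computed over all four rows
ungrouped <- grouped %>%
  ungroup() %>%
  mutate(m = mean(value))

grouped$m    # 2 2 15 15
ungrouped$m  # 8.5 8.5 8.5 8.5
```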
interim_df <- interim_df %>% ungroup()

You can "extract" your "optimal" measurements and calculate the mean of this "new" df/tibble.
mean_optimal <- interim_df %>%
  filter(Soilwater == "optimal") %>%
  summarise(MeanOptimal = mean(Soil_N))  # we calculate/summarise the mean over the part we want

This gives you
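# A tibble: 1 x 1
  MeanOptimal
        <dbl>
1        84.5

To be clear, we now have another tibble with a single variable/column. It can be used with your interim_df. However, make sure you understand how to "extract" a column from a tibble (which also gives you a reusable vector). The base R notation $ lets you access a column (vector) directly; the tidyverse offers the pull() function.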
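As a small illustration of the two extraction styles, the sketch below uses a stand-in one-row data frame with the value 84.5 shown above; in the real workflow you would use the mean_optimal tibble computed earlier. The names via_dollar and via_pull are illustrative only.

```r
library(dplyr)

# Stand-in for the one-row summary shown above
mean_optimal <- data.frame(MeanOptimal = 84.5)

# Base R: `$` returns the column as a plain numeric vector
via_dollar <- mean_optimal$MeanOptimal

# dplyr: pull() does the same and fits at the end of a pipe
via_pull <- mean_optimal %>% pull(MeanOptimal)

identical(via_dollar, via_pull)  # TRUE
```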
final <- interim_df %>% mutate(Difference = mean_optimal$MeanOptimal - Soil_N)
final
# A tibble: 50 x 5
   Soilwater Diversity Soil_N Replicate Difference
   <chr>     <chr>      <dbl>     <int>      <dbl>
 1 optimal   High          50         1       34.5
 2 optimal   High          45         2       39.5
 3 optimal   High          49         3       35.5
 4 optimal   High          48         4       36.5
 5 optimal   High          49         5       35.5
 6 40        High          69         1       15.5
 7 40        High          68         2       16.5
 8 40        High          69         3       15.5
 9 40        High          70         4       14.5
10 40        High          67         5       17.5

You can also "add" mean_optimal$MeanOptimal to interim_df as a new column with interim_df %>% mutate(MeanOptimal = mean_optimal$MeanOptimal) and then calculate the difference from that column.
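The question also mentions a standardized difference, and the optimal mean above pools both Diversity levels. If the reference should instead be the optimal mean within each Diversity level (an assumption on my side, the question does not state it explicitly), a possible sketch groups by Diversity only and computes both differences in one mutate(). The column names Standardized and Mean_Std are illustrative.

```r
library(dplyr)

df <- data.frame(
  Soilwater = rep(rep(c("optimal", "40", "30", "20", "10"), each = 5), times = 2),
  Diversity = rep(c("High", "Low"), each = 25),
  Soil_N = c(50, 45, 49, 48, 49, 69, 68, 69, 70, 67,
             79, 78, 79, 78, 77, 89, 89, 87, 88, 89,
             99, 98, 97, 98, 98,
             120, 121, 121, 120, 122, 134, 131, 132, 134, 131,
             145, 148, 149, 147, 148, 159, 159, 157, 156, 157,
             169, 167, 167, 168, 164)
)

result <- df %>%
  # Assumption: one optimal mean per Diversity level
  group_by(Diversity) %>%
  mutate(
    MeanOptimal  = mean(Soil_N[Soilwater == "optimal"]),
    Difference   = MeanOptimal - Soil_N,        # non-standardized
    Standardized = Difference / MeanOptimal     # standardized
  ) %>%
  ungroup()

# Average both differences per Diversity and soil water level
summary_tbl <- result %>%
  filter(Soilwater != "optimal") %>%
  group_by(Diversity, Soilwater) %>%
  summarise(
    Mean_Diff = mean(Difference),
    Mean_Std  = mean(Standardized),
    .groups = "drop"
  )
```

Dropping the group_by(Diversity) step reproduces the pooled reference of 84.5 used above.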
https://stackoverflow.com/questions/69872913