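In my data, I have 10 unique sample dates for each animal on which we measure clinical signs. On each date, two people record clinical signs (temperature, swelling, etc.) for each animal. Taking all other parts of the data into account, each animal has four rows with the same sample date. Two of those rows carry one set of initials, and the other two rows carry either a different set of initials or NA (when that sampler was absent that day). My goal is to remove the NA rows in the case where, on the same date, 2 of the 4 rows for an animal have a set of initials and the other 2 rows for the same date (same animal) have NA for the initials.
Clarification: there are other NAs in the Initials column that I want to keep. For example, for animal 6 I want to leave all of the NAs. But for the other animals that have 4 rows, where two of the rows are filled with initials and the other two have NAs, I want to remove the NA rows. Thank you!
Here is some example code: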
# 22 rows: animals 1-5 have 4 rows each, animal 6 has 2 rows (matching the expected output below)
Data <- data.frame(matrix(ncol = 3, nrow = 22))
colnames(Data) <- c('AnimalID', 'DateSampled', 'Initials')
Data$AnimalID <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6)
Data$DateSampled <- as.Date(c("2021-10-13", "2021-10-13", "2021-10-13", "2021-10-13", "2021-10-27", "2021-10-27", "2021-10-27", "2021-10-27", "2021-11-10", "2021-11-10", "2021-11-10", "2021-11-10", "2021-11-24", "2021-11-24", "2021-11-24", "2021-11-24", "2021-12-01", "2021-12-01", "2021-12-01", "2021-12-01", "2021-12-05", "2021-12-05"))
# Initials must have the same length (22) as the other columns
Data$Initials <- c("AB", "AB", NA, NA, "AB", "AB", "CD", "CD", "AB", "AB", NA, NA, "AB", "AB", "CD", "CD", "AB", "AB", NA, NA, NA, NA)
Expected output:
AnimalID | DateSampled | Initials
1 | "2021-10-13" | AB
1 | "2021-10-13" | AB
2 | "2021-10-27" | AB
2 | "2021-10-27" | AB
2 | "2021-10-27" | CD
2 | "2021-10-27" | CD
3 | "2021-11-10" | AB
3 | "2021-11-10" | AB
4 | "2021-11-24" | AB
4 | "2021-11-24" | AB
4 | "2021-11-24" | CD
4 | "2021-11-24" | CD
5 | "2021-12-01" | AB
5 | "2021-12-01" | AB
6 | "2021-12-05" | NA
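6 | "2021-12-05" | NA
Whether with a for loop or a conditional vector: if there is "AB" (or any other set of initials) and an NA for the same animal ID and sample date, I want to remove the rows that have NA. Thank you for your help!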
Posted on 2022-09-22 03:16:09
Here is one way to do this with dplyr. filter(!is.na(Initials)) removes all rows with NA, and distinct() removes duplicate rows:
library(dplyr)
Data %>%
filter(!is.na(Initials)) %>%
distinct()
EweID DateSampled Initials
1 1 2021-10-13 AB
2 2 2021-10-27 AB
3 2 2021-10-27 CD
4 3 2021-11-10 AB
5 4 2021-11-24 AB
6 4 2021-11-24 CD
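7 5 2021-12-01 AB
Update
Thanks for clarifying your output; here is one way to achieve it. The first step is to create an intermediate data frame that counts the number of NAs for each animal: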
Number_of_NA = Data %>%
group_by(AnimalID)%>%
summarise(n = sum(is.na(Initials)))
> Number_of_NA
# A tibble: 7 x 2
AnimalID n
<dbl> <int>
1 1 2
2 2 0
3 3 2
4 4 0
5 5 2
6 6 4
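7 7 4
If I understood correctly, the groups whose NAs should be kept always contain 4 NA values. You can use this to filter all NAs out of the data frame as before, and then join back only the groups with 4 NAs: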
Data %>% filter(!is.na(Initials)) %>%
full_join(filter(Data, AnimalID %in% Number_of_NA$AnimalID[Number_of_NA$n == 4]))
AnimalID DateSampled Initials
1 1 2021-10-13 AB
2 1 2021-10-13 AB
3 2 2021-10-27 AB
4 2 2021-10-27 AB
5 2 2021-10-27 CD
6 2 2021-10-27 CD
7 3 2021-11-10 AB
8 3 2021-11-10 AB
9 4 2021-11-24 AB
10 4 2021-11-24 AB
11 4 2021-11-24 CD
12 4 2021-11-24 CD
13 5 2021-12-01 AB
14 5 2021-12-01 AB
15 6 2021-12-05 <NA>
16 6 2021-12-05 <NA>
17 6 2021-12-05 <NA>
18 6 2021-12-05 <NA>
19 7 2021-12-15 <NA>
20 7 2021-12-15 <NA>
21 7 2021-12-15 <NA>
22 7 2021-12-15 <NA>
Data
Data = structure(list(AnimalID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3,
3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7), DateSampled = structure(c(18913,
18913, 18913, 18913, 18927, 18927, 18927, 18927, 18941, 18941,
18941, 18941, 18955, 18955, 18955, 18955, 18962, 18962, 18962,
18962, 18966, 18966, 18966, 18966, 18976, 18976, 18976, 18976
), class = "Date"), Initials = c("AB", "AB", NA, NA, "AB", "AB",
"CD", "CD", "AB", "AB", NA, NA, "AB", "AB", "CD", "CD", "AB",
"AB", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
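-28L), class = "data.frame")
Update 2
Here is a modification that matches your filtering rule. In the first data frame, after group_by() on the animal ID and the date, we count the number of NAs (with_NA) and the total number of observations (total_n). If with_NA equals total_n, only NAs are available for that ID and date, so those NAs are kept.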
library(dplyr)
df_filt = Data %>%
group_by(AnimalID, DateSampled)%>%
summarise(with_NA = sum(is.na(Initials)), total_n = n(),
to_filter = with_NA == total_n) %>%
filter(to_filter == TRUE)
# A tibble: 3 x 5
# Groups: AnimalID [3]
AnimalID DateSampled with_NA total_n to_filter
<dbl> <date> <int> <int> <lgl>
1 3 2021-11-11 1 1 TRUE
2 6 2021-12-05 4 4 TRUE
3 7 2021-12-16 2 2 TRUE
Then, as before, we filter all NAs out of the data frame and join back the NAs we want to keep, based on the data frame above:
Data %>% filter(!is.na(Initials)) %>%
full_join(filter(Data, AnimalID %in% df_filt$AnimalID & DateSampled %in% df_filt$DateSampled))%>%
arrange(AnimalID)
AnimalID DateSampled Initials
1 1 2021-10-13 AB
2 1 2021-10-13 AB
3 2 2021-10-27 AB
4 2 2021-10-27 AB
5 2 2021-10-27 CD
6 2 2021-10-27 CD
7 3 2021-11-10 AB
8 3 2021-11-10 AB
9 3 2021-11-11 <NA>
10 4 2021-11-24 AB
11 4 2021-11-24 AB
12 4 2021-11-24 CD
13 4 2021-11-24 CD
14 5 2021-12-01 AB
15 5 2021-12-01 AB
16 6 2021-12-05 <NA>
17 6 2021-12-05 <NA>
18 6 2021-12-05 <NA>
19 6 2021-12-05 <NA>
20 7 2021-12-15 CB
21 7 2021-12-16 <NA>
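22 7 2021-12-16 <NA>
In this case, every NA that shares a date and AnimalID with a row that has initials is discarded, and only the NAs for which no initials exist on that date are kept.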
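Since the rule is about an AnimalID and a date taken together, one option is to match them as pairs rather than with two independent %in% tests. The following is only a sketch on a small hypothetical data frame (toy, d1 and d2 are made up for illustration and are not from the question). In data shaped like this, the two separate %in% checks would also let the NA row for animal 1 on d2 through, because animal 1 and d2 each appear somewhere in the all-NA table.

```r
library(dplyr)

# Hypothetical toy data (an assumption, not the poster's data):
# animal 1 is all-NA on d1 but has initials on d2; animal 2 is all-NA on d2.
d1 <- as.Date("2021-12-01")
d2 <- as.Date("2021-12-05")
toy <- data.frame(
  AnimalID    = c(1, 1, 1, 2),
  DateSampled = c(d1, d2, d2, d2),
  Initials    = c(NA, "AB", NA, NA)
)

# (AnimalID, DateSampled) pairs in which every row is NA
all_na_pairs <- toy %>%
  group_by(AnimalID, DateSampled) %>%
  summarise(all_na = all(is.na(Initials)), .groups = "drop") %>%
  filter(all_na)

# Rows with initials, plus NA rows whose (ID, date) pair is entirely NA.
# semi_join() matches on both keys at once.
result <- bind_rows(
  filter(toy, !is.na(Initials)),
  semi_join(filter(toy, is.na(Initials)), all_na_pairs,
            by = c("AnimalID", "DateSampled"))
) %>%
  arrange(AnimalID, DateSampled)
```

Here result keeps the NA for animal 1 on d1 and for animal 2 on d2, and drops the NA that sits next to "AB" for animal 1 on d2.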
Note that I slightly modified the data here to reflect the desired output.
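As a possible shortcut, the same rule (keep a row if it has initials, or if its whole AnimalID/date group is NA) might also be written as a single grouped filter, without the intermediate table or the join. This is a sketch rather than the method used above; the data frame below rebuilds the modified data with plain vectors, and AnimalID is stored as integer here instead of double.

```r
library(dplyr)

# The modified data from this answer, rebuilt by hand (28 rows)
Data <- data.frame(
  AnimalID = rep(1:7, each = 4),
  DateSampled = as.Date(c(
    rep("2021-10-13", 4), rep("2021-10-27", 4),
    rep("2021-11-10", 3), "2021-11-11",
    rep("2021-11-24", 4), rep("2021-12-01", 4),
    rep("2021-12-05", 4),
    rep("2021-12-15", 2), rep("2021-12-16", 2)
  )),
  Initials = c("AB", "AB", NA, NA, "AB", "AB", "CD", "CD",
               "AB", "AB", NA, NA, "AB", "AB", "CD", "CD",
               "AB", "AB", NA, NA, NA, NA, NA, NA,
               "CB", NA, NA, NA)
)

# Within each (AnimalID, DateSampled) group, keep a row when it has
# initials, or when every row of the group is NA.
result <- Data %>%
  group_by(AnimalID, DateSampled) %>%
  filter(!is.na(Initials) | all(is.na(Initials))) %>%
  ungroup()
```

Because all(is.na(Initials)) is evaluated per group, an NA row survives only when no initials exist for that animal on that date.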
Data 2
> Data
AnimalID DateSampled Initials
1 1 2021-10-13 AB
2 1 2021-10-13 AB
3 1 2021-10-13 <NA>
4 1 2021-10-13 <NA>
5 2 2021-10-27 AB
6 2 2021-10-27 AB
7 2 2021-10-27 CD
8 2 2021-10-27 CD
9 3 2021-11-10 AB
10 3 2021-11-10 AB
11 3 2021-11-10 <NA>
12 3 2021-11-11 <NA>
13 4 2021-11-24 AB
14 4 2021-11-24 AB
15 4 2021-11-24 CD
16 4 2021-11-24 CD
17 5 2021-12-01 AB
18 5 2021-12-01 AB
19 5 2021-12-01 <NA>
20 5 2021-12-01 <NA>
21 6 2021-12-05 <NA>
22 6 2021-12-05 <NA>
23 6 2021-12-05 <NA>
24 6 2021-12-05 <NA>
25 7 2021-12-15 CB
26 7 2021-12-15 <NA>
27 7 2021-12-16 <NA>
28 7 2021-12-16 <NA>
Data = structure(list(AnimalID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3,
3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7), DateSampled = structure(c(18913,
18913, 18913, 18913, 18927, 18927, 18927, 18927, 18941, 18941,
18941, 18942, 18955, 18955, 18955, 18955, 18962, 18962, 18962,
18962, 18966, 18966, 18966, 18966, 18976, 18976, 18977, 18977
), class = "Date"), Initials = c("AB", "AB", NA, NA, "AB", "AB",
"CD", "CD", "AB", "AB", NA, NA, "AB", "AB", "CD", "CD", "AB",
"AB", NA, NA, NA, NA, NA, NA, "CB", NA, NA, NA)), row.names = c(NA,
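-28L), class = "data.frame")
Posted on 2022-09-22 03:36:26
It would help if you could provide an expected output.
The filtering logic is a bit hard to follow.
From what I can gather, you just want to remove all rows with NA in the Initials column and drop the duplicate rows.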
Data <- Data[!is.na(Data$Initials),]
Data <- Data[!duplicated(Data),]
This is how I would do it with the tidyverse.
distinct() outputs only the distinct rows of the data, and filter() removes any rows with NA in the Initials field.
library(tidyverse)
Data %>%
distinct() %>%
filter(!is.na(Initials))
# EweID DateSampled Initials
# 1 1 2021-10-13 AB
# 2 2 2021-10-27 AB
# 3 2 2021-10-27 CD
# 4 3 2021-11-10 AB
# 5 4 2021-11-24 AB
# 6 4 2021-11-24 CD
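# 7 5 2021-12-01 AB
If you still want to include the NA rows where an EweID has no initials other than NA, just add another step to find the EweID-DateSampled combinations that have only NA in the Initials column.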
Data %>% distinct() %>%
group_by(EweID, DateSampled) %>%
summarise("var"=paste(Initials, collapse='-'))
# EweID DateSampled var
# 1 1 2021-10-13 AB-NA
# 2 2 2021-10-27 AB-CD
# 3 3 2021-11-10 AB-NA
# 4 4 2021-11-24 AB-CD
# 5 5 2021-12-01 AB-NA
# 6 6 2021-12-02 NA
Filter the NA rows and rbind them to the output above:
Data %>% distinct() %>%
group_by(EweID, DateSampled) %>%
summarise("var"=paste(Initials, collapse='-')) %>%
filter(var=="NA")
# EweID DateSampled var
# 1 6 2021-12-02 NA
https://stackoverflow.com/questions/73808679