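I am a French student, so my English is not very good, sorry.
We are converting a dataset of species and their locations into a dataset of the origins of those species at each location.
The dataset has about 600 columns named U, A, W, L or E (the origin of the species), each containing 0 or 1 (absence/presence of the species at the location),
and 2 columns with the coordinates (corresponding to the sampling station).
It has more than 8000 rows, one for each station where data was found.
A simplified version of the dataset looks like this: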
[Longitude] [Latitude] [A][U][U][L][E][A][U] ... [+600]
[1,] -5.89 35.71 0 0 1 0 0 1 1
[2,] -5.89 35.81 0 1 0 0 0 0 1
[3,] -5.89 36.01 1 0 0 1 1 1 0
[4,] -5.89 36.1 0 0 0 1 0 1 0
[1,] -5.89 36.21 1 1 1 0 0 1 1
[2,] -5.79 35.81 1 1 0 1 0 1 0
[3,] -5.79 35.91 0 1 0 0 0 0 1
[4,] -5.79 36.01 1 1 0 1 0 1 0
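[+8000]
What we want to do is a kind of conditional sum, in which all columns with the same origin are grouped into a single column and their contents are summed, like this: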
[Longitude] [Latitude] [A][U][L][W][E]
[1,] -5.89 35.71 12 6 5 0 13
[2,] -5.89 35.81 5 1 8 10 20
[3,] -5.89 36.01 1 28 3 6 2
[4,] -5.89 36.1 4 25 0 1 11
[1,] -5.89 36.21 9 1 9 3 5
[2,] -5.79 35.81 6 5 12 1 8
[3,] -5.79 35.91 5 2 7 15 10
[4,] -5.79 36.01 10 3 5 12 4
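[+8000]
Only A, U, L, E and W must be summed. Longitude, Latitude and the number of rows must stay unchanged.
We tried aggregate and tapply without success; maybe a loop is needed...
Any ideas?
Thanks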
Function (MacOS): espOri => df, espagg => df.agg
espagg <- aggregate(. ~ Longitude + Latitude,
especeOri,
FUN = sum)
aggregate.columns <- function(especeOri, column.names)
{
for (column.name in column.names) {
especeOri[[column.name]] <- rowSums(subset(espagg, select = grep(paste(column.name, ".*", sep = ""), colnames(especeOri))))
}
return(especeOri)
}
aggregate.column.names <- c("A", "U", "L", "E", "W")
espagg <- aggregate.columns(espagg, aggregate.column.names)
espagg <- subset(especeOri, select = c("Longitude", "Latitude", aggregate.column.names))
View(espagg)
dput of the dataset:
dput(especeOri[1:10,1:20])
structure(list(Longitude = c(-5.89, -5.89, -5.89, -5.89, -5.89,
-5.79, -5.79, -5.79, -5.79, -5.69), Latitude = c(35.71, 35.81,
36.01, 36.11, 36.21, 35.81, 35.91, 36.01, 36.11, 35.81), L = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), U.1 = c(0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
0L, 1L), A = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U.2 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), E = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), U.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), E.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U.4 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U.5 = c(0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L), U.6 = c(1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 1L), L.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
U.7 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), U.8 = c(0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L), U.9 = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), U.10 = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), A.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), U.11 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
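1L)), row.names = c(NA, 10L), class = "data.frame")
A Google Drive folder has all the datasets, some explanations and our script: https://drive.google.com/drive/folders/1fnWnZZDC3gyWTtSoqi_l7Wuhje5qpJmL?usp=sharing
Edit: added some longitude and latitude values for illustration, and a screenshot.
Answered 2020-11-24 23:50:15
The code below should do the job.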
df <- data.frame(Longitude = c(-5.89, -5.89, -5.89, -5.89, -5.89, -5.79, -5.79, -5.79, -5.89, -5.89),
Latitude = c(35.71, 35.81, 36.01, 36.1, 36.21, 35.81, 35.91, 36.01, 35.71, 35.81),
A = c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1),
U = c(0, 1, 0, 0, 1, 1, 1, 1, 1, 1),
U = c(1, 0, 0, 0, 1, 0, 0, 0, 1, 1),
L = c(0, 0, 1, 1, 0, 1, 0, 1, 1, 1),
E = c(0, 0, 1, 0, 0, 0, 0, 0, 1, 1),
A = c(1, 0, 1, 1, 1, 1, 0, 1, 1, 1),
U = c(1, 1, 0, 0, 1, 0, 1, 0, 1, 1))
df.agg <- aggregate(. ~ Longitude + Latitude,
df,
FUN = sum)
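# Anchor each pattern so that only the origin column and its numbered duplicates
# match (an unanchored "L.*" would also pick up Longitude and Latitude).
df.agg$A <- rowSums(subset(df.agg, select = grep("^A(\\.[0-9]+)?$", colnames(df.agg))))
df.agg$U <- rowSums(subset(df.agg, select = grep("^U(\\.[0-9]+)?$", colnames(df.agg))))
df.agg$L <- rowSums(subset(df.agg, select = grep("^L(\\.[0-9]+)?$", colnames(df.agg))))
df.agg$E <- rowSums(subset(df.agg, select = grep("^E(\\.[0-9]+)?$", colnames(df.agg))))
df.agg <- subset(df.agg, select = c(Longitude, Latitude, A, U, L, E))
Update
The OP asked for a solution in which the rowSums calls do not have to be written out explicitly, because with many columns that is inconvenient. The code below should do the job.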
df <- structure(list(Longitude = c(-5.89, -5.89, -5.89, -5.89, -5.89,
-5.79, -5.79, -5.79, -5.79, -5.69), Latitude = c(35.71, 35.81,
36.01, 36.11, 36.21, 35.81, 35.91, 36.01, 36.11, 35.81), L = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), U.1 = c(0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
0L, 1L), A = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U.2 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), E = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), U.3 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), E.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U.4 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), U.5 = c(0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L), U.6 = c(1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 1L), L.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
U.7 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), U.8 = c(0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L), U.9 = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), U.10 = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), A.1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), U.11 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), row.names = c(NA, 10L), class = "data.frame")
df.agg <- aggregate(. ~ Longitude + Latitude,
df,
FUN = sum)
# This function sums columns whose names share the same start.
# Suppose we have a data frame with column names A, A.1, and A.2. Then,
# the values of these columns are summed row by row. So (one column per line),
# A 1 1 0
# A.1 2 1 0
# A.2 0 0 1
# becomes
# A 3 2 1
aggregate.columns <- function(df.my, column.names)
{
for (column.name in column.names) {
df.my[[column.name]] <- df.my[[column.name]] +
rowSums(subset(df.my,
select = grep(paste(column.name, ".[1-9]+", sep = ""),
colnames(df.my))))
}
return(df.my)
}
aggregate.column.names <- c("A", "U", "L", "E")
df.agg <- aggregate.columns(df.agg, aggregate.column.names)
df.agg <- subset(df.agg, select = c("Longitude", "Latitude", aggregate.column.names))
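df.agg
The key to this is the following line.
grep(paste(column.name, ".[1-9]+", sep = ""), colnames(df.my))
It returns the positions of all columns whose names contain the current value of column.name followed by a dot and a sequence of digits. For example, when column.name is A, it matches A.1, A.345, A.67, A.9798 and A.111111. Note that the dot in the pattern is unescaped and the pattern is not anchored, so it can match slightly more than that. Please check!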
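To make the behaviour of that grep() pattern concrete, here is a minimal sketch. The column names are hypothetical (in particular AU.1 is invented to show an over-match), and the stricter anchored pattern is an assumption of this sketch, not part of the answer's code.

```r
# Hypothetical column names, similar to what data.frame() produces for
# duplicated names; "AU.1" is invented to illustrate an over-match.
cols <- c("Longitude", "Latitude", "U", "U.1", "U.10", "A", "A.1", "AU.1")

column.name <- "U"

# Pattern as used in the answer: unanchored, and "." matches any character.
loose <- grep(paste(column.name, ".[1-9]+", sep = ""), cols, value = TRUE)

# Stricter variant (assumption): anchored at both ends, literal dot, any digits.
strict <- grep(paste0("^", column.name, "\\.[0-9]+$"), cols, value = TRUE)

print(loose)   # includes "AU.1" because the pattern may match anywhere in the name
print(strict)  # only the numbered duplicates of "U"
```

With single-letter origins such as A, U, L, E and W both patterns select the same columns, so the stricter form only matters if column names can contain one another.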
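Update 3
After the OP provided the data, I came up with the following. It includes a function for renaming, which is needed because the data frame has several columns with the same name. For example, this function turns the column names A, A, A, A into A, A.1, A.2, A.3.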
climate <- read.table("Data_climate.txt", header = T)
poissons <- read.table("Data_fish.txt", header = T)
traitsNA <- read.table("Data_traits.txt", header = T)
especes <- poissons [,-2]
especes2 <- especes [,-2]
especes3 <- especes2 [,-1]
colnames(especes3) <- traitsNA$Origin
especes44<-cbind(climate$Latitude,especes3)
especeOri <- cbind(climate$Longitude,especes44)
origine <- cbind(climate$ID_cellule,especeOri)
colnames(origine)[1] <- "ID_cellule"
colnames(origine)[2] <- "Longitude"
colnames(origine)[3] <- "Latitude"
colnames(especeOri)[1] <- "Longitude"
colnames(especeOri)[2] <- "Latitude"
rename.columns <- function(df)
{
unique.column.names <- unique(colnames(df))
for (unique.column.name in unique.column.names)
{
idxs.columns <- which(colnames(df) == unique.column.name)
df.tmp.with.new.col.names <- subset(df, select = idxs.columns)
colnames(df)[idxs.columns] <- colnames(df.tmp.with.new.col.names)
}
return(df)
}
especeOri <- rename.columns(especeOri)
espagg <- aggregate(. ~ Longitude + Latitude,
especeOri,
FUN = sum)
# This function sums columns whose names share the same start.
# Suppose we have a data frame with column names A, A.1, and A.2. Then,
# the values of these columns are summed row by row. So (one column per line),
# A 1 1 0
# A.1 2 1 0
# A.2 0 0 1
# becomes
# A 3 2 1
aggregate.columns <- function(df.my, column.names)
{
for (column.name in column.names) {
df.my[[column.name]] <- df.my[[column.name]] +
rowSums(subset(df.my,
select = grep(paste(column.name, ".[1-9]+",
sep = ""),
colnames(df.my))))
}
return(df.my)
}
aggregate.column.names <- c("A", "U", "L", "E", "W")
espagg <- aggregate.columns(espagg, aggregate.column.names)
# Select from the aggregated data frame (espagg), not from the raw especeOri
espagg <- subset(espagg, select = c("Longitude", "Latitude", aggregate.column.names))
HTH!
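The renaming and summing steps above can also be sketched on a tiny made-up table. This is an alternative illustration under stated assumptions, not the answer's method: it uses base R's make.unique() for the suffixes and rowsum() on the transposed matrix for the per-origin sums, and all values and the names toy, occ and origin.of.col are hypothetical.

```r
# Made-up origin labels, one per species column (duplicates on purpose).
origins <- c("A", "U", "U", "L", "E", "A", "U")

# Made-up station coordinates, one row per station.
toy <- data.frame(Longitude = c(-5.89, -5.89, -5.79),
                  Latitude  = c(35.71, 35.81, 35.81))

# Made-up presence/absence values: 3 stations x 7 species.
occ <- matrix(c(0, 0, 1, 0, 0, 1, 1,
                0, 1, 0, 0, 0, 0, 1,
                1, 1, 0, 1, 0, 1, 0),
              nrow = 3, byrow = TRUE)

# make.unique() turns A, U, U, ... into A, U, U.1, ... like the renaming step.
colnames(occ) <- make.unique(origins)

# Strip the numeric suffix again to recover the origin of every column.
origin.of.col <- sub("\\.[0-9]+$", "", colnames(occ))

# rowsum() adds up rows by group, so transpose, sum by origin, transpose back.
sums <- t(rowsum(t(occ), group = origin.of.col))

# Keep the coordinates and the number of rows unchanged.
result <- cbind(toy, sums[, c("A", "U", "L", "E"), drop = FALSE])
print(result)
```

Because the grouping is derived from the column names, no origin has to be listed by hand except when choosing the output column order.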
Answered 2020-11-29 03:07:00
Here is a tidyverse solution using the data you provided.
library(dplyr)
library(tidyr)
fish <- read.table("Data_fish.txt", header = T)
traits <- read.table("Data_traits.txt", header = T)
fish %>%
pivot_longer(-c(ID_cellule, Longitude, Latitude), names_to = "Species", values_to = "Occur") %>%
mutate(ID_cellule = factor(ID_cellule, levels = unique(ID_cellule))) %>% # use factor to fix the display order as-is
left_join(traits %>% select(Species, Origin), by = "Species") %>%
group_by(ID_cellule, Longitude, Latitude, Origin) %>%
summarise(Occur = sum(Occur)) %>%
pivot_wider(names_from = "Origin", values_from = "Occur")
Output
# A tibble: 8,154 x 8
# Groups: ID_cellule, Longitude, Latitude [8,154]
ID_cellule Longitude Latitude A E L U W
<fct> <dbl> <dbl> <int> <int> <int> <int> <int>
1 ID1 -5.89 35.7 8 10 0 178 0
2 ID2 -5.89 35.8 11 10 0 234 0
3 ID3 -5.89 36.0 9 11 0 195 0
4 ID4 -5.89 36.1 12 10 0 227 0
5 ID5 -5.89 36.2 13 17 0 268 0
6 ID6 -5.79 35.8 9 8 0 205 0
7 ID7 -5.79 35.9 8 9 0 168 0
8 ID8 -5.79 36.0 11 14 0 262 0
9 ID9 -5.79 36.1 10 10 0 193 0
10 ID10 -5.69 35.8 9 10 0 230 0
https://stackoverflow.com/questions/64989536