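I have a list of birds that are related to each other on a continuum from 0.0 (unrelated) to 1.0 (identical twins). Above some threshold (e.g. 0.25), they are too closely related for downstream analysis, and I want to remove one of them from the dataset. However, an individual is sometimes related to several other birds, and in my dataset (~1700) this gets complicated quickly. Does anyone have code that removes related individuals in a way that minimizes the loss from the dataset? In the example data below, Ind001 is related to both Ind002 and Ind004, but I want to remove only Ind001 rather than removing both Ind002 and Ind004:
Example data: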
pair.no ind1.id ind2.id relatedness
1038723 Ind001 Ind002 1.0
1038895 Ind001 Ind003 0.2
1280057 Ind001 Ind004 0.9
1389905 Ind002 Ind003 0.0
1390069 Ind002 Ind004 0.1
1390069 Ind003 Ind004 0.1
1390069 MNP002 MSW004 0.1
Thanks, Steve
Posted on 2017-11-22 06:31:39
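How about finding the most related dyad and then removing whichever of its individuals has the highest mean relatedness to everyone else, repeating until you reach your threshold? Assuming data = the data you provided above, one approach might be: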
threshold = 0.25 # threshold for relatedness values
pruned_data = data
# stop when no pair exceeds the threshold (or no pairs are left)
while(nrow(pruned_data) > 0 && max(pruned_data$relatedness) > threshold){
  # find the maximum relatedness estimate
  maxrelate = max(pruned_data$relatedness)
  # find the records that are at the maximum relatedness
  temp = pruned_data[pruned_data$relatedness==maxrelate,,drop=FALSE]
  # collect the individuals in the "max" group (as character, so factor columns also work)
  maxids = unique(c(as.character(temp$ind1.id), as.character(temp$ind2.id)))
  maxindvs = data.frame(id = maxids, mrelate = rep(0, length(maxids)), stringsAsFactors = FALSE)
  # find the average relatedness of each of those individuals over all remaining pairs
  for(i in seq_len(nrow(maxindvs))){
    itemp = rbind(pruned_data[as.character(pruned_data$ind1.id)==maxindvs$id[i],,drop=FALSE],
                  pruned_data[as.character(pruned_data$ind2.id)==maxindvs$id[i],,drop=FALSE])
    maxindvs$mrelate[i] = mean(itemp$relatedness)
  }
  # remove the individual that is most related to all others (the first one if there is a tie)
  toremove = maxindvs$id[which.max(maxindvs$mrelate)]
  pruned_data = pruned_data[as.character(pruned_data$ind1.id)!=toremove & as.character(pruned_data$ind2.id)!=toremove,,drop=FALSE]
}
https://stackoverflow.com/questions/46326924
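As a possible alternative to the loop above, here is a minimal sketch that aims more directly at the "minimize the loss" goal from the question. It counts how many above-threshold pairs each individual appears in and repeatedly drops the individual with the most. The helper name prune_related and its return value are hypothetical (they are not from the original answer), and the example data frame is typed in by hand from the question with pair.no left out. This is a greedy heuristic, so it is not guaranteed to find the smallest possible set of removals.

```r
# Example data from the question, rebuilt by hand (pair.no omitted)
data <- data.frame(
  ind1.id = c("Ind001", "Ind001", "Ind001", "Ind002", "Ind002", "Ind003", "MNP002"),
  ind2.id = c("Ind002", "Ind003", "Ind004", "Ind003", "Ind004", "Ind004", "MSW004"),
  relatedness = c(1.0, 0.2, 0.9, 0.0, 0.1, 0.1, 0.1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: greedily drop the individual that takes part in the
# most above-threshold pairs, until no above-threshold pair remains.
prune_related <- function(pairs, threshold = 0.25) {
  removed <- character(0)
  # keep only the pairs that are too closely related
  high <- pairs[pairs$relatedness > threshold, , drop = FALSE]
  while (nrow(high) > 0) {
    # number of above-threshold pairs each individual is involved in
    counts <- table(c(as.character(high$ind1.id), as.character(high$ind2.id)))
    # individual with the most such pairs (first in alphabetical order on ties)
    worst <- names(counts)[which.max(counts)]
    removed <- c(removed, worst)
    # dropping that individual resolves every pair it was part of
    high <- high[high$ind1.id != worst & high$ind2.id != worst, , drop = FALSE]
  }
  all_ids <- unique(c(as.character(pairs$ind1.id), as.character(pairs$ind2.id)))
  list(removed = removed, kept = setdiff(all_ids, removed))
}

result <- prune_related(data, threshold = 0.25)
result$removed  # individuals to drop
result$kept     # individuals that stay in the dataset
```

On the example data this removes only Ind001, because both above-threshold pairs (Ind001/Ind002 and Ind001/Ind004) go through it. Unlike the loop above, it returns a list of individuals rather than a pruned pair table, which may be more convenient for subsetting the main dataset.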