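R: Define a function (and/or use apply() or a for loop) to repeat a set of procedures
Language: R OS: Windows 7
I would like to know how to write a function and/or construct an apply() or for() loop statement that would let me accomplish the task described below.
I am working on a Windows 7 machine. My sessionInfo() is pasted below my question.
I have two data frames, SUBJ and ANNO. I would like to create a new data frame (Output) by performing operations on subsets of columns in SUBJ, where each column subset is defined by the result of an operation on ANNO.
Below, I first create two fake data frames, SUBJ and ANNO. Next, I create an empty Output data frame using the row names of SUBJ and the column names of ANNO.
Then I perform the desired operations for the first column of ANNO. That is: 1) I process the first column of ANNO, ANNO1, identify the set of row.names corresponding to rows where ANNO1==1, and save that set to the character vector ROWSlookup. 2) Then, for each row in SUBJ, I calculate the sum of the values in the subset of columns listed in ROWSlookup, and put the resulting sum in the ANNO1 column of Output.
The actual datasets (represented by SUBJ and ANNO) are very large. I would therefore like to write a function and/or construct an apply() or for() loop statement that lets me fill in the desired Output data frame efficiently. That is, for each column of ANNO, the function should create a ROWSlookup, calculate the sum of the values in the corresponding columns of SUBJ, and enter that sum in the corresponding Output cell.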
# CREATE FAKE SUBJ
SUBJ <- matrix(c(0,0,0,1,0,0,2,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,2,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,2,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,2,0,1,0,0,1,0,0,0,2,0,0), 10, 10)
rownames(SUBJ) <- c("subj1", "subj2", "subj3", "subj4", "subj5", "subj6", "subj7", "subj8", "subj9", "subj10")
colnames(SUBJ) <- c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7", "rs8", "rs9", "rs10")
SUBJ<- as.data.frame(SUBJ)
SUBJ
#rs1 rs2 rs3 rs4 rs5 rs6 rs7 rs8 rs9 rs10
#subj1 0 1 0 0 1 0 1 1 0 1
#subj2 0 0 0 0 0 0 0 1 1 0
#subj3 0 0 0 0 0 1 0 0 0 0
#subj4 1 1 2 1 1 0 1 0 0 1
#subj5 0 0 0 0 0 0 0 1 0 0
#subj6 0 0 0 0 0 0 0 0 0 0
#subj7 2 0 1 1 0 0 0 0 0 0
#subj8 0 1 0 0 0 0 0 1 0 2
#subj9 1 0 0 0 1 2 0 0 2 0
#subj10 0 0 0 0 0 0 0 0 0 0
# CREATE FAKE ANNO
ANNO <- matrix(c(0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,1,0,1,0),
8, 8)
length(c(0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0))
rownames(ANNO) <- c("rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7", "rs8")
colnames(ANNO) <- c("ANNO1","ANNO2","ANNO3","ANNO4","ANNO5","ANNO6","ANNO7","ANNO8")
ANNO<- as.data.frame(ANNO)
ANNO
#ANNO1 ANNO2 ANNO3 ANNO4 ANNO5 ANNO6 ANNO7 ANNO8
#rs1 0 0 0 0 0 1 1 1
#rs2 0 0 0 0 1 0 1 0
#rs3 0 1 0 1 0 0 0 1
#rs4 1 0 1 0 0 1 0 0
#rs5 1 0 0 0 0 0 0 1
#rs6 0 1 0 0 0 0 0 0
#rs7 0 0 0 0 1 0 0 1
#rs8 0 0 0 0 0 0 0 0
# CREATE EMPTY OUTPUT DATAFRAME TO HOLD THE (EVENTUAL) PROCESSED VALUES
Output<-data.frame(matrix(nrow=nrow(SUBJ), ncol=ncol(ANNO)))
# SET ROWNAMES AND COLNAMES OF OUTPUT DF
row.names(Output)<- row.names(SUBJ)
colnames(Output)<- colnames(ANNO)
Output
#ANNO1 ANNO2 ANNO3 ANNO4 ANNO5 ANNO6 ANNO7 ANNO8
#subj1 NA NA NA NA NA NA NA NA
#subj2 NA NA NA NA NA NA NA NA
#subj3 NA NA NA NA NA NA NA NA
#subj4 NA NA NA NA NA NA NA NA
#subj5 NA NA NA NA NA NA NA NA
#subj6 NA NA NA NA NA NA NA NA
#subj7 NA NA NA NA NA NA NA NA
#subj8 NA NA NA NA NA NA NA NA
#subj9 NA NA NA NA NA NA NA NA
#subj10 NA NA NA NA NA NA NA NA
# PROCESS FIRST COLUMN OF ANNO, ANNO1, IDENTIFYING THE row.names corresponding to rows where ANNO1==1
# SAVE THOSE row.names TO A VECTOR TO SERVE AS LOOKUP VALUES
ROWSlookup <- row.names(ANNO[which(ANNO$ANNO1==1),])
#[1] "rs4" "rs5"
# FOR EACH ROW IN SUBJ, CALCULATE THE SUM OF VALUES WITHIN THE COLs IN ROWSlookup LIST AND PUT THE RESULTING VALUES
# IN THE ANNO1 COL OF THE OUTPUT DF (Count_TEST)
Output$ANNO1 <- apply(SUBJ[,which(names(SUBJ) %in% ROWSlookup)],1,sum,na.rm=TRUE)
Output
#ANNO1 ANNO2 ANNO3 ANNO4 ANNO5 ANNO6 ANNO7 ANNO8
#subj1 1 NA NA NA NA NA NA NA
#subj2 0 NA NA NA NA NA NA NA
#subj3 0 NA NA NA NA NA NA NA
#subj4 2 NA NA NA NA NA NA NA
#subj5 0 NA NA NA NA NA NA NA
#subj6 0 NA NA NA NA NA NA NA
#subj7 1 NA NA NA NA NA NA NA
#subj8 0 NA NA NA NA NA NA NA
#subj9 1 NA NA NA NA NA NA NA
#subj10 0 NA NA NA NA NA NA NA
sessionInfo()
#R version 3.0.3 (2014-03-06)
#Platform: x86_64-w64-mingw32/x64 (64-bit)
#
#locale:
#[1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
#[5] LC_TIME=English_Canada.1252
#
#attached base packages:
#[1] stats4 parallel splines grid stats graphics grDevices utils datasets methods base
#
#other attached packages:
#[1] QuantPsyc_1.5 boot_1.3-13 perturb_2.05 RCurl_1.95-4.5 bitops_1.0-6 car_2.0-22
#[7] reprtree_0.6 plotrix_3.5-10 rpart.plot_1.4-5 sqldf_0.4-7.1 RSQLite.extfuns_0.0.1 RSQLite_1.0.0
#[13] gsubfn_0.6-6 proto_0.3-10 XML_3.98-1.1 RMySQL_0.9-3 DBI_0.3.1 mlbench_2.1-1
#[19] polycor_0.7-8 sfsmisc_1.0-26 quantregForest_0.2-3 tree_1.0-35 maptree_1.4-7 cluster_1.15.3
#[25] mice_2.22 VIM_4.0.0 colorspace_1.2-4 randomForest_4.6-10 ROCR_1.0-5 gplots_2.15.0
#[31] caret_6.0-37 partykit_0.8-0 biomaRt_2.18.0 NCBI2R_1.4.6 snpStats_1.12.0 betareg_3.0-5
#[37] arm_1.7-07 lme4_1.1-7 Rcpp_0.11.3 Matrix_1.1-4 nlme_3.1-118 mvtnorm_1.0-1
#[43] taRifx_1.0.6 sos_1.3-8 brew_1.0-6 R.utils_1.34.0 R.oo_1.18.0 R.methodsS3_1.6.1
#[49] rattle_3.3.0 jsonlite_0.9.13 httpuv_1.3.2 httr_0.5 gmodels_2.15.4.1 ggplot2_1.0.0
#[55] JGR_1.7-16 iplots_1.1-7 JavaGD_0.6-1 party_1.0-18 modeltools_0.2-21 strucchange_1.5-0
#[61] sandwich_2.3-2 zoo_1.7-11 pROC_1.7.3 e1071_1.6-4 psych_1.4.8.11 gtools_3.4.1
#[67] functional_0.6 modeest_2.1 stringi_0.3-1 languageR_1.4.1 utility_1.3 data.table_1.9.4
#[73] xlsx_0.5.7 xlsxjars_0.6.1 rJava_0.9-6 snow_0.3-13 doParallel_1.0.8 iterators_1.0.7
#[79] foreach_1.4.2 reshape2_1.4 reshape_0.8.5 plyr_1.8.1 xtable_1.7-4 stringr_0.6.2
#[85] foreign_0.8-61 Hmisc_3.14-6 Formula_1.1-2 survival_2.37-7 class_7.3-11 MASS_7.3-35
#[91] nnet_7.3-8 Revobase_7.2.0 RevoMods_7.2.0 RevoScaleR_7.2.0 lattice_0.20-27 rpart_4.1-5
#
#loaded via a namespace (and not attached):
#[1] abind_1.4-0 acepack_1.3-3.3 BiocGenerics_0.8.0 BradleyTerry2_1.0-5 brglm_0.5-9 caTools_1.17.1 chron_2.3-45
#[8] coda_0.16-1 codetools_0.2-9 coin_1.0-24 DEoptimR_1.0-2 digest_0.6.4 flexmix_2.3-12 gdata_2.13.3
#[15] glmnet_1.9-8 gtable_0.1.2 KernSmooth_2.23-13 latticeExtra_0.6-26 lmtest_0.9-33 minqa_1.2.4 munsell_0.4.2
#[22] nloptr_1.0.4 pkgXMLBuilder_1.0 png_0.1-7 RColorBrewer_1.0-5 revoIpe_1.0 robustbase_0.92-2 scales_0.2.4
#[29] sp_1.0-16 tcltk_3.0.3 tools_3.0.3 vcd_1.3-2
Posted on 2014-12-23 20:36:32
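Here, we can first create a row/column numeric index (indx) from the comparison ANNO==1, using which with the argument arr.ind=TRUE. indx also carries the same row names as the ANNO dataset. Split the row names of indx by the second column of indx (the column index) to get a list of row names. These row names can be used as a column index to subset SUBJ, because they are also column names of SUBJ. For example, SUBJ[c('rs1','rs2')] returns a subset of SUBJ containing only those two columns. Similarly, SUBJ[x], where x is one element of the split row names, subsets SUBJ to the matching columns. Then use rowSums on the subset dataset.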
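To make the intermediate objects concrete, here is a minimal sketch on a tiny made-up dataset. The names S, A, idx, groups and sums, and the values in them, are hypothetical and are not the question's data; only which(), split(), lapply() and rowSums() come from the approach described above.

```r
# Tiny hypothetical data: 3 subjects, 3 markers, 2 annotation columns
S <- data.frame(rs1 = c(1, 0, 2), rs2 = c(0, 1, 1), rs3 = c(1, 1, 0),
                row.names = c("subj1", "subj2", "subj3"))
A <- data.frame(ANNO1 = c(1, 0, 1), ANNO2 = c(0, 1, 0),
                row.names = c("rs1", "rs2", "rs3"))

# Row/column positions of every cell equal to 1;
# the row names of idx are the marker names taken from A
idx <- which(A == 1, arr.ind = TRUE)

# One character vector of marker names per annotation column
groups <- split(row.names(idx), idx[, 2])

# Subset S by each group of column names and sum across each row
sums <- lapply(groups, function(x) rowSums(S[x], na.rm = TRUE))
```

Printing idx, groups and sums in turn shows how each annotation column is turned into a set of SUBJ column names and then into one sum per subject.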
indx <- which(ANNO==1,arr.ind=TRUE)
Output[] <- lapply(split(row.names(indx), indx[,2]),
                   function(x) rowSums(SUBJ[x], na.rm=TRUE))
Alternatively, we can use Map instead of lapply. The idea is the same: each element of y is one set of split row names, and x is the whole SUBJ dataset.
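Output[] <- Map(function(x,y) rowSums(x[y], na.rm=TRUE),
                list(SUBJ), split(row.names(indx), indx[,2]))
A data.frame is also a list, but one whose elements all have the same length. So, by assigning into Output[] (which has the same number of rows as SUBJ and one column per column of ANNO), the result remains a data.frame and the structure of Output is kept intact.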
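One caveat worth checking on real data: an ANNO column that contains no 1 at all produces no group in the split, so the list would have fewer elements than Output has columns and the assignment into Output[] would no longer line up column by column. A minimal sketch of a variant that loops over the ANNO column names instead, so every column gets a result (zeros when nothing is flagged). The function name annoRowSums and the toy data S and A are hypothetical and not part of the original answer.

```r
# Hypothetical helper: for each column of anno, sum the columns of subj whose
# names match the row names of anno flagged with 1
annoRowSums <- function(subj, anno) {
  # Empty result with the same shape as the question's Output
  out <- data.frame(matrix(NA_real_, nrow = nrow(subj), ncol = ncol(anno)))
  row.names(out) <- row.names(subj)
  names(out) <- names(anno)
  # Assigning a list into out[] fills the columns in order and keeps out a data.frame
  out[] <- lapply(names(anno), function(a) {
    # %in% treats NA as "not flagged"; intersect() ignores markers missing from subj
    keep <- intersect(row.names(anno)[anno[[a]] %in% 1], names(subj))
    if (length(keep) == 0L) return(rep(0, nrow(subj)))
    # subj[keep] stays a data.frame even when only one column is selected
    rowSums(subj[keep], na.rm = TRUE)
  })
  out
}

# Toy data: ANNO2 flags no marker at all
S <- data.frame(rs1 = c(1, 0, 2), rs2 = c(0, 1, 1), rs3 = c(1, 1, 0),
                row.names = c("subj1", "subj2", "subj3"))
A <- data.frame(ANNO1 = c(1, 0, 1), ANNO2 = c(0, 0, 0), ANNO3 = c(0, 1, 0),
                row.names = c("rs1", "rs2", "rs3"))
res <- annoRowSums(S, A)
```

With the question's data the call would be annoRowSums(SUBJ, ANNO). Using list-style indexing subj[keep] rather than subj[, keep] avoids the drop to a plain vector when an annotation flags a single marker, which would otherwise break a row-wise apply().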
https://stackoverflow.com/questions/27627505