下面是示例数据,在genomicmatrix中,每一行对应于一个基因("sample"),每个细胞对应于一个病人的该基因的值(以"TCGA-__-____-__"格式命名)。(问题仍在下文)
genomicmatrix <- data.frame("sample" = c("BIX","HEF","TUR","ZOP","VAG"),
"TCGA-K4-6303-01" = runif(5, -1, 1),
"TCGA-DM-A28E-01" = runif(5, -1, 1),
"TCGA-AY-6197-01" = runif(5, -1, 1),
"TCGA-F4-6703-01" = runif(5, -1, 1),
"TCGA-HB-KH8H-01" = runif(5, -1, 1),
"TCGA-Y7-PIK2-01" = runif(5, -1, 1),
"TCGA-A6-5657-01" = runif(5, -1, 1))
colnames(genomicmatrix) <- gsub("[.]", "_",colnames(genomicmatrix))
sample = NULL
sample <- genomicmatrix$sample
genomicmatrix$sample = NULL
means = NULL
for(z in 1:nrow(genomicmatrix)) {
means[z] <- rowMeans(genomicmatrix[z,])
}
genemeans <- data.frame(sample, means)所以,在找到上面每一行(基因)的平均值之后,我提取出了这个基因的值大于这个基因的平均值的病人的名字。对于每个基因的“大于”患者,都会在一个名为high的元素中列出该基因(例如,对于第四个基因,“大于”的患者出现在high[[4]]中)。“小于”患者的情况也是如此,他们使用一个名为“low”的元素。
high = NULL
low = NULL
high <- list(list())
low <- list(list())
uplist = NULL
downlist = NULL
for (i in 1:nrow(genomicmatrix)) {
uplist = NULL
downlist = NULL
for (y in seq_along(genomicmatrix)) {
uplist[y] <- ifelse(genomicmatrix[i,y] > genemeans$means[i], names(genomicmatrix[y]), "")
downlist[y] <- ifelse(genomicmatrix[i,y] < genemeans$means[i], names(genomicmatrix[y]), "")
high[[i]] <- uplist
low[[i]] <- downlist
}
}所以,对于每个基因,我把病人分成“高表达”和“低表达”两类。例如,出现在low[[3]]中的患者是那些第三种基因("TUR")的表达低于该基因平均水平的患者。下面,我有一个转换表patientID -到生存时间,以天为单位。
survival = NULL
survival$sampleID <- c("TCGA-K4-6303-01", "TCGA-DM-A28E-01", "TCGA-AY-6197-01", "TCGA-F4-6703-01", "TCGA-HB-KH8H-01", "TCGA-Y7-PIK2-01", "TCGA-A6-5657-01")
survival$X_OS <- c(256, 26, 88, 491, 553, 177, 732)
survival$sampleID <- chartr("-", "_", survival$sampleID)我想,从这个设置中,提取每个基因的日志秩测试值。例如,对于基因1 ("BIX"),鉴于高表达和低表达的Kaplan生存曲线(即high[[1]]对low[[1]]),我希望从这两个载体的对数秩测试中提取相应的p值(回答以下问题:高表达和低表达患者的生存结果是否存在显著差异?)。一旦这个pvalue被推导出来,它当然会转移到下一个基因上。
发布于 2015-03-03 18:44:12
(如果您只要求一个工具来执行操作或提供统计建议,那么StackOverflow可能不是这个问题的合适位置。)
尽管如此,我还是建议对您的数据和代码的格式进行一些改进,这将有助于在R中实现您的目标。如果您没有明显的内存限制,可以将您的“基因组矩阵”转换为长格式的"data.frame":
longDF = reshape(genomicmatrix, direction = "long", idvar = "sample",
varying = list(2:8), times = colnames(genomicmatrix[-1]),
timevar = "ID", v.names = "value")
row.names(longDF) = NULL
head(longDF)
# sample ID value
#1 BIX TCGA_K4_6303_01 -0.4811441
#2 HEF TCGA_K4_6303_01 -0.2665017
#3 TUR TCGA_K4_6303_01 0.8367469
#4 ZOP TCGA_K4_6303_01 -0.5868480
#5 VAG TCGA_K4_6303_01 -0.0319600
#6 BIX TCGA_DM_A28E_01 0.3435170然后你可以找出哪些病人的表达高于和低于平均水平,并创建一个"data.frame":
exprs = do.call(rbind,
lapply(split(longDF, longDF$sample),
function(x) {
x$expr = ifelse(findInterval(x$value, mean(x$value)) == 1,
"high",
"low")
x
}))
row.names(exprs) = NULL
head(exprs)
# sample ID value expr
#1 BIX TCGA_K4_6303_01 -0.4811441 low
#2 BIX TCGA_DM_A28E_01 0.3435170 high
#3 BIX TCGA_AY_6197_01 0.2269158 high
#4 BIX TCGA_F4_6703_01 -0.8283441 low
#5 BIX TCGA_HB_KH8H_01 0.4024671 high
#6 BIX TCGA_Y7_PIK2_01 -0.2979979 low然后添加"survival$X_OS":
exprs$X_OS = survival$X_OS[match(exprs$ID, survival$sampleID)]
head(exprs)
# sample ID value expr X_OS
#1 BIX TCGA_K4_6303_01 -0.4811441 low 256
#2 BIX TCGA_DM_A28E_01 0.3435170 high 26
#3 BIX TCGA_AY_6197_01 0.2269158 high 88
#4 BIX TCGA_F4_6703_01 -0.8283441 low 491
#5 BIX TCGA_HB_KH8H_01 0.4024671 high 553
#6 BIX TCGA_Y7_PIK2_01 -0.2979979 low 177然后,假设您有一个函数log_rank_test,它接受两个向量并输出一个"p.value“,您可以使用以下内容:
#lapply(split(exprs[c("expr", "X_OS")], exprs$sample),
# function(x) log_rank_test(x$X_OS[x$expr == "high"], x$X_OS[x$expr == "low"]))我正在尝试一种"data.table“方法,尽管它可能不是惯用的,也可能是可以改进的,因为我不熟悉它:
library(data.table)
library(reshape2)
DT = as.data.table(genomicmatrix)
longDT = melt(DT, "sample", variable.name = "ID")
longDT[, expr := ifelse(findInterval(value, mean(value)) == 1, "high", "low"), by = sample]
longDT[, X_OS := survival$X_OS[match(ID, survival$sampleID)]]
head(longDT)
# sample ID value expr X_OS
#1: BIX TCGA_K4_6303_01 -0.4811441 low 256
#2: HEF TCGA_K4_6303_01 -0.2665017 low 256
#3: TUR TCGA_K4_6303_01 0.8367469 high 256
#4: ZOP TCGA_K4_6303_01 -0.5868480 low 256
#5: VAG TCGA_K4_6303_01 -0.0319600 low 256
#6: BIX TCGA_DM_A28E_01 0.3435170 high 26然后,运行log_rank_test函数,如下所示:
#longDT[, log_rank_test(X_OS[expr == "high"], X_OS[expr == "low"]), by = sample]https://stackoverflow.com/questions/28837811
复制相似问题