我需要为x和y编写一个算法,并进行两个样本t-测试(不使用内置函数)。
x = c(2,4,6,8,9,10,12,14)
y = c(3,5,7,9,12,13,15,18)适用于两个样本x和y的双样本t检验。这个函数的参数应该包括: x,y,假设的平均差delta0,以及一个指示左尾、右尾或双尾检验的选项。
我怎么用R做这件事?我需要一个代码,而不仅仅是内置功能。
到目前为止,我已经这样做了,但是我需要我的t_test函数来返回两个样本的样本大小和样本平均值,它的自由度和p值。
x = c(2,4,6,8,9,10,12,14)
y = c(3,5,7,9,12,13,15,18)
nx = length(x)
ny = length(y)
sp = sqrt(((nx-1)*var(x)+(ny-1)*var(y))/(nx+ny-2))
s1 = sp*sqrt(1/nx+1/ny)
mu0 = 0
t = (sample.mean(x)-sample.mean(y)-mu0) / s1
t这是两个数据集。文件“expr_data”包含17名乳腺癌患者的微阵列基因表达数据,每一组的ID字符串以“GSM”开头。该数据文件中的每一行代表一个基因(Probeset)。这17名患者属于三个不同的治疗组:对照组、Treatment1和Treatment2,其组成员在"group_data“中指定。
数据集1
> head(expr_data)
GSM119944 GSM119945 GSM119946 GSM119947 GSM119948 GSM119949
1007_s_at 11.376519 11.826743 11.123022 11.743439 12.172961 11.522009
1053_at 7.270398 7.534450 7.169297 7.730833 6.728914 7.033900
117_at 8.172823 8.350568 8.216073 8.052177 7.940714 8.122496
121_at 10.064195 11.193688 9.846189 10.549992 10.172722 10.357284
1255_g_at 6.256425 6.830607 5.825010 6.098157 6.104971 5.818458
1294_at 9.347887 9.540260 9.229501 9.464348 9.764903 9.962180
GSM119950 GSM119951 GSM119952 GSM119953 GSM119954 GSM119955
1007_s_at 11.288407 11.364544 11.783231 12.102697 12.141934 12.141672
1053_at 8.152550 7.357942 7.811469 6.704366 7.723678 7.607720
117_at 8.246269 7.597745 8.809971 7.299070 7.808597 8.390707
121_at 11.361081 10.446139 11.165541 10.285435 10.123556 10.532735
1255_g_at 6.355995 6.311312 7.366574 5.577412 4.570794 5.046956
1294_at 9.300450 9.230649 9.783263 8.749285 9.466965 9.653450
GSM119956 GSM119957 GSM119958 GSM119959 GSM119960 GeneSymbol
1007_s_at 11.541161 12.069206 11.529456 9.692066 11.242988 DDR1
1053_at 6.904579 6.837490 7.437899 7.608960 6.704648 RFC2
117_at 7.653514 8.680945 8.050873 9.242006 8.253535 HSPA6
121_at 10.379335 10.487541 10.542419 10.248043 10.207259 PAX8
1255_g_at 6.561945 5.897955 5.402725 5.957542 6.201037 GUCA1A
1294_at 9.076623 9.827835 9.096732 9.441370 9.102000 UBA7
PublicID
1007_s_at U48705
1053_at M87338
117_at X51757
121_at X69699
1255_g_at L36861
1294_at L13852数据集2
> groups_data
PatientID TreatmentGroup
1 GSM119946 Control
2 GSM119948 Control
3 GSM119951 Control
4 GSM119955 Control
5 GSM119956 Control
6 GSM119959 Control
7 GSM119947 Treatment1
8 GSM119950 Treatment1
9 GSM119952 Treatment1
10 GSM119953 Treatment1
11 GSM119957 Treatment1
12 GSM119958 Treatment1
13 GSM119944 Treatment2
14 GSM119945 Treatment2
15 GSM119949 Treatment2
16 GSM119954 Treatment2
17 GSM119960 Treatment2使用我正在编写的两个样本t检验函数,我需要单独测试所有基因(对照病人组和治疗1患者组),一个无效假设是mu_control=mu_treat1,另一个假设是mu_control < mu_treat1。
以下是合并的两个数据集(如果这有帮助的话)
总目(Groups_expr) GSM119944 GSM119945 GSM119946 GSM119947 GSM119948 GSM119949 GSM119950 1 11.376519 11.826743 11.123022 11.743439 12.172961 11.288407 2 7.270398 7.169297 7.730833 6.728914 7.033900 8.152550 3 8.172823 8.350568 8.216073 8.052177 7.940714 8.122496 8.246269 4 10.064195 11.1988 9.846189 10.549992 10.1722 10.357284 11.361081 5 6.256425 6.830607 5.830607 5.825010 6.098157 6.104971 5。818458 6.355995 6 9.347887 9.229501 9.464348 9.764903 9.300450 GSM119951 GSM119952 GSM119953 GSM119954 GSM119955 GSM119956 GSM119957 1 11.364544 11.783231 12.102697 12.141934 12.141672 11.541161 12.069206 2 7.357942 7.811469 6.704366 7.723678 7.607720 6.9079 6.837490 3 7.597745 8.809971 7.299070 7.8097 8.3907 7.6514 8.680945 4 10.446139 11.165541 10.285435 10.123510.123556 10.379335 10.2735 10.379335 10.487541 5 6.311312 7.366574 5.577412 4.570794 5.046956 6.561945 5.897955 6 9.230649 9.783263 8.749285 9.466965 9.076623 9.827835 GSM119958 GSM119959 GSM119960 GeneSymbol PublicID PatientID TreatmentGroup 1 11.529456 11.242966 11.242988 DDR1 U48705 GSM119946 Control 2 7.437899 7.608960 6.704648 RFC2 M87338 GSM119946 Control 3 8.050873 9.242006 8.253535 HSPA6 X51757 GSM119946 Control 4 10.54241910.248043 10.207259 PAX8 X69699 GSM119946 Control 5 5.402725 5.957542 6.201037 GUCA1A L36861 GSM119946 Control 6 9.096732 9.441370 9.102000 UBA7 L13852 GSM119946控件
尾(Groups_expr) GSM119944 GSM119945 GSM119946 GSM119947 GSM119948 GSM119949 GSM119950 378806 4.671951 4.731546 3.364612 2.893266 2.450373 4.375824 378807 2.954090 2.954090 4.653969 3.685073 3.9202165 5.387476 378808 3.159816 5.216588 3.989162 5.387770 5.579206 5.9640708 4.796789 378809 1.464918 1.892150 1.398225 1.780359 1.477039 0.8636322 5.2179 378810 3.567588 3.6488 3.6495 5.6495 5.0016 3.565525 4190032 3.2998454 4.903368 2.959766 3.164650 1.462571 2.681616 2.646549 3.3482051 GSM119951 GSM119952 GSM119953 GSM119954 GSM119955 GSM119956 GSM119957 378806 3.317340 5.957501 3.072479 3.395843 3.183937 3.332907 378807 4.008853 3.8083 3.8073 3.356645 3.979238 3.327875 3.143567 3.500472 378808 2.468878 4.937979 3.568130 3.105428 3.978494 3.431517 5.48591 378809 4.898809 4.893662 2.465712 1.7586 1.632630 1。GSM119958 GSM119959 GSM119960 GeneSymbol PublicID PatientID 378806 4.230085 3.740862 2.963901 AFFX-3 GSM119960 378807 3.405755 3.703066 3.421292 AFFX-ThrX-5 GSM119960 378808 4.3355 5.543589 3.771600GSM119960 378809 4.217012 2.025573 2.080592 AFFX-TrpnX-3 GSM119960 378810 5.254337 3.054821 4.731657 AFFX-TrpnX-5 GSM119960 378811 2.621702 1.619972 AFFX-M GSM119960 TreatmentGroup 378806 Treatment2 378807 Treatment2 378808 Treatment2 378809 Treatment2 378810 Treatment2 378811 Treatment2
有378811行,我需要测试所有这些行(基因)来比较Treatment1和对照病人(GSM*是病人Id)。
发布于 2015-12-08 03:35:51
有关公式引用,请参见这里。
x <- c(2,4,6,8,9,10,12,14)
y <- c(3,5,7,9,12,13,15,18)
tt <- function(x,y,mu0=0,ts=TRUE) { # two-sided t-test
nx <- length(x)
ny <- length(y)
sp <- sqrt(((nx-1)*var(x)+(ny-1)*var(y))/(nx+ny-2))
t <- (mean(x)-mean(y)-mu0) / (sp*sqrt(1/nx+1/ny))
df <- (var(x)/nx + var(y)/ny)^2 /
((var(x)/nx)^2/(nx-1) + (var(y)/ny)^2/(ny-1))
sample_sizes <- c(nx, ny)
names(sample_sizes) <- c("x","y")
sample_means <- c(mean(x), mean(y))
names(sample_means) <- c("x", "y")
pvalue <- ifelse(ts,2,1)*(1-pt(abs(t),df=df))
list(sample_sizes=sample_sizes, sample_means=sample_means,
df=df, pvalue=pvalue)
}
tt(x,y)
# $sample_sizes
# x y
# 8 8
#
# $sample_means
# x y
# 8.125 10.250
#
# $df
# [1] 13.21697
#
# $pvalue
# [1] 0.3737564编辑
下面是如何在示例中使用此函数的方法。这是第一个数据集。
expr_data <- data.frame(matrix(
c("1007_s_at", "11.376519", "11.826743", "11.123022", "11.743439", "12.172961",
"11.522009", "11.288407", "11.364544", "11.783231", "12.102697", "12.141934",
"12.141672", "11.541161", "12.069206", "11.529456", "9.692066", "11.242988",
"DDR1", "U48705", "1053_at", "7.270398", "7.534450", "7.169297", "7.730833",
"6.728914", "7.033900", "8.152550", "7.357942", "7.811469", "6.704366",
"7.723678", "7.607720", "6.904579", "6.837490", "7.437899", "7.608960",
"6.704648", "RFC2", "M87338", "117_at", "8.172823", "8.350568", "8.216073",
"8.052177", "7.940714", "8.122496", "8.246269", "7.597745", "8.809971",
"7.299070", "7.808597", "8.390707", "7.653514", "8.680945", "8.050873",
"9.242006", "8.253535", "HSPA6", "X51757", "121_at", "10.064195",
"11.193688", "9.846189", "10.549992", "10.172722", "10.357284", "11.361081",
"10.446139", "11.165541", "10.285435", "10.123556", "10.532735",
"10.379335", "10.487541", "10.542419", "10.248043", "10.207259", "PAX8",
"X69699", "1255_g_at", "6.256425", "6.830607", "5.825010", "6.098157",
"6.104971", "5.818458", "6.355995", "6.311312", "7.366574", "5.577412",
"4.570794", "5.046956", "6.561945", "5.897955", "5.402725", "5.957542",
"6.201037", "GUCA1A", "L36861", "1294_at", "9.347887", "9.540260",
"9.229501", "9.464348", "9.764903", "9.962180", "9.300450", "9.230649",
"9.783263","8.749285", "9.466965", "9.653450", "9.076623", "9.827835",
"9.096732", "9.441370", "9.102000", "UBA7", "L13852"), nrow=6, byrow=TRUE))
row.names(expr_data) <- expr_data[,1]
expr_data <- expr_data[,-1]
names(expr_data) <- c("GSM119944", "GSM119945", "GSM119946", "GSM119947",
"GSM119948", "GSM119949", "GSM119950", "GSM119951",
"GSM119952", "GSM119953", "GSM119954", "GSM119955",
"GSM119956", "GSM119957", "GSM119958", "GSM119959",
"GSM119960", "GeneSymbol", "PublicID")
expr_data[,1:17] <- sapply(expr_data[,1:17], function(x)
as.numeric(as.character(x))) 和第二个数据集。
groups_data <- data.frame(
PatientID=c('GSM119946','GSM119948','GSM119951','GSM119955','GSM119956',
'GSM119959','GSM119947','GSM119950','GSM119952','GSM119953',
'GSM119957','GSM119958','GSM119944','GSM119945','GSM119949',
'GSM119954','GSM119960'),
TreatmentGroup = c(rep('Control',6), rep('Treatment1',6),
rep('Treatment2',5))
)然后,执行适当的测试。
control_index <- which(groups_data$TreatmentGroup=="Control")
treatment_index <- which(groups_data$TreatmentGroup=="Treatment1")
# assume length(control_index) = length(treatment_index)
for(i in 1:length(control_index)) {
control_group <- expr_data[,groups_data$PatientID[control_index[i]]]
treatment_group <- expr_data[,groups_data$PatientID[treatment_index[i]]]
cat("T-test for", as.character(groups_data$PatientID[control_index[i]]), "and",
as.character(groups_data$PatientID[treatment_index[i]]), "\n")
result <- tt(control_group, treatment_group, 0, FALSE)
cat(" sample sizes:", as.numeric(result$sample_sizes),"\n")
cat(" sample means:", as.numeric(result$sample_means),"\n")
cat(" degrees of freedom:", as.numeric(result$df),"\n")
cat(" p-value:", as.numeric(result$pvalue),"\n\n")
}输出
T-test for GSM119946 and GSM119947
sample sizes: 6 6
sample means: 8.568182 8.939824
degrees of freedom: 9.947607
p-value: 0.3759984
T-test for GSM119948 and GSM119950
sample sizes: 6 6
sample means: 8.814198 9.117459
degrees of freedom: 9.744115
p-value: 0.4053794
T-test for GSM119951 and GSM119952
sample sizes: 6 6
sample means: 8.718055 9.453342
degrees of freedom: 9.916609
p-value: 0.2560534
T-test for GSM119955 and GSM119953
sample sizes: 6 6
sample means: 8.89554 8.453044
degrees of freedom: 9.996676
p-value: 0.38034
T-test for GSM119956 and GSM119957
sample sizes: 6 6
sample means: 8.686193 8.966829
degrees of freedom: 9.792449
p-value: 0.4132705
T-test for GSM119959 and GSM119958
sample sizes: 6 6
sample means: 8.698331 8.676684
degrees of freedom: 9.134459
p-value: 0.492472 编辑2
如果要测试行,可以使用以下代码。
control_cols <- groups_data$PatientID[which(groups_data$TreatmentGroup=="Control")]
treatment_cols <- groups_data$PatientID[which(groups_data$TreatmentGroup=="Treatment1")]
nrows <- dim(expr_data)[1]
for(i in 1:nrows) {
control_group <- expr_data[i,control_cols]
treatment_group <- expr_data[i,treatment_cols]
cat("T-test for control vs treatment (", row.names(treatment_group), ")\n")
result <- tt(as.numeric(control_group), as.numeric(treatment_group), 0, FALSE)
cat(" sample sizes:", as.numeric(result$sample_sizes),"\n")
cat(" sample means:", as.numeric(result$sample_means),"\n")
cat(" degrees of freedom:", as.numeric(result$df),"\n")
cat(" p-value:", as.numeric(result$pvalue),"\n\n")
}https://stackoverflow.com/questions/34147163
复制相似问题