文章/答案/技术大牛

发布

社区首页 >问答首页 >如何为独立的2样本t检验(x，y)编写代码

问如何为独立的2样本t检验(x，y)编写代码
EN

Stack Overflow用户

提问于 2015-12-08 02:53:19

回答 1查看 263关注 0票数 0

我需要为x和y编写一个算法，并进行两个样本t-测试(不使用内置函数)。

x = c(2,4,6,8,9,10,12,14)
y = c(3,5,7,9,12,13,15,18)

适用于两个样本x和y的双样本t检验。这个函数的参数应该包括: x，y，假设的平均差delta0，以及一个指示左尾、右尾或双尾检验的选项。

我怎么用R做这件事？我需要一个代码，而不仅仅是内置功能。

到目前为止，我已经这样做了，但是我需要我的t_test函数来返回两个样本的样本大小和样本平均值，它的自由度和p值。

x = c(2,4,6,8,9,10,12,14)
y = c(3,5,7,9,12,13,15,18)
nx = length(x)
ny = length(y)
sp = sqrt(((nx-1)*var(x)+(ny-1)*var(y))/(nx+ny-2))
s1 = sp*sqrt(1/nx+1/ny)
mu0 = 0
t = (sample.mean(x)-sample.mean(y)-mu0) / s1
t

这是两个数据集。文件“expr_data”包含17名乳腺癌患者的微阵列基因表达数据，每一组的ID字符串以“GSM”开头。该数据文件中的每一行代表一个基因(Probeset)。这17名患者属于三个不同的治疗组:对照组、Treatment1和Treatment2，其组成员在"group_data“中指定。

数据集1

> head(expr_data)
          GSM119944 GSM119945 GSM119946 GSM119947 GSM119948 GSM119949
1007_s_at 11.376519 11.826743 11.123022 11.743439 12.172961 11.522009
1053_at    7.270398  7.534450  7.169297  7.730833  6.728914  7.033900
117_at     8.172823  8.350568  8.216073  8.052177  7.940714  8.122496
121_at    10.064195 11.193688  9.846189 10.549992 10.172722 10.357284
1255_g_at  6.256425  6.830607  5.825010  6.098157  6.104971  5.818458
1294_at    9.347887  9.540260  9.229501  9.464348  9.764903  9.962180
          GSM119950 GSM119951 GSM119952 GSM119953 GSM119954 GSM119955
1007_s_at 11.288407 11.364544 11.783231 12.102697 12.141934 12.141672
1053_at    8.152550  7.357942  7.811469  6.704366  7.723678  7.607720
117_at     8.246269  7.597745  8.809971  7.299070  7.808597  8.390707
121_at    11.361081 10.446139 11.165541 10.285435 10.123556 10.532735
1255_g_at  6.355995  6.311312  7.366574  5.577412  4.570794  5.046956
1294_at    9.300450  9.230649  9.783263  8.749285  9.466965  9.653450
          GSM119956 GSM119957 GSM119958 GSM119959 GSM119960 GeneSymbol
1007_s_at 11.541161 12.069206 11.529456  9.692066 11.242988       DDR1
1053_at    6.904579  6.837490  7.437899  7.608960  6.704648       RFC2
117_at     7.653514  8.680945  8.050873  9.242006  8.253535      HSPA6
121_at    10.379335 10.487541 10.542419 10.248043 10.207259       PAX8
1255_g_at  6.561945  5.897955  5.402725  5.957542  6.201037     GUCA1A
1294_at    9.076623  9.827835  9.096732  9.441370  9.102000       UBA7
          PublicID
1007_s_at   U48705
1053_at     M87338
117_at      X51757
121_at      X69699
1255_g_at   L36861
1294_at     L13852

数据集2

> groups_data
   PatientID TreatmentGroup
1  GSM119946        Control
2  GSM119948        Control
3  GSM119951        Control
4  GSM119955        Control
5  GSM119956        Control
6  GSM119959        Control
7  GSM119947     Treatment1
8  GSM119950     Treatment1
9  GSM119952     Treatment1
10 GSM119953     Treatment1
11 GSM119957     Treatment1
12 GSM119958     Treatment1
13 GSM119944     Treatment2
14 GSM119945     Treatment2
15 GSM119949     Treatment2
16 GSM119954     Treatment2
17 GSM119960     Treatment2

使用我正在编写的两个样本t检验函数，我需要单独测试所有基因(对照病人组和治疗1患者组)，一个无效假设是mu_control=mu_treat1，另一个假设是mu_control < mu_treat1。

以下是合并的两个数据集(如果这有帮助的话)

总目(Groups_expr) GSM119944 GSM119945 GSM119946 GSM119947 GSM119948 GSM119949 GSM119950 1 11.376519 11.826743 11.123022 11.743439 12.172961 11.288407 2 7.270398 7.169297 7.730833 6.728914 7.033900 8.152550 3 8.172823 8.350568 8.216073 8.052177 7.940714 8.122496 8.246269 4 10.064195 11.1988 9.846189 10.549992 10.1722 10.357284 11.361081 5 6.256425 6.830607 5.830607 5.825010 6.098157 6.104971 5。818458 6.355995 6 9.347887 9.229501 9.464348 9.764903 9.300450 GSM119951 GSM119952 GSM119953 GSM119954 GSM119955 GSM119956 GSM119957 1 11.364544 11.783231 12.102697 12.141934 12.141672 11.541161 12.069206 2 7.357942 7.811469 6.704366 7.723678 7.607720 6.9079 6.837490 3 7.597745 8.809971 7.299070 7.8097 8.3907 7.6514 8.680945 4 10.446139 11.165541 10.285435 10.123510.123556 10.379335 10.2735 10.379335 10.487541 5 6.311312 7.366574 5.577412 4.570794 5.046956 6.561945 5.897955 6 9.230649 9.783263 8.749285 9.466965 9.076623 9.827835 GSM119958 GSM119959 GSM119960 GeneSymbol PublicID PatientID TreatmentGroup 1 11.529456 11.242966 11.242988 DDR1 U48705 GSM119946 Control 2 7.437899 7.608960 6.704648 RFC2 M87338 GSM119946 Control 3 8.050873 9.242006 8.253535 HSPA6 X51757 GSM119946 Control 4 10.54241910.248043 10.207259 PAX8 X69699 GSM119946 Control 5 5.402725 5.957542 6.201037 GUCA1A L36861 GSM119946 Control 6 9.096732 9.441370 9.102000 UBA7 L13852 GSM119946控件

尾(Groups_expr) GSM119944 GSM119945 GSM119946 GSM119947 GSM119948 GSM119949 GSM119950 378806 4.671951 4.731546 3.364612 2.893266 2.450373 4.375824 378807 2.954090 2.954090 4.653969 3.685073 3.9202165 5.387476 378808 3.159816 5.216588 3.989162 5.387770 5.579206 5.9640708 4.796789 378809 1.464918 1.892150 1.398225 1.780359 1.477039 0.8636322 5.2179 378810 3.567588 3.6488 3.6495 5.6495 5.0016 3.565525 4190032 3.2998454 4.903368 2.959766 3.164650 1.462571 2.681616 2.646549 3.3482051 GSM119951 GSM119952 GSM119953 GSM119954 GSM119955 GSM119956 GSM119957 378806 3.317340 5.957501 3.072479 3.395843 3.183937 3.332907 378807 4.008853 3.8083 3.8073 3.356645 3.979238 3.327875 3.143567 3.500472 378808 2.468878 4.937979 3.568130 3.105428 3.978494 3.431517 5.48591 378809 4.898809 4.893662 2.465712 1.7586 1.632630 1。GSM119958 GSM119959 GSM119960 GeneSymbol PublicID PatientID 378806 4.230085 3.740862 2.963901 AFFX-3 GSM119960 378807 3.405755 3.703066 3.421292 AFFX-ThrX-5 GSM119960 378808 4.3355 5.543589 3.771600GSM119960 378809 4.217012 2.025573 2.080592 AFFX-TrpnX-3 GSM119960 378810 5.254337 3.054821 4.731657 AFFX-TrpnX-5 GSM119960 378811 2.621702 1.619972 AFFX-M GSM119960 TreatmentGroup 378806 Treatment2 378807 Treatment2 378808 Treatment2 378809 Treatment2 378810 Treatment2 378811 Treatment2

有378811行，我需要测试所有这些行(基因)来比较Treatment1和对照病人(GSM*是病人Id)。

回答 1

Stack Overflow用户

回答已采纳

发布于 2015-12-08 03:35:51

有关公式引用，请参见这里。

x <- c(2,4,6,8,9,10,12,14)
y <- c(3,5,7,9,12,13,15,18)

tt <- function(x,y,mu0=0,ts=TRUE) { # two-sided t-test
  nx <- length(x)
  ny <- length(y)
  sp <- sqrt(((nx-1)*var(x)+(ny-1)*var(y))/(nx+ny-2))
  t <- (mean(x)-mean(y)-mu0) / (sp*sqrt(1/nx+1/ny))
  df <- (var(x)/nx + var(y)/ny)^2 / 
    ((var(x)/nx)^2/(nx-1) + (var(y)/ny)^2/(ny-1))

  sample_sizes <- c(nx, ny)
  names(sample_sizes) <- c("x","y")
  sample_means <- c(mean(x), mean(y))
  names(sample_means) <- c("x", "y")
  pvalue <- ifelse(ts,2,1)*(1-pt(abs(t),df=df))

  list(sample_sizes=sample_sizes, sample_means=sample_means,
       df=df, pvalue=pvalue)
}

tt(x,y)

# $sample_sizes
# x y 
# 8 8 
#
# $sample_means
#     x      y 
# 8.125 10.250 
#
# $df
# [1] 13.21697
#
# $pvalue
# [1] 0.3737564

编辑

下面是如何在示例中使用此函数的方法。这是第一个数据集。

expr_data <- data.frame(matrix(
c("1007_s_at", "11.376519", "11.826743", "11.123022", "11.743439", "12.172961", 
"11.522009", "11.288407", "11.364544", "11.783231", "12.102697", "12.141934", 
"12.141672", "11.541161", "12.069206", "11.529456", "9.692066", "11.242988", 
"DDR1", "U48705", "1053_at", "7.270398", "7.534450", "7.169297", "7.730833", 
"6.728914", "7.033900", "8.152550", "7.357942", "7.811469", "6.704366",  
"7.723678", "7.607720", "6.904579", "6.837490", "7.437899", "7.608960", 
"6.704648", "RFC2", "M87338", "117_at", "8.172823", "8.350568", "8.216073", 
"8.052177", "7.940714", "8.122496", "8.246269", "7.597745", "8.809971", 
"7.299070", "7.808597", "8.390707", "7.653514", "8.680945", "8.050873", 
"9.242006", "8.253535", "HSPA6", "X51757", "121_at", "10.064195", 
"11.193688", "9.846189", "10.549992", "10.172722", "10.357284", "11.361081", 
"10.446139", "11.165541", "10.285435", "10.123556", "10.532735", 
"10.379335", "10.487541", "10.542419", "10.248043", "10.207259", "PAX8", 
"X69699", "1255_g_at", "6.256425", "6.830607", "5.825010", "6.098157", 
"6.104971", "5.818458", "6.355995", "6.311312", "7.366574", "5.577412", 
"4.570794", "5.046956", "6.561945", "5.897955", "5.402725", "5.957542", 
"6.201037", "GUCA1A", "L36861", "1294_at", "9.347887", "9.540260", 
"9.229501", "9.464348", "9.764903", "9.962180", "9.300450", "9.230649", 
"9.783263","8.749285", "9.466965", "9.653450", "9.076623", "9.827835", 
"9.096732", "9.441370", "9.102000", "UBA7", "L13852"), nrow=6, byrow=TRUE))

row.names(expr_data) <- expr_data[,1]
expr_data <- expr_data[,-1]
names(expr_data) <- c("GSM119944", "GSM119945", "GSM119946", "GSM119947", 
                      "GSM119948", "GSM119949", "GSM119950", "GSM119951", 
                      "GSM119952", "GSM119953", "GSM119954", "GSM119955", 
                      "GSM119956", "GSM119957", "GSM119958", "GSM119959", 
                      "GSM119960", "GeneSymbol", "PublicID")
expr_data[,1:17] <- sapply(expr_data[,1:17], function(x)
  as.numeric(as.character(x)))

和第二个数据集。

groups_data <- data.frame(
  PatientID=c('GSM119946','GSM119948','GSM119951','GSM119955','GSM119956',
              'GSM119959','GSM119947','GSM119950','GSM119952','GSM119953',
              'GSM119957','GSM119958','GSM119944','GSM119945','GSM119949',
              'GSM119954','GSM119960'),
  TreatmentGroup = c(rep('Control',6), rep('Treatment1',6), 
                     rep('Treatment2',5))
)

然后，执行适当的测试。

control_index <- which(groups_data$TreatmentGroup=="Control")
treatment_index <- which(groups_data$TreatmentGroup=="Treatment1")
# assume length(control_index) = length(treatment_index)
for(i in 1:length(control_index)) {
  control_group <- expr_data[,groups_data$PatientID[control_index[i]]]
  treatment_group <- expr_data[,groups_data$PatientID[treatment_index[i]]]
  cat("T-test for", as.character(groups_data$PatientID[control_index[i]]), "and", 
      as.character(groups_data$PatientID[treatment_index[i]]), "\n")
  result <- tt(control_group, treatment_group, 0, FALSE)
  cat("  sample sizes:", as.numeric(result$sample_sizes),"\n")
  cat("  sample means:", as.numeric(result$sample_means),"\n")
  cat("  degrees of freedom:", as.numeric(result$df),"\n")
  cat("  p-value:", as.numeric(result$pvalue),"\n\n")
}

输出

T-test for GSM119946 and GSM119947 
sample sizes: 6 6 
sample means: 8.568182 8.939824 
degrees of freedom: 9.947607 
p-value: 0.3759984 

T-test for GSM119948 and GSM119950 
sample sizes: 6 6 
sample means: 8.814198 9.117459 
degrees of freedom: 9.744115 
p-value: 0.4053794 

T-test for GSM119951 and GSM119952 
sample sizes: 6 6 
sample means: 8.718055 9.453342 
degrees of freedom: 9.916609 
p-value: 0.2560534 

T-test for GSM119955 and GSM119953 
sample sizes: 6 6 
sample means: 8.89554 8.453044 
degrees of freedom: 9.996676 
p-value: 0.38034 

T-test for GSM119956 and GSM119957 
sample sizes: 6 6 
sample means: 8.686193 8.966829 
degrees of freedom: 9.792449 
p-value: 0.4132705 

T-test for GSM119959 and GSM119958 
sample sizes: 6 6 
sample means: 8.698331 8.676684 
degrees of freedom: 9.134459 
p-value: 0.492472

编辑2

如果要测试行，可以使用以下代码。

control_cols <- groups_data$PatientID[which(groups_data$TreatmentGroup=="Control")]
treatment_cols <- groups_data$PatientID[which(groups_data$TreatmentGroup=="Treatment1")]

nrows <- dim(expr_data)[1]
for(i in 1:nrows) {
  control_group <- expr_data[i,control_cols]
  treatment_group <- expr_data[i,treatment_cols]

  cat("T-test for control vs treatment (", row.names(treatment_group), ")\n")
  result <- tt(as.numeric(control_group), as.numeric(treatment_group), 0, FALSE)
  cat("  sample sizes:", as.numeric(result$sample_sizes),"\n")
  cat("  sample means:", as.numeric(result$sample_means),"\n")
  cat("  degrees of freedom:", as.numeric(result$df),"\n")
  cat("  p-value:", as.numeric(result$pvalue),"\n\n")
}

票数 0

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/34147163

复制

相似问题

问如何为独立的2样本t检验(x，y)编写代码
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何为独立的2样本t检验(x，y)编写代码EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何为独立的2样本t检验(x，y)编写代码
EN