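I have a table of genetic variants in which each row represents a patient who carries the variant, together with whether that patient is a case or a control. To run Fisher's test, I want to output a separate matrix with three columns: the variant, the number of cases, and the number of controls.
I'm using R, and the table looks like this (PID = patient ID):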
Variant ID PID Disease
2:4324:2343 FF354 Yes
2:4324:2343 FF355 Control
2:4324:2343 FF356 Control
2:4324:2343 FF357 Yes
2:4324:2343 FF358 Yes
3:346543:345 FF354 Yes
3:346543:345 FF358 Control
3:346543:345 FF390 Control
3:346543:345 FF391 Yes
6:234:34234 FF358 Yes
6:234:34234 FF390 Control
6:234:34234 FF358 Control
6:234:34234 FF213 Yes

The expected output would be:
Variant ID Disease Control
2:4324:2343 3 2
3:346543:345 2 2
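6:234:34234 2 2

I think I will have to use a loop in R, but I must admit that this is beyond me at the moment, as I am still getting to grips with R. Any help would be much appreciated!
Many thanks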
Answered 2020-01-21 17:01:22
You can use tapply, which gives you a nice matrix.
with(dat, tapply(Disease, list(Variant_ID, Disease), length))
# Control Yes
# 2:4324:2343 2 3
# 3:346543:345 2 2
# 6:234:34234 2 2

Data:
dat <- structure(list(Variant_ID = structure(c(1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("2:4324:2343", "3:346543:345",
"6:234:34234"), class = "factor"), PID = structure(c(2L, 3L,
4L, 5L, 6L, 2L, 6L, 7L, 8L, 6L, 7L, 6L, 1L), .Label = c("FF213",
"FF354", "FF355", "FF356", "FF357", "FF358", "FF390", "FF391"
), class = "factor"), Disease = structure(c(2L, 1L, 1L, 2L, 2L,
2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("Control", "Yes"), class = "factor")), class = "data.frame", row.names = c(NA,
-13L))

Answered 2020-01-21 16:52:08
We can get the frequency counts with count and then reshape them to "wide" format.
library(dplyr)
library(tidyr)
df1 %>%
count(VariantID, Disease) %>%
pivot_wider(names_from = Disease, values_from = n)
# A tibble: 3 x 3
# VariantID Control Yes
# <chr> <int> <int>
#1 2:4324:2343 2 3
#2 3:346543:345 2 2
#3 6:234:34234 2 2

Or with table from base R:
table(df1[c('VariantID', 'Disease')])
# Disease
#VariantID Control Yes
# 2:4324:2343 2 3
# 3:346543:345 2 2
# 6:234:34234 2 2

Data:
df1 <- structure(list(VariantID = c("2:4324:2343", "2:4324:2343", "2:4324:2343",
"2:4324:2343", "2:4324:2343", "3:346543:345", "3:346543:345",
"3:346543:345", "3:346543:345", "6:234:34234", "6:234:34234",
"6:234:34234", "6:234:34234"), PID = c("FF354", "FF355", "FF356",
"FF357", "FF358", "FF354", "FF358", "FF390", "FF391", "FF358",
"FF390", "FF358", "FF213"), Disease = c("Yes", "Control", "Control",
"Yes", "Yes", "Yes", "Control", "Control", "Yes", "Yes", "Control",
"Control", "Yes")), class = "data.frame", row.names = c(NA, -13L
))

Answered 2020-01-21 16:53:20
Using dcast from data.table:
library(data.table)
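setDT(df)  # convert df to a data.table by reference
# count rows per VariantID/Disease pair; naming fun.aggregate and value.var
# explicitly avoids relying on dcast's guessed defaults
dcast(df, VariantID ~ Disease, fun.aggregate = length, value.var = "PID")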
# VariantID Control Yes
#1 2:4324:2343 2 3
#2 3:346543:345 2 2
#3 6:234:34234 2 2

Data:
df <- structure(list(VariantID = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("2:4324:2343", "3:346543:345", "6:234:34234"), class = "factor"), PID = structure(c(2L, 3L,4L, 5L, 6L, 2L, 6L, 7L, 8L, 6L, 7L, 6L, 1L), .Label = c("FF213","FF354", "FF355", "FF356", "FF357", "FF358", "FF390", "FF391"), class = "factor"), Disease = structure(c(2L, 1L, 1L, 2L, 2L,2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("Control", "Yes"), class = "factor")), class = "data.frame", row.names = c(NA, -13L))

https://stackoverflow.com/questions/59845744
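One thing that may matter on real data: a variant seen only in cases (or only in controls) gives an empty cell. As far as I know, tapply with length leaves NA in that cell unless you pass default, whereas table fills in 0. Here is a small sketch of the difference. The variant IDs v1 and v2 are hypothetical and are not taken from the question's data.

```r
# Hypothetical toy data: v2 has no control row, so one cell is empty
dat <- data.frame(
  Variant_ID = c("v1", "v1", "v2"),
  Disease    = c("Yes", "Control", "Yes"),
  stringsAsFactors = FALSE
)

# tapply with length: the empty v2/Control cell comes back as NA
m_na <- with(dat, tapply(Disease, list(Variant_ID, Disease), length))

# default = 0L (available since R 3.4.0) initialises empty cells to 0 instead
m_zero <- with(dat, tapply(Disease, list(Variant_ID, Disease), length,
                           default = 0L))

# table counts the empty combination as 0 without any extra argument
tab <- table(dat$Variant_ID, dat$Disease)

print(m_na)
print(m_zero)
print(tab)
```

If the counts are going into a test afterwards, the zero-filled versions are probably the safer input.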
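Since the goal in the question is Fisher's test, here is a rough sketch of how the count table could feed fisher.test, one variant at a time and without an explicit loop. Fisher's test needs a 2x2 table, so the number of non-carriers is also required. The question does not give the total cohort sizes, so n_cases and n_controls below are made-up placeholder values that you would replace with your own.

```r
# Variant and status columns from the question's example table
df <- data.frame(
  VariantID = rep(c("2:4324:2343", "3:346543:345", "6:234:34234"),
                  times = c(5, 4, 4)),
  Disease   = c("Yes", "Control", "Control", "Yes", "Yes",
                "Yes", "Control", "Control", "Yes",
                "Yes", "Control", "Control", "Yes"),
  stringsAsFactors = FALSE
)

# Carrier counts: one row per variant, columns "Control" and "Yes"
counts <- table(df$VariantID, df$Disease)

# ASSUMPTION: hypothetical cohort sizes, not given in the question
n_cases    <- 50
n_controls <- 50

# One Fisher's exact test per variant on a carrier / non-carrier 2x2 table
p_values <- apply(counts, 1, function(x) {
  m <- matrix(c(x["Yes"],     n_cases    - x["Yes"],
                x["Control"], n_controls - x["Control"]),
              nrow = 2,
              dimnames = list(c("carrier", "non_carrier"),
                              c("case", "control")))
  fisher.test(m)$p.value
})

# Three-column layout asked for in the question, plus the p-value
result <- data.frame(VariantID = rownames(counts),
                     Disease   = as.vector(counts[, "Yes"]),
                     Control   = as.vector(counts[, "Control"]),
                     p_value   = as.vector(p_values))
print(result)
```

The p-values depend entirely on the placeholder cohort sizes, so treat them as illustrative only.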