In this script, which I found online (http://www.slideshare.net/schamber/phylogenetics-in-r), I generate a list of cotton sequences.
require(ape)
# make vector of accession numbers, for ITS 1 and 2 region for Gossypium (cotton) species
cotton_acc <- c("U56806", "U12712", "U56810",
"U12732", "U12725", "U56786", "U12715",
"AF057758", "U56790", "U12716", "U12729",
"U56798", "U12727", "U12713", "U12719",
"U56811", "U12728", "U12730", "U12731",
"U12722", "U56796", "U12714", "U56789",
"U56797", "U56801", "U56802", "U12718",
"U12710", "U56804", "U12734", "U56809",
"U56812", "AF057753", "U12711", "U12717",
"U12723", "U12726")
# get data from Genbank
cotton <- read.GenBank(cotton_acc, species.names = TRUE)
# name the sequences with species names instead of accession numbers
names_accs <- data.frame(species = attr(cotton, "species"), accs = names(cotton))
names(cotton) <- attr(cotton, "species")
write.dna(cotton, "C:/Users/Comp12/Desktop/cotton.fas", format = "fasta")
Output:
> cotton
37 DNA sequences in binary format stored in a list.
Mean sequence length: 681.595
Shortest sequence: 667
Longest sequence: 687
Labels: Gossypium_anomalum Gossypium_arboreum Gossypium_areysianum Gossypium_aridum Gossypium_armourianum Gossypium_australe ...
Base composition:
a c g t
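0.212 0.302 0.280 0.205
How can I arrange this into a data frame with the columns cotton_acc, species.names, sequence, and base composition, in that order? (In total I should get 37 rows.)
Thanks.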
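Posted on 2014-11-08 11:14:49
The following will give you a data frame with, at least, the species names and the DNA sequences. Since I am not familiar with DNA, I don't know what acc and base composition are. It seems to me that you will need to do some calculation yourself to get the base composition. I hope experts in this field can guide you further.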
library(dplyr)
library(tidyr)
# http://svitsrv25.epfl.ch/R-doc/library/ape/html/as.alignment.html
# class 'DNAbin' to `character` to get alphabets for DNA sequence
foo <- lapply(cotton, function(x) as.character(x[1:length(x)]))
# A tiny function to create a data.frame with vectors in lists, which I have.
listvec2df <- function(l) {
    n.obs <- sapply(l, length)
    seq.max <- seq_len(max(n.obs))
    # indexing past the end of a shorter vector yields NA, which pads the columns
    mydf <- data.frame(sapply(l, "[", i = seq.max), stringsAsFactors = FALSE)
    mydf
}
# Create a data frame with names from the list (i.e., cotton) and listvec2df(foo),
# which is transposed.
foo2 <- data.frame(names(foo), t(listvec2df(foo)), stringsAsFactors = FALSE)
foo2 <- foo2 %>%
    separate(names.foo., c("cotton", "species"), sep = "_")
# cotton species X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
#1 Gossypium anomalum t c g a a a c c t c c c t a a
#2 Gossypium arboreum t c g a a a c c t g c c t a g
#3 Gossypium areysianum t c g a a a c c t g c c t a g
#4 Gossypium aridum t c g a a a c c t g c c t a g
#5 Gossypium armourianum t c g a a a c c t c c c t a g
The DNA sequences give 37 rows (37 species) and 687 base columns. Where a sequence is shorter than 687 bases, NAs are added.
dim(foo2)
#[1] 37 689
https://stackoverflow.com/questions/26814902
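As for the base composition the question asks about, here is a minimal sketch of one way to get a table with one row per sequence, assuming ape's `base.freq()` is what is wanted for the composition. The helper name `seq_table` and the toy accessions (`ACC0001`, `ACC0002`) are made up for illustration; the toy object only stands in for `cotton` so the example runs without contacting GenBank.

```r
library(ape)

# Toy stand-in for `cotton`: a DNAbin list named by (hypothetical) accession numbers.
toy <- as.DNAbin(list(
  ACC0001 = c("a", "c", "g", "t", "a", "c"),
  ACC0002 = c("t", "t", "g", "c")
))
# read.GenBank(..., species.names = TRUE) keeps the species in this attribute.
attr(toy, "species") <- c("Gossypium_anomalum", "Gossypium_arboreum")

# Hypothetical helper: one row per sequence with accession, species,
# the sequence collapsed to a single string, and the a/c/g/t proportions.
seq_table <- function(x) {
  # base.freq() on a one-element DNAbin list gives that sequence's composition
  comp <- t(sapply(seq_along(x), function(i) base.freq(x[i])))
  data.frame(
    acc      = names(x),
    species  = attr(x, "species"),
    sequence = unname(sapply(as.character(x), paste, collapse = "")),
    round(comp, 3),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

res <- seq_table(toy)
print(res)
```

With the real data this would have to be called on `cotton` before `names(cotton)` is overwritten with the species names, since the sketch reads the accession numbers from `names(x)`. Storing each sequence as one string also avoids the NA padding needed for the 687-column layout.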