Question: Fasta to dataframe -R

Stack Overflow user

Asked 2014-11-08 07:31:33

Answers: 1 · Views: 1.4K · Followers: 0 · Votes: 0

In this script, which I found on the net (http://www.slideshare.net/schamber/phylogenetics-in-r), I generate a list of cotton sequences.

require(ape) 
# make vector of accession numbers, for ITS 1 and 2 region for Gossypium (cotton) species 
cotton_acc <- c("U56806", "U12712", "U56810", 
                "U12732", "U12725", "U56786", "U12715", 
                "AF057758", "U56790", "U12716", "U12729", 
                "U56798", "U12727", "U12713", "U12719", 
                "U56811", "U12728", "U12730", "U12731", 
                "U12722", "U56796", "U12714", "U56789", 
                "U56797", "U56801", "U56802", "U12718", 
                "U12710", "U56804", "U12734", "U56809", 
                "U56812", "AF057753", "U12711", "U12717", 
                "U12723", "U12726") 
# get data from Genbank 
cotton <- read.GenBank(cotton_acc, species.names = TRUE) 
# name the sequences with species names instead of accession numbers 
names_accs <- data.frame(species = attr(cotton, "species"), accs = names(cotton)) 
names(cotton) <- attr(cotton, "species")
write.dna(cotton, "C:/Users/Comp12/Desktop/cotton.fas", format = "fasta")

Output:

> cotton
37 DNA sequences in binary format stored in a list.

Mean sequence length: 681.595 
   Shortest sequence: 667 
    Longest sequence: 687 

Labels: Gossypium_anomalum Gossypium_arboreum Gossypium_areysianum Gossypium_aridum Gossypium_armourianum Gossypium_australe ...

Base composition:
    a     c     g     t 
0.212 0.302 0.280 0.205

How can I arrange this into a data frame with the columns in the order cotton_acc, species.names, sequence, base composition? In total I should get 37 rows.

Thanks

dataframe

bioinformatics

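1 Answer

Stack Overflow user

Accepted answer

Answered 2014-11-08 11:14:49

The following will give you a data frame with the species names and the DNA sequences, at least. Since I am not familiar with DNA, I do not know what acc and base composition are. It seems to me that you will need to do some calculation yourself to get the base composition. I hope experts in your field can guide you further.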

library(dplyr)
library(tidyr)

# http://svitsrv25.epfl.ch/R-doc/library/ape/html/as.alignment.html
# Convert class 'DNAbin' to character to get the letters of each DNA sequence

foo <- lapply(cotton, function(x) as.character(x[1:length(x)]))

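# A small helper that turns a list of vectors into a data.frame (one column
# per list element), padding shorter vectors with NA.

listvec2df <- function(l){

    n.obs <- sapply(l, length)
    seq.max <- seq_len(max(n.obs))
    # Indexing past the end of a vector yields NA, which pads it to the max length
    data.frame(sapply(l, "[", i = seq.max), stringsAsFactors = FALSE)

}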

# Create a data frame with names from the list (i.e., cotton) and listvec2df(foo),
# which is transposed.

foo2 <- data.frame(names(foo), t(listvec2df(foo)), stringsAsFactors = FALSE)
foo2 <- foo2 %>%
        separate(names.foo., c("cotton", "species"), sep = "_")

#      cotton       species X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
#1  Gossypium      anomalum  t  c  g  a  a  a  c  c  t   c   c   c   t   a   a
#2  Gossypium      arboreum  t  c  g  a  a  a  c  c  t   g   c   c   t   a   g
#3  Gossypium    areysianum  t  c  g  a  a  a  c  c  t   g   c   c   t   a   g
#4  Gossypium        aridum  t  c  g  a  a  a  c  c  t   g   c   c   t   a   g
#5  Gossypium   armourianum  t  c  g  a  a  a  c  c  t   g   c   c   t   a   g
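
The padding done by listvec2df relies on R returning NA when a vector is indexed past its end. Here is a minimal sketch of that behaviour on toy vectors; the names seqs, longest, padded and padded_t are made up for illustration and are not part of the cotton data:

```r
# Toy sequences of unequal length (hypothetical data, not from GenBank).
seqs <- list(s1 = c("a", "c", "g"), s2 = c("t", "t"))

# Index every vector out to the longest length; positions past the end
# of a shorter vector come back as NA.
longest <- max(lengths(seqs))
padded <- sapply(seqs, "[", seq_len(longest))

# sapply() puts each sequence in a column, so transpose to get
# one row per sequence and one column per position.
padded_t <- t(padded)
print(padded_t)
```

The same mechanism is what fills the trailing columns for cotton sequences shorter than the longest one.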

The result has 37 rows (one per species), and the DNA sequence takes up 687 columns. Sequences shorter than 687 bases are padded with NA.

dim(foo2)
#[1]  37 689
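
If one row per sequence with the accession number, species name, the sequence as a single string, and the per-sequence base composition is what is wanted, something along these lines might work. This is only a sketch in base R on toy data: foo_toy stands in for the list of character vectors (foo above), acc_toy holds made-up accession numbers, and base_fraction is a hypothetical helper, not an ape function. It assumes lower-case bases and ignores anything other than a, c, g and t.

```r
# Toy stand-in for `foo`: a named list of lower-case base vectors.
foo_toy <- list(
  Gossypium_anomalum = c("t", "c", "g", "a", "a", "a", "c", "c"),
  Gossypium_arboreum = c("t", "c", "g", "a", "g", "g")
)
# Made-up accession numbers, in the same order as the sequences.
acc_toy <- c("ACC0001", "ACC0002")

bases <- c("a", "c", "g", "t")

# Fraction of each base within one sequence; symbols other than
# a/c/g/t (e.g. ambiguity codes or NA) are not counted.
base_fraction <- function(x) {
  counts <- table(factor(x, levels = bases))
  as.numeric(counts) / sum(counts)
}

# One row per sequence, one column per base.
comp <- t(sapply(foo_toy, base_fraction))
colnames(comp) <- paste0("frac_", bases)

seq_df <- data.frame(
  accs     = acc_toy,
  species  = names(foo_toy),
  sequence = unname(vapply(foo_toy, paste, character(1), collapse = "")),
  comp,
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(seq_df)
```

With the real data, the accession numbers would presumably come from the accs column of names_accs in the question, provided it is still in the same order as the sequences.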

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/26814902
