
Fasta to dataframe -R

Stack Overflow user
Asked 2014-11-08 07:31:33
1 answer, 1.4K views, 0 followers, 0 votes

Using this script, which I found online (http://www.slideshare.net/schamber/phylogenetics-in-r), I generated a list of cotton sequences.

require(ape) 
# make vector of accession numbers, for ITS 1 and 2 region for Gossypium (cotton) species 
cotton_acc <- c("U56806", "U12712", "U56810", 
                "U12732", "U12725", "U56786", "U12715", 
                "AF057758", "U56790", "U12716", "U12729", 
                "U56798", "U12727", "U12713", "U12719", 
                "U56811", "U12728", "U12730", "U12731", 
                "U12722", "U56796", "U12714", "U56789", 
                "U56797", "U56801", "U56802", "U12718", 
                "U12710", "U56804", "U12734", "U56809", 
                "U56812", "AF057753", "U12711", "U12717", 
                "U12723", "U12726") 
# get data from Genbank 
cotton <- read.GenBank(cotton_acc, species.names = TRUE)
# name the sequences with species names instead of accession numbers
names_accs <- data.frame(species = attr(cotton, "species"), accs = names(cotton)) 
names(cotton) <- attr(cotton, "species")
write.dna(cotton, "C:/Users/Comp12/Desktop/cotton.fas", format = "fasta") 

Output:

> cotton
37 DNA sequences in binary format stored in a list.

Mean sequence length: 681.595 
   Shortest sequence: 667 
    Longest sequence: 687 

Labels: Gossypium_anomalum Gossypium_arboreum Gossypium_areysianum Gossypium_aridum Gossypium_armourianum Gossypium_australe ...

Base composition:
    a     c     g     t 
0.212 0.302 0.280 0.205 

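How can I arrange this into a data frame whose columns are, in order, cotton_acc, species.names, sequence, and base composition? (I should end up with 37 rows in total.)

Thanks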

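1 Answer

Stack Overflow user

Accepted answer

Posted 2014-11-08 11:14:49

The following will give you a data frame with the species names and the DNA sequences, at least. Since I am not familiar with DNA, I do not know what acc and base composition are. It looks to me as if you will need to do some calculation yourself to get the base composition. I hope experts in your field can guide you further.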

library(dplyr)
library(tidyr)

# http://svitsrv25.epfl.ch/R-doc/library/ape/html/as.alignment.html
# class 'DNAbin' to `character` to get alphabets for DNA sequence

foo <- lapply(cotton, function(x) as.character(x[1:length(x)]))

# A small helper that builds a data.frame from a list of vectors,
# padding shorter vectors with NA so that all columns have the same length.

listvec2df <- function(l){

    n.obs <- sapply(l, length)
    seq.max <- seq_len(max(n.obs))
    mydf <- data.frame(sapply(l, "[", i = seq.max), stringsAsFactors = FALSE)
    mydf  # return the data frame explicitly

}

# Create a data frame with names from the list (i.e., cotton) and listvec2df(foo),
# which is transposed.

foo2 <- data.frame(names(foo), t(listvec2df(foo)), stringsAsFactors = FALSE)
foo2 <- foo2 %>%
        separate(names.foo., c("cotton", "species"), sep = "_")

#      cotton       species X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
#1  Gossypium      anomalum  t  c  g  a  a  a  c  c  t   c   c   c   t   a   a
#2  Gossypium      arboreum  t  c  g  a  a  a  c  c  t   g   c   c   t   a   g
#3  Gossypium    areysianum  t  c  g  a  a  a  c  c  t   g   c   c   t   a   g
#4  Gossypium        aridum  t  c  g  a  a  a  c  c  t   g   c   c   t   a   g
#5  Gossypium   armourianum  t  c  g  a  a  a  c  c  t   g   c   c   t   a   g

The DNA sequence part has 37 rows (37 species) and 687 columns. Where a sequence is shorter than 687 bases, NAs are added at the end.
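To see where the NAs come from, here is a minimal sketch of the padding step on its own. It uses two made-up toy vectors (`short` and `long` are hypothetical, not the cotton data): indexing a vector with `[` beyond its length returns NA for the missing positions, which is what pads the shorter sequences.

```r
# Toy vectors standing in for two sequences of different length (hypothetical data)
short <- c("t", "c", "g")
long  <- c("t", "c", "g", "a", "a")

# Index every vector from 1 to the length of the longest one;
# positions past the end of a shorter vector come back as NA.
seq.max <- seq_len(max(length(short), length(long)))
padded  <- sapply(list(short = short, long = long), "[", i = seq.max)

# One column per sequence; transpose to get one row per sequence
t(padded)
```

The same mechanism is at work inside listvec2df(), applied to all 37 sequences.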

dim(foo2)
#[1]  37 689
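If one row per sequence with the whole sequence in a single column is closer to what the question asks for, something along these lines might work. This is only a sketch in base R on made-up toy data: `seqs` stands in for `foo`, `accs` holds placeholder accession labels, and `base_freq` is a hypothetical helper, not an ape function. In the question's own code, `names_accs` already holds the species-to-accession mapping. The proportions are computed over a, c, g and t only, so any other symbols are ignored (an assumption that may not suit every data set).

```r
# Toy stand-in for `foo`: a named list of character vectors (hypothetical data)
seqs <- list(
    Gossypium_anomalum = c("t", "c", "g", "a", "a", "a", "c", "c"),
    Gossypium_arboreum = c("t", "c", "g", "a", "a", "g")
)
accs <- c("ACC0001", "ACC0002")  # placeholder labels, not real accession numbers

# Hypothetical helper: proportion of a, c, g, t in one sequence
base_freq <- function(x) {
    counts <- table(factor(x, levels = c("a", "c", "g", "t")))
    as.numeric(counts) / sum(counts)
}

# One row per sequence, one column per base
comp <- t(sapply(seqs, base_freq))
colnames(comp) <- c("a", "c", "g", "t")

# Accession, species, full sequence as one string, then base composition
result <- data.frame(
    accession = accs,
    species   = names(seqs),
    sequence  = unname(vapply(seqs, paste, character(1), collapse = "")),
    comp,
    row.names = NULL,
    stringsAsFactors = FALSE
)

result
```

With the real data, the sequence lengths differ, so collapsing each sequence into one string avoids the NA padding entirely.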
Votes: 1
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/26814902
