首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >计算每个转录本的平均外显子数

计算每个转录本的平均外显子数
EN

Stack Overflow用户
提问于 2020-06-08 16:02:04
回答 1查看 1.4K关注 0票数 2

我想要能够计算外显子,,每,转录,在我的文件中的平均数量。

只使用FYI :我正在使用的文件格式称为GENCODE GTF文件。

要查找每个文本的外显子数,我们必须在以分号分隔的字段9 ($9)列表中搜索,但只搜索标签分隔文件的外显子特征行(字符串"exon“位于字段3($3)中的行)。

每个唯一的transcript_id都有一系列相关的外显子特征线,字段9($9)中的"exon_number n“数量越来越多:

代码语言:javascript
复制
...; transcript_id "ENST00000488147.1"; ...; exon_number 1; ...
...; transcript_id "ENST00000488147.1"; ...; exon_number 2; ...
...
...; transcript_id "ENST00000488147.1"; ...; exon_number n; ...

要获得每个转录本的平均外显子数,只需从每个唯一的exon_number中添加最大的trasnscript_id,然后除以使用的唯一transcript_id的总数。

以下是我文件中任何给定行的10种可能格式:

代码语言:javascript
复制
chr19   HAVANA   gene   405438   409170   .   -   .   gene_id "ENSG00000183186.7"; gene_type "protein_coding"; gene_name "C2CD4C"; level 2; havana_gene "OTTHUMG00000180534.3";
chr19   HAVANA   transcript   405438   409170   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   exon   409006   409170   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.5"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   exon   405438   408401   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   CDS   407099   408361   .   -   0   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   start_codon   408359   408361   .   -   0   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   stop_codon   407096   407098   .   -   0   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   UTR   409006   409170   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 1; exon_id "ENSE00001322986.5"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   UTR   405438   407098   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";
chr19   HAVANA   UTR   408362   408401   .   -   .   gene_id "ENSG00000183186.7"; transcript_id "ENST00000332235.7"; gene_type "protein_coding"; gene_name "C2CD4C"; transcript_type "protein_coding"; transcript_name "C2CD4C-001"; exon_number 2; exon_id "ENSE00001290344.6"; level 2; protein_id "ENSP00000328677.4"; transcript_support_level "2"; tag "basic"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS45890.1"; havana_gene "OTTHUMG00000180534.3"; havana_transcript "OTTHUMT00000451789.3";

注意:字符串"exon_number n“出现在除"gene”和"transcript“之外的所有特征类型(字段3)的行中,因此,计算应该只针对字段3中"exon”的行。

示例行来自实际文件

代码语言:javascript
复制
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-002"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  transcript      12010   13670   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1    HAVANA  exon    12010   12057   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 1; exon_id "ENSE00001948541.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1    HAVANA  exon    12179   12227   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 2; exon_id "ENSE00001671638.2"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1    HAVANA  exon    12613   12697   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 3; exon_id "ENSE00001758273.2"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1    HAVANA  exon    12975   13052   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 4; exon_id "ENSE00001799933.2"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1    HAVANA  exon    13221   13374   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 5; exon_id "ENSE00001746346.2"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1    HAVANA  exon    13453   13670   .       +       .       gene_id "ENSG00000223972.5"; transcript_id "ENST00000450305.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-001"; exon_number 6; exon_id "ENSE00001863096.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000002844.2";
chr1    HAVANA  gene    14404   29570   .       -       .       gene_id "ENSG00000227232.5"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; level 2; havana_gene "OTTHUMG00000000958.1";
chr1    HAVANA  transcript      14404   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    29534   29570   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 1; exon_id "ENSE00001890219.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    24738   24891   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 2; exon_id "ENSE00003507205.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    18268   18366   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 3; exon_id "ENSE00003477500.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    17915   18061   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 4; exon_id "ENSE00003565697.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    17606   17742   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 5; exon_id "ENSE00003475637.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    17233   17368   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 6; exon_id "ENSE00003502542.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    16858   17055   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 7; exon_id "ENSE00003553898.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    16607   16765   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 8; exon_id "ENSE00003621279.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    15796   15947   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 9; exon_id "ENSE00002030414.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    15005   15038   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 10; exon_id "ENSE00001935574.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    HAVANA  exon    14404   14501   .       -       .       gene_id "ENSG00000227232.5"; transcript_id "ENST00000488147.1"; gene_type "unprocessed_pseudogene"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_name "WASH7P-001"; exon_number 11; exon_id "ENSE00001843071.1"; level 2; transcript_support_level "NA"; ont "PGO:0000005"; tag "basic"; havana_gene "OTTHUMG00000000958.1"; havana_transcript "OTTHUMT00000002839.1";
chr1    ENSEMBL gene    17369   17436   .       -       .       gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_name "MIR6859-1"; level 3;

基于我的示例:所需的输出:

代码语言:javascript
复制
6.66667

由于ENST00000456328.2有3个外显子,ENST00000450305.2有6个外显子,ENST00000488147.1 ~有11个外显子(3+6+ 11)/3 = 6.66667

尝试解决方案

(i) I试图为字段3中所有具有"exon“的行打印一个列表字段9。

代码语言:javascript
复制
if [ $3 == "exon" ]; then
    awk '{a[$9]++}END{for(i in a){print i, a[i]}}' file
fi

我相信这意味着我的代码只是计算第9栏的第一部分:gene_id。不幸的是,我无法得到一个数字的平均数。

(Ii)我试着打印字段3中所有有"exon“的行,然后数一数exon_number为一个独特的"transcript_id”出现的次数,但是我现在对awk的了解非常薄弱,所以我觉得脚本已经过时了。下面输出一个唯一gene_id的列表以及它们各自的计数,这样输出文件的每一行看起来都类似于这个"ENSG00000242201.3"; 1"ENSG00000132478.9"; 41"ENSG00000155099.7"; 34

代码语言:javascript
复制
awk '$3=="exon" {print $0}' file | awk '{count[$10]++} END {for (word in count) print word, count[word]}'

生成的计数是不准确的,因为它们只是将所有相同的gene_id的行与外显子相加。这意味着可以在文件中找到/统计ENSG00000183186.747次数,但由于可选剪接(此问题的生物学相关性),该基因实际上会产生许多不同大小的转录本,而代码却忽略了这一点。它应该指定为"transcript_id",而不是"gene_id“。

(iii) --这段代码正在进行中,需要进行大量的重新加工。我尝试(1)只有当字段3包含字符串exon时才提取字段9,(2)拆分";“分隔列表,只提取每个唯一"transcript_id”的最高值exon_number n,然后(3)计算每个转录本的平均外显子数。

代码语言:javascript
复制
awk '$3 == "exon" {print $9}' file
#step needed here to split the “;” separated fields $9 to extract the max exon_number
awk '{ total += $1 } END { print total/NR }' file
EN

回答 1

Stack Overflow用户

回答已采纳

发布于 2020-06-08 18:30:03

好吧,我还是不明白整个故事。所以无论如何:

代码语言:javascript
复制
# extract the lines with 3rd column equal to "exon"
awk '$3 == "exon"' |
# Extract the interesting parts with a regex. I like `sed` more, so I'll use it.
# extract the transcript_id and exon number
# separator is ';'
sed -n 's/.*transcript_id \([^;]*\).*exon_number \([^;]*\).*/\1;\2/p' |
# because I noticed that number is equal to count of lines
# just count the lines with the same transcript_id
# (actually exon_number is ignored)
# and then calculate the average of the array
awk -F';' '{a[$1]++} END{for (i in a) {cnt++; sum += a[i]} print sum/cnt}'
# or you could also like get the highest number from exon_number with like
# awk ... '{ if (a[$1] < $2) a[$1]=$2 } ...'

作为在repl上测试脚本输出提供的输入

代码语言:javascript
复制
6.66667

我在上面留下了3个命令,所以看起来很冗长,同样的工作可以用一个awk脚本来完成,这比三个进程链更快。只需在awk中使用gensub,而不是在awk中使用sedsplit将输入拆分到awk中的数组。

票数 2
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/62266289

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档