Question: Unable to remove line breaks from an Ensembl FASTA file

Stack Overflow user

Asked on 2013-05-02 03:24:20

1 answer, 150 views, 0 followers, 0 votes

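I am trying to find protein motifs in an Ensembl FASTA file. I have most of the script done, such as retrieving the sequence ID and the sequence itself, but I am getting some odd results.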

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $motif1 = qr/(HE(\D)(\D)H(\D{18})E)/x;
my $motif2 = qr/(AMEN)/x;
my $input;
my $output;
my $count_total     = 0;
my $count_processed = 0;
my $total_run       = 0;
my $id;
my $seq;
my $motif1_count    = 0;
my $motif2_count    = 0;
my $motifboth_count = 0;

############################################################################################################################
# FILEHANDLING - INPUT/OUTPUT
# User input prompting and handling
print "**********************************************************\n";
print "Question 3\n";
print "**********************************************************\n";

#opens the user input file previously assigned to variable to new variable or kills script.
open my $fh, '<', "chr2.txt" or die "Error! Cannot open file:$!\n";

#Opens and creates output file previously assigned to variable to new variable or kills script
#open(RESULTS, '>', $output)||die "Error! Cannot create output file:$!\n";

# FILE and DATA PROCESSING
############################################################################################################################

while (<$fh>) {

    if (/^>(\S+)/) {
        $count_total = ++$count_total;    # Plus one to count
        find_motifs($id, $seq) if $seq;   # Passing to subroutine
        $id = substr($1, 0, 15);          # Taking only the first 15 characters for the id
        $seq = '';
    }
    else {
        chomp;
        $seq .= $_;
    }
}

print "Total proteins: $count_total \n";
print "Proteins with both motifs: $motifboth_count \n";
print "Proteins with motif 1: $motif1_count \n";
print "Proteins with motif 2: $motif2_count \n";

exit;

######################################################################################################################################
# SUBROUTINES
#
# Takes passed variables from special array
# Finds the position of motif within seq
# Checks for motif 1 presence and if found, checks for motif 2. If not found, prints motif 1 results
# If no motif 1, checks for motif 2

sub find_motifs {
    my ($id, $seq) = @_;
    if ($seq =~ $motif1) {
        my $motif_position = index $seq, $1;
        my $motif = $1;
        if ($seq =~ $motif2) {
            $motif1_count    = ++$motif1_count;
            $motif2_count    = ++$motif2_count;
            $motifboth_count = ++$motifboth_count;
            print "$id, $motif_position, \n$motif \n";
        }
        else {
            $motif1_count = ++$motif1_count;
            print "$id, $motif_position,\n $motif\n\n";
        }
    }
    elsif ($seq =~ $motif2) {
        $motif2_count = ++$motif2_count;
    }
}

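What happens is that if a motif is found spanning the end of one line of data and the beginning of the next, it is returned with the line break still inside it. This way of slurping in the data has worked fine before.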

Example results:

ENSG00000119013, 6,  HEHGHHKMELPDYRQWKIEGTPLE (CORRECT!)

ENSG00000142327, 123,  HEVAHSWFGNAVTNATWEEMWLSE (CORRECT!) 

ENSG00000151694, 410, **AECAPNEFGAEHDPDGL**

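This is the problem. The motif matches, but the output shows the first half, a line break, and then the second half printed on the same line. (This is a symptom of the larger problem: getting rid of the line breaks!)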

Total proteins: 13653  
Proteins with both motifs: 1  
Proteins with motif 1: 12  
Proteins with motif 2: 22

I have tried a few different approaches at different places in the script, such as @seq =~ s/\r//g or s/\n//g.

Tags: perl, newline, fasta

1 Answer

Stack Overflow user

Accepted answer

Answered on 2013-05-02 03:47:27

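It is not clear from your description, but "the second half printed on the same line" sounds like your output is being overwritten because it has a carriage-return character at the end.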

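This will happen if you are running on a Linux system and you only chomp a line that originated on Windows: chomp removes the linefeed but leaves the carriage return in place.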

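You should replace your chomp with s/\s+\z//, which removes all trailing whitespace. Because both carriage return and linefeed count as "whitespace", it will remove every possible line-termination character.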
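The difference can be seen in a minimal sketch. It assumes the input uses Windows (CRLF) line endings and is read on a system where the default input record separator is "\n". The helper name strip_eol and the two sample lines are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every trailing whitespace character,
# including "\r" and "\n", instead of relying on chomp.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\s+\z//;
    return $line;
}

# Two made-up sequence lines with Windows (CRLF) line endings.
my @lines = ("HEVAHSWFGN\r\n", "AVTNATWEEMWLSE\r\n");

# chomp removes only "\n" (the default $/), so each "\r" stays in the data.
my $chomped = '';
for my $line (@lines) {
    my $copy = $line;
    chomp $copy;
    $chomped .= $copy;
}

# strip_eol removes both characters from each line.
my $clean = join '', map { strip_eol($_) } @lines;

printf "after chomp:     %d characters\n", length $chomped;
printf "after strip_eol: %d characters\n", length $clean;
```

In the question's loop, the equivalent change would be to write s/\s+\z//; in place of chomp; inside the else branch, since both operate on $_ by default.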

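By the way, you are misunderstanding the purpose of the ++ operator. It modifies the contents of the variable it is applied to, so all you need is ++$motif1_count and so on. Your code works as written because the operator also returns the value of the incremented variable, so $motif1_count = ++$motif1_count first increments the variable and then assigns it to itself.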
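As a short illustration of that point, the sketch below applies both forms to the same counter. Only the variable name is taken from the question; the rest is illustrative.

```perl
use strict;
use warnings;

my $motif1_count = 0;

# Pre-increment on its own already updates the variable in place.
++$motif1_count;

# The form from the question also works, but the assignment is redundant:
# ++ increments the variable and returns it, and the result is then
# assigned back to the same variable.
$motif1_count = ++$motif1_count;

print "motif1_count is now $motif1_count\n";
```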

Also, you use \D in your regexes. Are you aware that this matches any non-digit character? That seems far too vague a classification to be useful here.
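One consequence worth checking is that \D also matches a carriage return, so a motif can match straight across a leftover "\r". The sketch below compares the question's pattern with a stricter one. The [A-Z] character class is only an assumption about the sequence alphabet, and the sample sequence is made up.

```perl
use strict;
use warnings;

# The motif from the question, using \D (any character that is not a digit).
my $loose = qr/(HE(\D)(\D)H(\D{18})E)/;

# An assumed stricter alternative: only uppercase letters between the fixed
# residues. Whether [A-Z] is the right alphabet depends on the data.
my $strict = qr/(HE([A-Z])([A-Z])H([A-Z]{18})E)/;

# Made-up sequence with a stray carriage return left inside the motif region.
my $seq = "HEVAHSWFGN\rAVTNATWEEMWLE";

print "loose pattern matches:  ", ($seq =~ $loose  ? "yes" : "no"), "\n";
print "strict pattern matches: ", ($seq =~ $strict ? "yes" : "no"), "\n";
```

If the stricter pattern is the intended one, a sequence that still contains line-ending characters would fail to match instead of silently yielding a motif with a control character inside it.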

Votes: 1

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/16324925
