文章/答案/技术大牛

发布

社区首页 >问答首页 >如何使用perl删除gff3文件中的重叠区域？

问如何使用perl删除gff3文件中的重叠区域？
EN

Stack Overflow用户

提问于 2017-11-07 12:37:10

回答 1查看 145关注 0票数 2

我试图从9列gff3文件中删除重叠区域。

**Input file:**
scaffold591 Source gene 3322458 3376057 0.41 - . ID=g24007
scaffold591 Source transcript 3322458 3376057 0.41 - . ID=g24007.t1;Parent=g24007
scaffold591 Source transcription_end_site 3322458 3322458 . - . Parent=g24007.t1
scaffold591 Source gene 3322500 3346055 0.41 - . ID=g24007
scaffold591 Source transcript 3322500 3346055 0.41 - . ID=g24007.t1;Parent=g24007
scaffold591 Source transcription_end_site 3322500 3322500 . - . Parent=g24007.t1
scaffold591 Source gene 3377307 3513095 0.46 + . ID=g24008
scaffold591 Source transcript 3377307 3513095 0.41 + . ID=g24008.t1;Parent=g24008
scaffold591 Source transcription_end_site 3377307 3377307 . + . Parent=g24008.t1

在这里，我只试图比较具有相同链的“基因”的行，即“或+”(第7列)。

例如，第1行和第4行。

scaffold591 Source gene 3322458 3376057 0.41 - . ID=g24007
scaffold591 Source gene 3322500 3346055 0.41 - . ID=g24007

它们是来自同一个支架和相同的“链”(第7列)的“基因”。row4坐标(列4和5)位于第1行坐标的范围内。在这种情况下，我的代码应该删除重叠的第4行，并保留具有较大范围的row1。

**My expected output:**
scaffold591 Source gene 3322458 3376057 0.41 - . ID=g24007
scaffold591 Source transcript 3322458 3376057 0.41 - . ID=g24007.t1;Parent=g24007
scaffold591 Source transcription_end_site 3322458 3322458 . - . Parent=g24007.t1
scaffold591 Source gene 3377307 3513095 0.46 + . ID=g24008
scaffold591 Source transcript 3377307 3513095 0.41 + . ID=g24007.t1;Parent=g24008
scaffold591 Source transcription_end_site 3377307 3377307 . + . Parent=g24008.t1

我的代码两次打印row1及其下面的行

My code:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util qw{ max };

open (IN, "<scaffold_sample.txt");

my $previous_seqid ="";
my $previous_strand;
my $previous_start;
my $previous_end;
my @gff;
my @tmp;
while (<IN>)
 {
    chomp;
    my ($seqid,$source, $region, $start, $end, $score, $strand, $frame, $attribute) = split ("\t",$_);  
    @gff = ($seqid,$source, $region, $start, $end, $score, $strand, $frame, $attribute);

    if ($seqid eq $previous_seqid && $strand eq $previous_strand && $region eq 'gene')
    {
        if($start < $previous_end && $end < $previous_end)
        {
            @gff = @tmp;
            $previous_seqid = $gff[0];
            $previous_strand = $gff[6];
            $previous_start = $gff[3];
            $previous_end = $gff[4];
            print join "\t",@gff;
            print "\n";
        }
        else
        {
            @tmp = @gff;
        }

    }
    else
    {
        @tmp = ($seqid,$source, $region, $start, $end, $score, $strand, $frame, $attribute);
        $previous_seqid = $seqid;
        $previous_strand = $strand;
        $previous_start = $start;
        $previous_end = $end;
        print join "\t",@tmp;
        print "\n";
    }

}

请帮帮忙。

perl

回答 1

Stack Overflow用户

回答已采纳

发布于 2017-11-07 16:36:58

这是一个有趣的问题。您想要去dup行，但是(我认为)如果您在文件的后面找到一个更大的范围，您希望在找到原始的、较小的范围的位置输出这个大范围。

老实说，我没有看你的解决方案，而是从零开始。

我用了两种数据结构。%line_data包含我们处理过的行的详细信息。它是一个多层次的散列，在seqid、strand和region上进行键控.如果一个新记录与散列中的值不匹配，那么我们第一次是seqid、strand和region的组合。如果一个新的记录确实匹配，那么我们以前见过这种组合，我们计算出其中哪一个有最大的范围，并在必要时覆盖。

然后是@lines，它包含我们将要输出的数据。它包含对%line_data中散列的引用。当发现更大的范围时，需要进行一些家务活来保持最新的状态。

所以这就是我最后的下场。它为您的输入提供了正确的输出，但我不知道它是否会中断更多不同的输入。

#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';

my @lines;
my %line_data;

# Column names (for use as hash keys)    
my @cols = qw[seqid source region start end score strand frame attribute];

# Store the input data in DATA for easier testing
while (<DATA>) {
  my %record;
  # Split a record into a hash
  @record{@cols} = split;

  # If this key combination exists...
  if (exists $line_data{$record{seqid}}{$record{strand}}{$record{region}}) {
    # Get the previous record with these keys...
    my $prev = $line_data{$record{seqid}}{$record{strand}}{$record{region}};
    # See if the new range is larger...
    if ($record{start} > $prev->{start} and $record{end} > $prev->{end}) {
      # If so, overwrite it.
      $line_data{$record{seqid}}{$record{strand}}{$record{region}} = \%record;
      $lines[$prev->{pos}] = \%record;
      $record{post} = $prev->{pos};
    }
  } else {
    # We haven't seen this key combination before.
    # So just store it.
    $line_data{$record{seqid}}{$record{strand}}{$record{region}} = \%record;
    push @lines, \%record;
    $record{pos} = $#lines;
  }
}

# Having processed the data, we walk the @lines array,
# de-referencing the hash and joining the values with a space.
foreach (@lines) {
  say join ' ', @$_{@cols};
}

__DATA__
scaffold591 Source gene 3322458 3376057 0.41 - . ID=g24007
scaffold591 Source transcript 3322458 3376057 0.41 - . ID=g24007.t1;Parent=g24007
scaffold591 Source transcription_end_site 3322458 3322458 . - . Parent=g24007.t1
scaffold591 Source gene 3322500 3346055 0.41 - . ID=g24007
scaffold591 Source transcript 3322500 3346055 0.41 - . ID=g24007.t1;Parent=g24007
scaffold591 Source transcription_end_site 3322500 3322500 . - . Parent=g24007.t1
scaffold591 Source gene 3377307 3513095 0.46 + . ID=g24008
scaffold591 Source transcript 3377307 3513095 0.41 + . ID=g24008.t1;Parent=g24008
scaffold591 Source transcription_end_site 3377307 3377307 . + . Parent=g24008.t1

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/47158204

复制

相似问题

问如何使用perl删除gff3文件中的重叠区域？
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何使用perl删除gff3文件中的重叠区域？EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何使用perl删除gff3文件中的重叠区域？
EN