文章/答案/技术大牛

发布

社区首页 >问答首页 >从.fa文件到一个热编码基因序列的锈病程序

问从.fa文件到一个热编码基因序列的锈病程序
EN

Code Review用户

提问于 2021-12-28 12:42:15

回答 1查看 332关注 0票数 5

我想写一些在FASTA文件中读取的代码，其中一个热编码序列必然会保存到一个文件中。FASTA文件是生物信息学中常用的基于文本的文件格式。下面是一个例子：

>chr1
ACGTCTCGTGC
NNNNNNNNNNN
ACGTTGGGGGG
>chr2
AGGGTCTGGGT
NNNNNNNNNNN

它有一个带有>的头字符串，然后是一些遗传序列。这个序列中的每个字符都代表一个核苷酸。例如，"A“代表腺嘌呤，"N”代表未知。序列中可能的字符数量是有限的，但是有一些编码允许不同的表示。但这对我来说并不重要，因为我可能会事先过滤。

输出应该是一个文本文件，其中包含一个热编码序列和用于标识的标头。取决于如何选择编码，对于前三个核苷酸(ACG)来说，这可能是这样的：

>chr1
0,0,0,1
0,1,0,0
0,0,1,0

为了实现这一点，我一起编写了一些锈蚀代码，它可以工作，但是它看起来非常慢。也许在这一点上我也应该提到FASTA文件很快就会变得很大(很多GB)。我用一个~60 my的.fa文件尝试了我的代码(货构建-发行版)，花费了大约1m.30，而我在10分钟内编写的一些通用Python代码只需要40多个。

现在，我想知道为什么我的锈蚀代码比Python更慢。我是不是用一种惯用的生锈的方式写的？我不想将数据存储在类似于HDF5的东西中。我只想更好地理解锈病。特别是I/O繁重的文本处理任务，因为这些在生物信息学中很常见。

下面是我写的代码：

use std::fs::File;
use std::io::{BufReader, Write};
use std::io::prelude::*;
use std::env;
use std::fs::OpenOptions;

fn main()  -> std::io::Result<()>{

    // gather cmd arguments
    let args: Vec = env::args().collect(); 
    let filename: &String = &args[1];
    let outfile: &String = &args[2];

    // Open the file to read from
    let file = File::open(&filename).unwrap();

    // Create a buffered reader -> instead of reading everything to ram
    let reader = BufReader::new(file);
    let lines = reader.lines(); 

    // this allows to open a file in file append mode to write (or create it)
    let mut to_write_file = OpenOptions::new()
        .append(true)
        .create(true) // Optionally create 
        .open(outfile)
        .expect("Unable to open file");

    // iterate over every line
    for l in lines {

        // check if we get a string back (Result)
        if let Ok(line_string) = l {

            // check for header lines
             if line_string.contains("chr"){
                 println!("Processing: {}",line_string);
                 // write to file so we can keep track of chroms
                 writeln!(to_write_file, "{}",line_string).unwrap();
             }
             else {
                 // iterate over nucleotides
                 for character in line_string.chars() {
                     // one hot encode the nucleotide
                     let nucleotide: String = one_hot_encode_string(String::from(character));
                     writeln!(to_write_file,"{}", nucleotide)?;
                }

            }
        }
    }
    Ok(())
}

fn one_hot_encode_string(nucleotide: String) -> String {
    // One hot encodes a Nucleotide: String

    let encoding: String; // create return variable

    // OHE the nucleotide
    match nucleotide.as_str() {
    "A" => encoding = String::from("0,0,0,1"),
    "G" => encoding = String::from("0,0,1,0"),
    "C" => encoding = String::from("0,1,0,0"),
    "T" => encoding = String::from("1,0,0,0"),
    "N" => encoding = String::from("0,0,0,0"),
    _ => encoding = String::from("NA"),
    }
    encoding
    
}

她也是“天真”的python代码：

import sys
# one hot encode nulceotide
def ohe(nucleotide: str) -> str:
    if nucleotide == "A":
        return("0,0,0,1")
    if nucleotide == "G":
        return("0,0,1,0")
    if nucleotide == "T":
        return("0,1,0,0")
    if nucleotide == "C":
        return("1,0,0,0")
    if nucleotide == "N":
        return("0,0,0,0")
    else:
        return("NA")

def main() -> None:
    # get arguments
    input: str = sys.argv[1]
    output: str = sys.argv[2]

    # open in file
    with open(input) as in_f:
        # open out file
        with open(output, 'a') as out_f:
            for line in in_f:
                # get header
                if line.startswith(">"):
                    out_f.write(f"{line.rstrip()}\n")
                else:
                    # iterate over every nucleotide
                    for character in line.rstrip():
                        # one hot encode character
                        # print(ohe(character))
                        out_f.write(f"{ohe(character)}\n")

if __name__ == "__main__":
    main()

python

rust

bioinformatics

回答 1

Code Review用户

回答已采纳

发布于 2021-12-28 13:17:43

您的程序并不完全等价；一个查看一行是否以>开头，另一个在每一行中查找chr。这在总体上并没有多大的区别，但是游戏区域应该是水平的，不是吗？

第二，正如我在上面提到的，您可能应该尽可能避免String::from调用。因为这些一次热编码是常量的，所以您可以返回一个&'static str。

第三，这才是真正不同的地方，您的Python程序正在隐式地使用缓冲写，而不是每次直接写入输出文件。

将to_write_file包装在BufReader::new中会产生很大的不同，请参见下面的内容。u60.fasta是一个任意60兆字节的FASTA文件。

原始

(original) $ time cargo run --release -- u60.fasta /dev/null
   Compiling cr272415 v0.1.0 (/Users/akx/build/cr272415)
    Finished release [optimized] target(s) in 0.35s
     Running `target/release/cr272415 u60.fasta /dev/null`
Processing: >NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly

________________________________________________________
Executed in    6.66 secs    fish           external
   usr time    4.74 secs  113.00 micros    4.74 secs
   sys time    2.44 secs  799.00 micros    2.44 secs

优化

~/b/cr272415 (master) $ time cargo run --release -- u60.fasta /dev/null
   Compiling cr272415 v0.1.0 (/Users/akx/build/cr272415)
    Finished release [optimized] target(s) in 0.38s
     Running `target/release/cr272415 u60.fasta /dev/null`
Processing: >NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly

________________________________________________________
Executed in  650.30 millis    fish           external
   usr time    1.21 secs    147.00 micros    1.21 secs
   sys time    0.17 secs    939.00 micros    0.16 secs

我不是一个经验丰富的Rustacean，所以可能有人可以改进这一点，但optimized版本如下。一些更改是由cargo clippy建议的(您也应该注意到:-)。

use std::env;
use std::fs::File;
use std::fs::OpenOptions;
use std::io::{BufRead, BufReader, BufWriter, Write};

fn main() -> std::io::Result<()> {
    // gather cmd arguments
    let args: Vec = env::args().collect();
    let filename: &String = &args[1];
    let outfile: &String = &args[2];

    // Open the file to read from
    let file = File::open(&filename).unwrap();

    // Create a buffered reader -> instead of reading everything to ram
    let reader = BufReader::new(file);
    let lines = reader.lines();

    // this allows to open a file in file append mode to write (or create it)
    let mut to_write_file = BufWriter::new(
        OpenOptions::new()
            .append(true)
            .create(true) // Optionally create
            .open(outfile)
            .expect("Unable to open file"),
    );

    // iterate over every line
    for line_string in lines.flatten() {
        // check for header lines
        if line_string.starts_with('>') {
            println!("Processing: {}", line_string);
            // write to file so we can keep track of chroms
            writeln!(to_write_file, "{}", line_string).unwrap();
        } else {
            // iterate over nucleotides
            for character in line_string.chars() {
                // one hot encode the nucleotide
                let nucleotide = one_hot_encode_string(character);
                writeln!(to_write_file, "{}", nucleotide)?;
            }
        }
    }
    Ok(())
}

fn one_hot_encode_string(nucleotide: char) -> &'static str {
    // One hot encodes a Nucleotide character
    match nucleotide {
        'A' => "0,0,0,1",
        'G' => "0,0,1,0",
        'C' => "0,1,0,0",
        'T' => "1,0,0,0",
        'N' => "0,0,0,0",
        _ => "NA",
    }
}

编辑⚡️

我脑子里的一些背景线索还意识到了让这件事更快的另一个方法:不要使用writeln!。不需要花时间解析格式字符串和分配内存，只需添加换行符。如果将\n预编到输出中，则只需使用BufWriter.write_all() --即将上面程序的末尾更改为

                let nucleotide = one_hot_encode_string(character);
                to_write_file.write_all(nucleotide.as_ref())?;
            }
        }
    }
    Ok(())
}

fn one_hot_encode_string(nucleotide: char) -> &'static str {
    // One hot encodes a Nucleotide character
    match nucleotide {
        'A' => "0,0,0,1\n",
        'G' => "0,0,1,0\n",
        'C' => "0,1,0,0\n",
        'T' => "1,0,0,0\n",
        'N' => "0,0,0,0\n",
        _ => "NA\n",
    }
}

与hyperfine一起运行的基准测试使用了上千个单词，就像他们说的：

~/b/cr272415 (master) $ hyperfine --warmup 3 './original/release/cr272415 u60.fasta /dev/null' './opt1/release/cr272415 u60.fasta /dev/null' './opt2/release/cr272415 u60.fasta /dev/null'
Benchmark 1: ./original/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):      6.401 s ±  0.441 s    [User: 4.004 s, System: 2.385 s]
  Range (min … max):    5.978 s …  7.292 s    10 runs

Benchmark 2: ./opt1/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):     170.8 ms ±   4.2 ms    [User: 167.0 ms, System: 2.8 ms]
  Range (min … max):   164.5 ms … 178.4 ms    17 runs

Benchmark 3: ./opt2/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):      59.2 ms ±   2.0 ms    [User: 55.8 ms, System: 2.5 ms]
  Range (min … max):    56.5 ms …  65.6 ms    45 runs

Summary
  './opt2/release/cr272415 u60.fasta /dev/null' ran
    2.89 ± 0.12 times faster than './opt1/release/cr272415 u60.fasta /dev/null'
  108.21 ± 8.29 times faster than './original/release/cr272415 u60.fasta /dev/null'

编辑⚡️x 2

我还意识到了一件事(尽管这可能是CPU缓存--在一个更大的程序中效率低下)：内联整个带加号的写调用，并将常量标记为字节字符串，从而完全避免.as_ref()。超级精细说这是1.32 ± 0.11 times faster than opt2 (上面的编辑)。

macro_rules! wf {
    ($s:expr) => {
        to_write_file.write_all($s)
    };
}
for character in line_string.chars() {
    match character {
        'A' => wf!(b"0,0,0,1\n"),
        'G' => wf!(b"0,0,1,0\n"),
        'C' => wf!(b"0,1,0,0\n"),
        'T' => wf!(b"1,0,0,0\n"),
        'N' => wf!(b"0,0,0,0\n"),
        _ => wf!(b"NA\n"),
    }
    .unwrap();
}

~/b/cr272415 (vecw *) $ ./bench.sh
Benchmark 1: ./targets/original/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):      6.294 s ±  0.182 s    [User: 3.942 s, System: 2.340 s]
  Range (min … max):    6.060 s …  6.592 s    10 runs

Benchmark 2: ./targets/opt1/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):     176.7 ms ±   6.9 ms    [User: 172.1 ms, System: 3.2 ms]
  Range (min … max):   166.1 ms … 188.5 ms    16 runs

Benchmark 3: ./targets/opt2/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):      61.5 ms ±   2.9 ms    [User: 57.8 ms, System: 2.6 ms]
  Range (min … max):    58.4 ms …  73.8 ms    43 runs

Benchmark 4: ./targets/vecw/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):      46.0 ms ±   2.9 ms    [User: 42.6 ms, System: 2.5 ms]
  Range (min … max):    42.2 ms …  55.7 ms    60 runs

Summary
  './targets/vecw/release/cr272415 u60.fasta /dev/null' ran
    1.33 ± 0.11 times faster than './targets/opt2/release/cr272415 u60.fasta /dev/null'
    3.84 ± 0.29 times faster than './targets/opt1/release/cr272415 u60.fasta /dev/null'
  136.71 ± 9.55 times faster than './targets/original/release/cr272415 u60.fasta /dev/null'

票数 6

页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://codereview.stackexchange.com/questions/272415

复制

相似问题

问从.fa文件到一个热编码基因序列的锈病程序
EN

回答 1

Code Review用户

原始

优化

编辑⚡️

编辑⚡️x 2

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问从.fa文件到一个热编码基因序列的锈病程序EN

回答 1

Code Review用户

原始

优化

编辑⚡️

编辑⚡️x 2

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问从.fa文件到一个热编码基因序列的锈病程序
EN