首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >从.fa文件到一个热编码基因序列的锈病程序

从.fa文件到一个热编码基因序列的锈病程序
EN

Code Review用户
提问于 2021-12-28 12:42:15
回答 1查看 332关注 0票数 5

我想写一些在FASTA文件中读取的代码,其中一个热编码序列必然会保存到一个文件中。FASTA文件是生物信息学中常用的基于文本的文件格式。下面是一个例子:

代码语言:javascript
复制
>chr1
ACGTCTCGTGC
NNNNNNNNNNN
ACGTTGGGGGG
>chr2
AGGGTCTGGGT
NNNNNNNNNNN

它有一个带有>的头字符串,然后是一些遗传序列。这个序列中的每个字符都代表一个核苷酸。例如,"A“代表腺嘌呤,"N”代表未知。序列中可能的字符数量是有限的,但是有一些编码允许不同的表示。但这对我来说并不重要,因为我可能会事先过滤。

输出应该是一个文本文件,其中包含一个热编码序列和用于标识的标头。取决于如何选择编码,对于前三个核苷酸(ACG)来说,这可能是这样的:

代码语言:javascript
复制
>chr1
0,0,0,1
0,1,0,0
0,0,1,0

为了实现这一点,我一起编写了一些锈蚀代码,它可以工作,但是它看起来非常慢。也许在这一点上我也应该提到FASTA文件很快就会变得很大(很多GB)。我用一个~60 my的.fa文件尝试了我的代码(货构建-发行版),花费了大约1m.30,而我在10分钟内编写的一些通用Python代码只需要40多个。

现在,我想知道为什么我的锈蚀代码比Python更慢。我是不是用一种惯用的生锈的方式写的?我不想将数据存储在类似于HDF5的东西中。我只想更好地理解锈病。特别是I/O繁重的文本处理任务,因为这些在生物信息学中很常见。

下面是我写的代码:

代码语言:javascript
复制
use std::fs::File;
use std::io::{BufReader, Write};
use std::io::prelude::*;
use std::env;
use std::fs::OpenOptions;

fn main()  -> std::io::Result<()>{

    // gather cmd arguments
    let args: Vec = env::args().collect(); 
    let filename: &String = &args[1];
    let outfile: &String = &args[2];

    // Open the file to read from
    let file = File::open(&filename).unwrap();

    // Create a buffered reader -> instead of reading everything to ram
    let reader = BufReader::new(file);
    let lines = reader.lines(); 

    // this allows to open a file in file append mode to write (or create it)
    let mut to_write_file = OpenOptions::new()
        .append(true)
        .create(true) // Optionally create 
        .open(outfile)
        .expect("Unable to open file");

    // iterate over every line
    for l in lines {

        // check if we get a string back (Result)
        if let Ok(line_string) = l {

            // check for header lines
             if line_string.contains("chr"){
                 println!("Processing: {}",line_string);
                 // write to file so we can keep track of chroms
                 writeln!(to_write_file, "{}",line_string).unwrap();
             }
             else {
                 // iterate over nucleotides
                 for character in line_string.chars() {
                     // one hot encode the nucleotide
                     let nucleotide: String = one_hot_encode_string(String::from(character));
                     writeln!(to_write_file,"{}", nucleotide)?;
                }

            }
        }
    }
    Ok(())
}

fn one_hot_encode_string(nucleotide: String) -> String {
    // One hot encodes a Nucleotide: String

    let encoding: String; // create return variable

    // OHE the nucleotide
    match nucleotide.as_str() {
    "A" => encoding = String::from("0,0,0,1"),
    "G" => encoding = String::from("0,0,1,0"),
    "C" => encoding = String::from("0,1,0,0"),
    "T" => encoding = String::from("1,0,0,0"),
    "N" => encoding = String::from("0,0,0,0"),
    _ => encoding = String::from("NA"),
    }
    encoding
    
}

她也是“天真”的python代码:

代码语言:javascript
复制
import sys
# one hot encode nulceotide
def ohe(nucleotide: str) -> str:
    if nucleotide == "A":
        return("0,0,0,1")
    if nucleotide == "G":
        return("0,0,1,0")
    if nucleotide == "T":
        return("0,1,0,0")
    if nucleotide == "C":
        return("1,0,0,0")
    if nucleotide == "N":
        return("0,0,0,0")
    else:
        return("NA")

def main() -> None:
    # get arguments
    input: str = sys.argv[1]
    output: str = sys.argv[2]

    # open in file
    with open(input) as in_f:
        # open out file
        with open(output, 'a') as out_f:
            for line in in_f:
                # get header
                if line.startswith(">"):
                    out_f.write(f"{line.rstrip()}\n")
                else:
                    # iterate over every nucleotide
                    for character in line.rstrip():
                        # one hot encode character
                        # print(ohe(character))
                        out_f.write(f"{ohe(character)}\n")

if __name__ == "__main__":
    main()
EN

回答 1

Code Review用户

回答已采纳

发布于 2021-12-28 13:17:43

您的程序并不完全等价;一个查看一行是否以>开头,另一个在每一行中查找chr。这在总体上并没有多大的区别,但是游戏区域应该是水平的,不是吗?

第二,正如我在上面提到的,您可能应该尽可能避免String::from调用。因为这些一次热编码是常量的,所以您可以返回一个&'static str

第三,这才是真正不同的地方,您的Python程序正在隐式地使用缓冲写,而不是每次直接写入输出文件。

to_write_file包装在BufReader::new中会产生很大的不同,请参见下面的内容。u60.fasta是一个任意60兆字节的FASTA文件。

原始

代码语言:javascript
复制
(original) $ time cargo run --release -- u60.fasta /dev/null
   Compiling cr272415 v0.1.0 (/Users/akx/build/cr272415)
    Finished release [optimized] target(s) in 0.35s
     Running `target/release/cr272415 u60.fasta /dev/null`
Processing: >NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly

________________________________________________________
Executed in    6.66 secs    fish           external
   usr time    4.74 secs  113.00 micros    4.74 secs
   sys time    2.44 secs  799.00 micros    2.44 secs

优化

代码语言:javascript
复制
~/b/cr272415 (master) $ time cargo run --release -- u60.fasta /dev/null
   Compiling cr272415 v0.1.0 (/Users/akx/build/cr272415)
    Finished release [optimized] target(s) in 0.38s
     Running `target/release/cr272415 u60.fasta /dev/null`
Processing: >NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly

________________________________________________________
Executed in  650.30 millis    fish           external
   usr time    1.21 secs    147.00 micros    1.21 secs
   sys time    0.17 secs    939.00 micros    0.16 secs

我不是一个经验丰富的Rustacean,所以可能有人可以改进这一点,但optimized版本如下。一些更改是由cargo clippy建议的(您也应该注意到:-)。

代码语言:javascript
复制
use std::env;
use std::fs::File;
use std::fs::OpenOptions;
use std::io::{BufRead, BufReader, BufWriter, Write};

fn main() -> std::io::Result<()> {
    // gather cmd arguments
    let args: Vec = env::args().collect();
    let filename: &String = &args[1];
    let outfile: &String = &args[2];

    // Open the file to read from
    let file = File::open(&filename).unwrap();

    // Create a buffered reader -> instead of reading everything to ram
    let reader = BufReader::new(file);
    let lines = reader.lines();

    // this allows to open a file in file append mode to write (or create it)
    let mut to_write_file = BufWriter::new(
        OpenOptions::new()
            .append(true)
            .create(true) // Optionally create
            .open(outfile)
            .expect("Unable to open file"),
    );

    // iterate over every line
    for line_string in lines.flatten() {
        // check for header lines
        if line_string.starts_with('>') {
            println!("Processing: {}", line_string);
            // write to file so we can keep track of chroms
            writeln!(to_write_file, "{}", line_string).unwrap();
        } else {
            // iterate over nucleotides
            for character in line_string.chars() {
                // one hot encode the nucleotide
                let nucleotide = one_hot_encode_string(character);
                writeln!(to_write_file, "{}", nucleotide)?;
            }
        }
    }
    Ok(())
}

fn one_hot_encode_string(nucleotide: char) -> &'static str {
    // One hot encodes a Nucleotide character
    match nucleotide {
        'A' => "0,0,0,1",
        'G' => "0,0,1,0",
        'C' => "0,1,0,0",
        'T' => "1,0,0,0",
        'N' => "0,0,0,0",
        _ => "NA",
    }
}

编辑⚡️

我脑子里的一些背景线索还意识到了让这件事更快的另一个方法:不要使用writeln!。不需要花时间解析格式字符串和分配内存,只需添加换行符。如果将\n预编到输出中,则只需使用BufWriter.write_all() --即将上面程序的末尾更改为

代码语言:javascript
复制
                let nucleotide = one_hot_encode_string(character);
                to_write_file.write_all(nucleotide.as_ref())?;
            }
        }
    }
    Ok(())
}

fn one_hot_encode_string(nucleotide: char) -> &'static str {
    // One hot encodes a Nucleotide character
    match nucleotide {
        'A' => "0,0,0,1\n",
        'G' => "0,0,1,0\n",
        'C' => "0,1,0,0\n",
        'T' => "1,0,0,0\n",
        'N' => "0,0,0,0\n",
        _ => "NA\n",
    }
}

hyperfine一起运行的基准测试使用了上千个单词,就像他们说的:

代码语言:javascript
复制
~/b/cr272415 (master) $ hyperfine --warmup 3 './original/release/cr272415 u60.fasta /dev/null' './opt1/release/cr272415 u60.fasta /dev/null' './opt2/release/cr272415 u60.fasta /dev/null'
Benchmark 1: ./original/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):      6.401 s ±  0.441 s    [User: 4.004 s, System: 2.385 s]
  Range (min … max):    5.978 s …  7.292 s    10 runs

Benchmark 2: ./opt1/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):     170.8 ms ±   4.2 ms    [User: 167.0 ms, System: 2.8 ms]
  Range (min … max):   164.5 ms … 178.4 ms    17 runs

Benchmark 3: ./opt2/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):      59.2 ms ±   2.0 ms    [User: 55.8 ms, System: 2.5 ms]
  Range (min … max):    56.5 ms …  65.6 ms    45 runs

Summary
  './opt2/release/cr272415 u60.fasta /dev/null' ran
    2.89 ± 0.12 times faster than './opt1/release/cr272415 u60.fasta /dev/null'
  108.21 ± 8.29 times faster than './original/release/cr272415 u60.fasta /dev/null'

编辑⚡️x 2

我还意识到了一件事(尽管这可能是CPU缓存--在一个更大的程序中效率低下):内联整个带加号的写调用,并将常量标记为字节字符串,从而完全避免.as_ref()。超级精细说这是1.32 ± 0.11 times faster than opt2 (上面的编辑)。

代码语言:javascript
复制
macro_rules! wf {
    ($s:expr) => {
        to_write_file.write_all($s)
    };
}
for character in line_string.chars() {
    match character {
        'A' => wf!(b"0,0,0,1\n"),
        'G' => wf!(b"0,0,1,0\n"),
        'C' => wf!(b"0,1,0,0\n"),
        'T' => wf!(b"1,0,0,0\n"),
        'N' => wf!(b"0,0,0,0\n"),
        _ => wf!(b"NA\n"),
    }
    .unwrap();
}
代码语言:javascript
复制
~/b/cr272415 (vecw *) $ ./bench.sh
Benchmark 1: ./targets/original/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):      6.294 s ±  0.182 s    [User: 3.942 s, System: 2.340 s]
  Range (min … max):    6.060 s …  6.592 s    10 runs

Benchmark 2: ./targets/opt1/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):     176.7 ms ±   6.9 ms    [User: 172.1 ms, System: 3.2 ms]
  Range (min … max):   166.1 ms … 188.5 ms    16 runs

Benchmark 3: ./targets/opt2/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):      61.5 ms ±   2.9 ms    [User: 57.8 ms, System: 2.6 ms]
  Range (min … max):    58.4 ms …  73.8 ms    43 runs

Benchmark 4: ./targets/vecw/release/cr272415 u60.fasta /dev/null
  Time (mean ± σ):      46.0 ms ±   2.9 ms    [User: 42.6 ms, System: 2.5 ms]
  Range (min … max):    42.2 ms …  55.7 ms    60 runs

Summary
  './targets/vecw/release/cr272415 u60.fasta /dev/null' ran
    1.33 ± 0.11 times faster than './targets/opt2/release/cr272415 u60.fasta /dev/null'
    3.84 ± 0.29 times faster than './targets/opt1/release/cr272415 u60.fasta /dev/null'
  136.71 ± 9.55 times faster than './targets/original/release/cr272415 u60.fasta /dev/null'
票数 6
EN
页面原文内容由Code Review提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://codereview.stackexchange.com/questions/272415

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档