我想写一些在FASTA文件中读取的代码,其中一个热编码序列必然会保存到一个文件中。FASTA文件是生物信息学中常用的基于文本的文件格式。下面是一个例子:
>chr1
ACGTCTCGTGC
NNNNNNNNNNN
ACGTTGGGGGG
>chr2
AGGGTCTGGGT
NNNNNNNNNNN它有一个带有>的头字符串,然后是一些遗传序列。这个序列中的每个字符都代表一个核苷酸。例如,"A“代表腺嘌呤,"N”代表未知。序列中可能的字符数量是有限的,但是有一些编码允许不同的表示。但这对我来说并不重要,因为我可能会事先过滤。
输出应该是一个文本文件,其中包含一个热编码序列和用于标识的标头。取决于如何选择编码,对于前三个核苷酸(ACG)来说,这可能是这样的:
>chr1
0,0,0,1
0,1,0,0
0,0,1,0为了实现这一点,我一起编写了一些锈蚀代码,它可以工作,但是它看起来非常慢。也许在这一点上我也应该提到FASTA文件很快就会变得很大(很多GB)。我用一个~60 my的.fa文件尝试了我的代码(货构建-发行版),花费了大约1m.30,而我在10分钟内编写的一些通用Python代码只需要40多个。
现在,我想知道为什么我的锈蚀代码比Python更慢。我是不是用一种惯用的生锈的方式写的?我不想将数据存储在类似于HDF5的东西中。我只想更好地理解锈病。特别是I/O繁重的文本处理任务,因为这些在生物信息学中很常见。
下面是我写的代码:
use std::fs::File;
use std::io::{BufReader, Write};
use std::io::prelude::*;
use std::env;
use std::fs::OpenOptions;
fn main() -> std::io::Result<()>{
// gather cmd arguments
let args: Vec = env::args().collect();
let filename: &String = &args[1];
let outfile: &String = &args[2];
// Open the file to read from
let file = File::open(&filename).unwrap();
// Create a buffered reader -> instead of reading everything to ram
let reader = BufReader::new(file);
let lines = reader.lines();
// this allows to open a file in file append mode to write (or create it)
let mut to_write_file = OpenOptions::new()
.append(true)
.create(true) // Optionally create
.open(outfile)
.expect("Unable to open file");
// iterate over every line
for l in lines {
// check if we get a string back (Result)
if let Ok(line_string) = l {
// check for header lines
if line_string.contains("chr"){
println!("Processing: {}",line_string);
// write to file so we can keep track of chroms
writeln!(to_write_file, "{}",line_string).unwrap();
}
else {
// iterate over nucleotides
for character in line_string.chars() {
// one hot encode the nucleotide
let nucleotide: String = one_hot_encode_string(String::from(character));
writeln!(to_write_file,"{}", nucleotide)?;
}
}
}
}
Ok(())
}
fn one_hot_encode_string(nucleotide: String) -> String {
// One hot encodes a Nucleotide: String
let encoding: String; // create return variable
// OHE the nucleotide
match nucleotide.as_str() {
"A" => encoding = String::from("0,0,0,1"),
"G" => encoding = String::from("0,0,1,0"),
"C" => encoding = String::from("0,1,0,0"),
"T" => encoding = String::from("1,0,0,0"),
"N" => encoding = String::from("0,0,0,0"),
_ => encoding = String::from("NA"),
}
encoding
}她也是“天真”的python代码:
import sys
# one hot encode nulceotide
def ohe(nucleotide: str) -> str:
if nucleotide == "A":
return("0,0,0,1")
if nucleotide == "G":
return("0,0,1,0")
if nucleotide == "T":
return("0,1,0,0")
if nucleotide == "C":
return("1,0,0,0")
if nucleotide == "N":
return("0,0,0,0")
else:
return("NA")
def main() -> None:
# get arguments
input: str = sys.argv[1]
output: str = sys.argv[2]
# open in file
with open(input) as in_f:
# open out file
with open(output, 'a') as out_f:
for line in in_f:
# get header
if line.startswith(">"):
out_f.write(f"{line.rstrip()}\n")
else:
# iterate over every nucleotide
for character in line.rstrip():
# one hot encode character
# print(ohe(character))
out_f.write(f"{ohe(character)}\n")
if __name__ == "__main__":
main()发布于 2021-12-28 13:17:43
您的程序并不完全等价;一个查看一行是否以>开头,另一个在每一行中查找chr。这在总体上并没有多大的区别,但是游戏区域应该是水平的,不是吗?
第二,正如我在上面提到的,您可能应该尽可能避免String::from调用。因为这些一次热编码是常量的,所以您可以返回一个&'static str。
第三,这才是真正不同的地方,您的Python程序正在隐式地使用缓冲写,而不是每次直接写入输出文件。
将to_write_file包装在BufReader::new中会产生很大的不同,请参见下面的内容。u60.fasta是一个任意60兆字节的FASTA文件。
(original) $ time cargo run --release -- u60.fasta /dev/null
Compiling cr272415 v0.1.0 (/Users/akx/build/cr272415)
Finished release [optimized] target(s) in 0.35s
Running `target/release/cr272415 u60.fasta /dev/null`
Processing: >NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly
________________________________________________________
Executed in 6.66 secs fish external
usr time 4.74 secs 113.00 micros 4.74 secs
sys time 2.44 secs 799.00 micros 2.44 secs~/b/cr272415 (master) $ time cargo run --release -- u60.fasta /dev/null
Compiling cr272415 v0.1.0 (/Users/akx/build/cr272415)
Finished release [optimized] target(s) in 0.38s
Running `target/release/cr272415 u60.fasta /dev/null`
Processing: >NC_000001.11 Homo sapiens chromosome 1, GRCh38.p13 Primary Assembly
________________________________________________________
Executed in 650.30 millis fish external
usr time 1.21 secs 147.00 micros 1.21 secs
sys time 0.17 secs 939.00 micros 0.16 secs我不是一个经验丰富的Rustacean,所以可能有人可以改进这一点,但optimized版本如下。一些更改是由cargo clippy建议的(您也应该注意到:-)。
use std::env;
use std::fs::File;
use std::fs::OpenOptions;
use std::io::{BufRead, BufReader, BufWriter, Write};
fn main() -> std::io::Result<()> {
// gather cmd arguments
let args: Vec = env::args().collect();
let filename: &String = &args[1];
let outfile: &String = &args[2];
// Open the file to read from
let file = File::open(&filename).unwrap();
// Create a buffered reader -> instead of reading everything to ram
let reader = BufReader::new(file);
let lines = reader.lines();
// this allows to open a file in file append mode to write (or create it)
let mut to_write_file = BufWriter::new(
OpenOptions::new()
.append(true)
.create(true) // Optionally create
.open(outfile)
.expect("Unable to open file"),
);
// iterate over every line
for line_string in lines.flatten() {
// check for header lines
if line_string.starts_with('>') {
println!("Processing: {}", line_string);
// write to file so we can keep track of chroms
writeln!(to_write_file, "{}", line_string).unwrap();
} else {
// iterate over nucleotides
for character in line_string.chars() {
// one hot encode the nucleotide
let nucleotide = one_hot_encode_string(character);
writeln!(to_write_file, "{}", nucleotide)?;
}
}
}
Ok(())
}
fn one_hot_encode_string(nucleotide: char) -> &'static str {
// One hot encodes a Nucleotide character
match nucleotide {
'A' => "0,0,0,1",
'G' => "0,0,1,0",
'C' => "0,1,0,0",
'T' => "1,0,0,0",
'N' => "0,0,0,0",
_ => "NA",
}
}我脑子里的一些背景线索还意识到了让这件事更快的另一个方法:不要使用writeln!。不需要花时间解析格式字符串和分配内存,只需添加换行符。如果将\n预编到输出中,则只需使用BufWriter.write_all() --即将上面程序的末尾更改为
let nucleotide = one_hot_encode_string(character);
to_write_file.write_all(nucleotide.as_ref())?;
}
}
}
Ok(())
}
fn one_hot_encode_string(nucleotide: char) -> &'static str {
// One hot encodes a Nucleotide character
match nucleotide {
'A' => "0,0,0,1\n",
'G' => "0,0,1,0\n",
'C' => "0,1,0,0\n",
'T' => "1,0,0,0\n",
'N' => "0,0,0,0\n",
_ => "NA\n",
}
}与hyperfine一起运行的基准测试使用了上千个单词,就像他们说的:
~/b/cr272415 (master) $ hyperfine --warmup 3 './original/release/cr272415 u60.fasta /dev/null' './opt1/release/cr272415 u60.fasta /dev/null' './opt2/release/cr272415 u60.fasta /dev/null'
Benchmark 1: ./original/release/cr272415 u60.fasta /dev/null
Time (mean ± σ): 6.401 s ± 0.441 s [User: 4.004 s, System: 2.385 s]
Range (min … max): 5.978 s … 7.292 s 10 runs
Benchmark 2: ./opt1/release/cr272415 u60.fasta /dev/null
Time (mean ± σ): 170.8 ms ± 4.2 ms [User: 167.0 ms, System: 2.8 ms]
Range (min … max): 164.5 ms … 178.4 ms 17 runs
Benchmark 3: ./opt2/release/cr272415 u60.fasta /dev/null
Time (mean ± σ): 59.2 ms ± 2.0 ms [User: 55.8 ms, System: 2.5 ms]
Range (min … max): 56.5 ms … 65.6 ms 45 runs
Summary
'./opt2/release/cr272415 u60.fasta /dev/null' ran
2.89 ± 0.12 times faster than './opt1/release/cr272415 u60.fasta /dev/null'
108.21 ± 8.29 times faster than './original/release/cr272415 u60.fasta /dev/null'我还意识到了一件事(尽管这可能是CPU缓存--在一个更大的程序中效率低下):内联整个带加号的写调用,并将常量标记为字节字符串,从而完全避免.as_ref()。超级精细说这是1.32 ± 0.11 times faster than opt2 (上面的编辑)。
macro_rules! wf {
($s:expr) => {
to_write_file.write_all($s)
};
}
for character in line_string.chars() {
match character {
'A' => wf!(b"0,0,0,1\n"),
'G' => wf!(b"0,0,1,0\n"),
'C' => wf!(b"0,1,0,0\n"),
'T' => wf!(b"1,0,0,0\n"),
'N' => wf!(b"0,0,0,0\n"),
_ => wf!(b"NA\n"),
}
.unwrap();
}~/b/cr272415 (vecw *) $ ./bench.sh
Benchmark 1: ./targets/original/release/cr272415 u60.fasta /dev/null
Time (mean ± σ): 6.294 s ± 0.182 s [User: 3.942 s, System: 2.340 s]
Range (min … max): 6.060 s … 6.592 s 10 runs
Benchmark 2: ./targets/opt1/release/cr272415 u60.fasta /dev/null
Time (mean ± σ): 176.7 ms ± 6.9 ms [User: 172.1 ms, System: 3.2 ms]
Range (min … max): 166.1 ms … 188.5 ms 16 runs
Benchmark 3: ./targets/opt2/release/cr272415 u60.fasta /dev/null
Time (mean ± σ): 61.5 ms ± 2.9 ms [User: 57.8 ms, System: 2.6 ms]
Range (min … max): 58.4 ms … 73.8 ms 43 runs
Benchmark 4: ./targets/vecw/release/cr272415 u60.fasta /dev/null
Time (mean ± σ): 46.0 ms ± 2.9 ms [User: 42.6 ms, System: 2.5 ms]
Range (min … max): 42.2 ms … 55.7 ms 60 runs
Summary
'./targets/vecw/release/cr272415 u60.fasta /dev/null' ran
1.33 ± 0.11 times faster than './targets/opt2/release/cr272415 u60.fasta /dev/null'
3.84 ± 0.29 times faster than './targets/opt1/release/cr272415 u60.fasta /dev/null'
136.71 ± 9.55 times faster than './targets/original/release/cr272415 u60.fasta /dev/null'https://codereview.stackexchange.com/questions/272415
复制相似问题