I ran the same small benchmark in Java and Rust.
Java:
    public class Main {
        private static final int NUM_ITERS = 100;

        public static void main(String[] args) {
            long tInit = System.nanoTime();
            int c = 0;
            for (int i = 0; i < NUM_ITERS; ++i) {
                for (int j = 0; j < NUM_ITERS; ++j) {
                    for (int k = 0; k < NUM_ITERS; ++k) {
                        if (i*i + j*j == k*k) {
                            ++c;
                            System.out.println(i + " " + j + " " + k);
                        }
                    }
                }
            }
            System.out.println(c);
            System.out.println(System.nanoTime() - tInit);
        }
    }

Rust:
    use std::time::SystemTime;

    const NUM_ITERS: i32 = 100;

    fn main() {
        let t_init = SystemTime::now();
        let mut c = 0;
        for i in 0..NUM_ITERS {
            for j in 0..NUM_ITERS {
                for k in 0..NUM_ITERS {
                    if i*i + j*j == k*k {
                        c += 1;
                        println!("{} {} {}", i, j, k);
                    }
                }
            }
        }
        println!("{}", c);
        println!("{}", t_init.elapsed().unwrap().as_nanos());
    }

As expected, with NUM_ITERS = 100 Rust outperforms Java:
Java: 59311348 ns
Rust: 29629242 ns
But with NUM_ITERS = 1000, Rust takes much longer and Java is far faster:
Java: 1585835361 ns
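Rust: 28623818145 ns
What could be the reason for this? Shouldn't Rust perform better than Java in this case too? Or have I made a mistake in the implementation?
Update
I removed the lines System.out.println(i + " " + j + " " + k); and println!("{} {} {}", i, j, k); from the code. Here is the output: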
NUM_ITERS = 100
Java: 3843114 ns
Rust: 29072345 ns
NUM_ITERS = 1000
Java: 1014829974 ns
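Rust: 28402166953 ns
So without the println statements, Java outperforms Rust in both cases. I just want to know why that is. Java has the garbage collector and other overhead running. Have I implemented the loops suboptimally in Rust?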
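Answered 2021-04-22 21:14:16
I adjusted your code to address the points of criticism raised in the comments. Not compiling Rust in release mode was the biggest problem: it introduces a 50x overhead. Beyond that, I removed the printing from the measured region and gave the Java code a proper warm-up.
I'd say that after these changes Java and Rust are on par: they are within 2x of each other, and both have a very low cost per iteration (just a fraction of a nanosecond).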
Here is my code:
    public class Testing {
        private static final int NUM_ITERS = 1_000;
        private static final int MEASURE_TIMES = 7;

        public static void main(String[] args) {
            for (int i = 0; i < MEASURE_TIMES; i++) {
                System.out.format("%.2f ns per iteration%n", benchmark());
            }
        }

        private static double benchmark() {
            long tInit = System.nanoTime();
            int c = 0;
            for (int i = 0; i < NUM_ITERS; ++i) {
                for (int j = 0; j < NUM_ITERS; ++j) {
                    for (int k = 0; k < NUM_ITERS; ++k) {
                        if (i*i + j*j == k*k) {
                            ++c;
                        }
                    }
                }
            }
            if (c % 137 == 0) {
                // Use c so its computation can't be elided
                System.out.println("Count is divisible by 137: " + c);
            }
            long tookNanos = System.nanoTime() - tInit;
            return tookNanos / ((double) NUM_ITERS * NUM_ITERS * NUM_ITERS);
        }
    }

    use std::time::SystemTime;

    const NUM_ITERS: i32 = 1000;

    fn main() {
        let mut c = 0;
        let t_init = SystemTime::now();
        for i in 0..NUM_ITERS {
            for j in 0..NUM_ITERS {
                for k in 0..NUM_ITERS {
                    if i*i + j*j == k*k {
                        c += 1;
                    }
                }
            }
        }
        let took_ns = t_init.elapsed().unwrap().as_nanos() as f64;
        let iters = NUM_ITERS as f64;
        println!("{} ns per iteration", took_ns / (iters * iters * iters));
        // Use c to ensure its computation can't be elided by the optimizer
        if c % 137 == 0 {
            println!("Count is divisible by 137: {}", c);
        }
    }

I ran Java from IntelliJ with JDK 16, and Rust from the command line with cargo run --release.
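Since a forgotten --release flag accounted for most of the original gap, a benchmark can warn about it at startup. The following is a minimal sketch and not part of the measured code above; the helper name is_optimized_build is hypothetical. It relies on debug_assertions being enabled by default in Cargo's dev profile and disabled by default in the release profile, so a customized profile could make the check misleading.

```rust
/// Hypothetical helper: reports whether this binary was built without
/// debug assertions, which is the default for `cargo run --release`.
fn is_optimized_build() -> bool {
    // `cfg!(debug_assertions)` is evaluated at compile time.
    !cfg!(debug_assertions)
}

fn main() {
    if !is_optimized_build() {
        eprintln!("warning: debug build detected; rerun with `cargo run --release` before trusting any timings");
    }
    // ... the benchmark itself would run here ...
}
```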
Example Java output:
0.98 ns per iteration
0.93 ns per iteration
0.32 ns per iteration
0.34 ns per iteration
0.32 ns per iteration
0.33 ns per iteration
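0.32 ns per iteration
Example Rust output:
0.600314 ns per iteration
While I'm not necessarily surprised to see Java giving the better result (its JIT compiler has been optimized for 20 years now, and there is no object allocation, so no GC), I was puzzled by the overall low cost of an iteration. We can assume the expression i*i + j*j gets hoisted out of the inner loop, leaving just k*k inside it.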
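To make the hoisting assumption concrete, here is a minimal sketch that writes the transformation out by hand and checks that it does not change the count. The function names count_naive and count_hoisted are illustrative only; whether the compiler or the JIT actually performs this transformation on the benchmark is an assumption, not something this code proves.

```rust
/// Straightforward triple loop, as in the benchmark.
fn count_naive(n: i32) -> u32 {
    let mut c = 0;
    for i in 0..n {
        for j in 0..n {
            for k in 0..n {
                if i * i + j * j == k * k {
                    c += 1;
                }
            }
        }
    }
    c
}

/// Same loop with the loop-invariant sum computed once per (i, j) pair,
/// so only k * k remains in the innermost loop.
fn count_hoisted(n: i32) -> u32 {
    let mut c = 0;
    for i in 0..n {
        let ii = i * i;
        for j in 0..n {
            let sum = ii + j * j;
            for k in 0..n {
                if sum == k * k {
                    c += 1;
                }
            }
        }
    }
    c
}

fn main() {
    let n = 100;
    // Both versions must agree, since hoisting does not change the result.
    assert_eq!(count_naive(n), count_hoisted(n));
    println!("both versions count {} matches", count_hoisted(n));
}
```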
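I used a disassembler to check the code Rust produced. It definitely involves IMUL in the innermost loop. I read this answer, which says that Intel's IMUL instruction has a latency of only 3 CPU cycles. Combine that with multiple ALUs and instruction-level parallelism, and a result of 1 cycle per iteration becomes more plausible.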
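The step from nanoseconds to cycles is a single multiplication, sketched below. The 3.0 GHz clock is purely an assumed placeholder, since the clock speed of the test machine is not stated; the ns values are the approximate steady-state figures reported in this answer.

```rust
/// Converts a per-iteration time into CPU cycles for a given clock frequency.
fn cycles_per_iteration(ns_per_iter: f64, clock_ghz: f64) -> f64 {
    // 1 GHz means one cycle per nanosecond, so cycles = ns * GHz.
    ns_per_iter * clock_ghz
}

fn main() {
    // ASSUMPTION: 3.0 GHz is a placeholder, not the measured machine's clock.
    let clock_ghz = 3.0;
    for ns in [0.32, 0.60, 0.78] {
        println!(
            "{} ns/iter is about {:.2} cycles/iter at {} GHz",
            ns,
            cycles_per_iteration(ns, clock_ghz),
            clock_ghz
        );
    }
}
```

Under that assumed clock, 0.32 ns works out to roughly one cycle per iteration, which is the order of magnitude discussed above.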
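Another interesting thing I found: if I only check c % 137 == 0 but don't print the actual value of c in the Rust println! statement (printing only "Count is divisible by 137"), the iteration cost drops to just 0.26 ns. So Rust was able to eliminate a lot of work from the loop when I didn't ask for the exact value of c.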
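An alternative to the conditional print for keeping c alive is std::hint::black_box from the standard library. This is a sketch of that swapped-in technique, not what the code in this answer does; the function name count_matches and the input size 200 are illustrative, and the standard library documents black_box as a best-effort hint rather than a guarantee.

```rust
use std::hint::black_box;

/// Counts (i, j, k) triples below n with i*i + j*j == k*k.
fn count_matches(n: i32) -> u32 {
    let mut c = 0;
    for i in 0..n {
        for j in 0..n {
            for k in 0..n {
                if i * i + j * j == k * k {
                    c += 1;
                }
            }
        }
    }
    c
}

fn main() {
    // Passing the input through black_box asks the optimizer not to treat
    // it as a known constant.
    let n = black_box(200);
    let c = count_matches(n);
    // Passing the result through black_box asks the optimizer to treat the
    // exact value of c as used, without printing inside the timed region.
    black_box(c);
}
```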
Update
As discussed in the comments with @trentci, I mimicked the Java code more completely by adding an outer loop that repeats the measurement, which now lives in a separate function:
    use std::time::SystemTime;

    const NUM_ITERS: i32 = 1000;
    const MEASURE_TIMES: i32 = 7;

    fn main() {
        let total_iters: f64 = NUM_ITERS as f64 * NUM_ITERS as f64 * NUM_ITERS as f64;
        for _ in 0..MEASURE_TIMES {
            let took_ns = benchmark() as f64;
            println!("{} ns per iteration", took_ns / total_iters);
        }
    }

    fn benchmark() -> u128 {
        let mut c = 0;
        let t_init = SystemTime::now();
        for i in 0..NUM_ITERS {
            for j in 0..NUM_ITERS {
                for k in 0..NUM_ITERS {
                    if i*i + j*j == k*k {
                        c += 1;
                    }
                }
            }
        }
        // Use c to ensure its computation can't be elided by the optimizer
        if c % 137 == 0 {
            println!("Count is divisible by 137: {}", c);
        }
        t_init.elapsed().unwrap().as_nanos()
    }

Now I get this output:
0.781475 ns per iteration
0.760657 ns per iteration
0.783821 ns per iteration
0.777313 ns per iteration
0.766473 ns per iteration
0.774042 ns per iteration
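0.766718 ns per iteration
Another subtle change to the code resulted in a significant change in performance. However, it also demonstrates a key advantage of Rust over Java: no warm-up is needed to reach the best performance.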
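A side note on the timing calls used above: SystemTime reads the wall clock, which can be adjusted while the program runs, and that is why elapsed() returns a Result that has to be unwrapped. The standard library's std::time::Instant is monotonic and is the usual choice for measuring durations. The sketch below shows that substitution; the helper name time_ns is hypothetical, and the summation is only a stand-in workload.

```rust
use std::time::Instant;

/// Runs `work` once and returns its result together with the elapsed
/// time in nanoseconds, measured on the monotonic clock.
fn time_ns<R>(work: impl FnOnce() -> R) -> (R, u128) {
    let start = Instant::now();
    let result = work();
    // Instant::elapsed returns a Duration directly, with no Result to unwrap.
    (result, start.elapsed().as_nanos())
}

fn main() {
    // Stand-in workload; the benchmark loop would go in the closure instead.
    let (sum, ns) = time_ns(|| (0..1_000_000u64).sum::<u64>());
    println!("sum = {}, took {} ns", sum, ns);
}
```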
https://stackoverflow.com/questions/67211077