I'm running what I believe is a very lightweight CNN on an NVIDIA Jetson Nano with JetPack 4.4. NVIDIA claims the Nano can run ResNet-50 at 36 fps, so I expected my much smaller network to easily run at 30+ fps.
In reality, each forward pass takes 160-180 ms, so the best I can get is 5-6 fps. In production the predictions have to be made in real time on a live camera stream, so using a batch size > 1 to improve per-sample throughput is not an option.
Is there something fundamentally wrong with my inference code? Am I wrong to assume that this network architecture should compute considerably faster than, say, ResNet-50? What can I do to figure out what is actually taking so much time?
My CNN:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lambda (Lambda) (None, 210, 848, 3) 0
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 210, 282, 3) 0
_________________________________________________________________
conv2d (Conv2D) (None, 102, 138, 16) 2368
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 51, 69, 16) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 24, 33, 32) 12832
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 12, 16, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 4, 6, 64) 51264
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 2, 3, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 384) 0
_________________________________________________________________
dropout (Dropout) (None, 384) 0
_________________________________________________________________
dense (Dense) (None, 64) 24640
_________________________________________________________________
dropout_1 (Dropout) (None, 64) 0
_________________________________________________________________
elu (ELU) (None, 64) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 65
=================================================================
Total params: 91,169
Trainable params: 91,169
Non-trainable params: 0
_________________________________________________________________
Code:
import numpy as np
import cv2
import time
import tensorflow as tf
from tensorflow import keras
model_name = 'v9_small_FC_epoch_3'
loaded_model = keras.models.load_model('/home/jetson/notebooks/trained_models/' + model_name + '.h5')
loaded_model.summary()
frame = cv2.imread('/home/jetson/notebooks/frame1.jpg')
test_data = np.expand_dims(frame, axis=0)
# Time ten consecutive forward passes on the same frame
for i in range(10):
    start = time.time()
    predictions = loaded_model.predict(test_data)
    print(predictions[0][0])
    end = time.time()
    print("Inference took {}s".format(end - start))
Results:
4.7763316333293915
Inference took 10.111131191253662s
4.7763316333293915
Inference took 0.1822071075439453s
4.7763316333293915
Inference took 0.17330455780029297s
4.7763316333293915
Inference took 0.18085694313049316s
4.7763316333293915
Inference took 0.16646790504455566s
4.7763316333293915
Inference took 0.1703803539276123s
4.7763316333293915
Inference took 0.1788337230682373s
4.7763316333293915
Inference took 0.17131853103637695s
4.7763316333293915
Inference took 0.1660606861114502s
4.7763316333293915
Inference took 0.18377089500427246s
EDIT: To make sure I wasn't simply underestimating my network, I replaced it with a network consisting of just a single input and a single output neuron. As expected, the initial loading of the model was significantly faster, but after that, inference was almost exactly as slow.
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lambda (Lambda) (None, 1, 1, 1) 0
_________________________________________________________________
dense (Dense) (None, 1, 1, 1) 2
=================================================================
Total params: 2
Trainable params: 2
Non-trainable params: 0
_________________________________________________________________
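For illustration, such a two-parameter model could be defined as follows. This is only a sketch: the original Lambda body is not shown in the post, so an identity function is assumed here.

# Minimal sketch of the two-parameter sanity-check model.
# The Lambda body is an assumption (identity); the original's is not shown.
from tensorflow import keras

tiny_model = keras.Sequential([
    keras.layers.Lambda(lambda x: x, input_shape=(1, 1, 1)),
    keras.layers.Dense(1),  # 1 weight + 1 bias = 2 params, matching the summary
])
tiny_model.summary()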
2021-01-06 20:44:22.361558: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
Inference took 1.9230175018310547s
Inference took 0.17112112045288086s
Inference took 0.16610288619995117s
Inference took 0.1768038272857666s
Inference took 0.16962003707885742s
Inference took 0.16416263580322266s
Inference took 0.17536258697509766s
Inference took 0.16603755950927734s
Inference took 0.16376280784606934s
Inference took 0.16828060150146484s
On my desktop (i5-2500K, GTX 1070 Ti), even the first prediction takes only ~26 ms:
Inference took 0.02569293975830078s
Inference took 0.026061534881591797s
Inference took 0.023118019104003906s
Inference took 0.023060083389282227s
Inference took 0.02504444122314453s
Inference took 0.02664470672607422s
Answer (posted 2021-01-07 00:42:29):
It turns out that converting the model to TensorRT improves performance by more than 10x (!) for me, which I did not expect at all.
The downsides are that loading the TensorRT model now takes >2 minutes, and for reasons I cannot figure out, the script occupies 2.2 GB of RAM. Getting the conversion process to work was also quite painful; I'm going to start a separate Q&A on that topic, since it seems many people end up giving up on it.
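For reference, here is a minimal conversion sketch using TF-TRT, the TensorRT integration bundled with the TensorFlow build for JetPack. The paths and the FP16 precision mode are assumptions, and the exact API surface depends on the TensorFlow version installed.

# Minimal TF-TRT conversion sketch (TF 2.x); all paths are placeholders.
# TF-TRT consumes the SavedModel format, so the .h5 model is re-saved first.
from tensorflow import keras
from tensorflow.python.compiler.tensorrt import trt_convert as trt

model = keras.models.load_model('v9_small_FC_epoch_3.h5')
model.save('v9_small_FC_saved_model')  # export as SavedModel

# FP16 is a reasonable choice for the Nano's GPU; INT8 would need calibration data.
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='v9_small_FC_saved_model',
    conversion_params=params)
converter.convert()
converter.save('v9_small_FC_trt')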
The TensorRT model seems to need some warm-up (~100 passes) before it settles at its final inference speed, which in my case is 15-17 ms (roughly 59-66 fps). Quite an amazing improvement, I would say.
Inference took 100.2991828918457s
Inference took 0.2558176517486572s
Inference took 0.04433894157409668s
Inference took 0.037764787673950195s
Inference took 0.03640627861022949s
Inference took 0.04129934310913086s
Inference took 0.024821043014526367s
Inference took 0.0219266414642334s
...
Inference took 0.0170745849609375s
Inference took 0.016851186752319336s
Inference took 0.016122817993164062s
Inference took 0.01502084732055664s
Inference took 0.015442371368408203s
Inference took 0.01560211181640625s
Without TensorRT, inference not only takes longer on average, it also occasionally spikes much higher, in some cases up to 750 ms. For a real-time application that is a deal breaker.
With TensorRT, the inference time is quite stable: I haven't seen it vary by more than 15% over 10K passes.
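For completeness, a sketch of how the converted model can be loaded and warmed up before timing. The directory name matches the hypothetical conversion sketch above, and the input shape follows the model summary; the float32 dummy input is an assumption.

# Sketch: load the TF-TRT SavedModel, warm it up, then time inference.
import time
import numpy as np
import tensorflow as tf

trt_model = tf.saved_model.load('v9_small_FC_trt')  # hypothetical path
infer = trt_model.signatures['serving_default']

# The input tensor name is model-dependent; look it up from the signature.
input_name = list(infer.structured_input_signature[1].keys())[0]
dummy = tf.constant(np.zeros((1, 210, 848, 3), dtype=np.float32))

for _ in range(100):  # ~100 warm-up passes, as observed above
    infer(**{input_name: dummy})

for _ in range(10):   # timed passes at the settled speed
    start = time.time()
    infer(**{input_name: dummy})
    print("Inference took {}s".format(time.time() - start))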
https://datascience.stackexchange.com/questions/87601