Simulated inference is correct, but on-board inference predicts the same value for every input

2024-04-28 15:29

I ran into the following problem. I built the example at https://github.com/airockchip/rknn_model_zoo/blob/main/examples/mobilenet, and its simulated inference result is as follows:

W __init__: rknn-toolkit2 version: 1.6.0+81f21f4d
W load_onnx: If you don't need to crop the model, don't set 'inputs'/'input_size_list'/'outputs'!
W load_onnx: It is recommended onnx opset 19, but your onnx model opset is 12!
--> Config model
done
--> Loading model
W load_onnx: Model converted from pytorch, 'opset_version' should be set 19 in torch.onnx.export for successful convert!
Loading : 100%|███████████████████████████████████████████████| 178/178 [00:00<00:00, 231643.22it/s]
done
--> Building model
W build: found outlier value, this may affect quantization accuracy
const name abs_mean abs_std outlier value
478 0.89 1.59 -15.073
550 0.61 0.68 11.299
577 0.64 0.65 -9.877
604 0.60 0.55 -9.970
GraphPreparing : 100%|██████████████████████████████████████████| 100/100 [00:00<00:00, 1383.19it/s]
Quantizating : 100%|██████████████████████████████████████████████| 100/100 [00:03<00:00, 27.35it/s]
W build: The default input dtype of 'input' is changed from 'float32' to 'int8' in rknn model for performance!
Please take care of this change when deploy rknn model with Runtime API!
W build: The default output dtype of 'output' is changed from 'float32' to 'int8' in rknn model for performance!
Please take care of this change when deploy rknn model with Runtime API!
done
--> Export rknn model
done
--> Init runtime environment
done
--> Running model
W init_runtime: Target is None, use simulator!
W inference: The 'data_format' is not set, and its default value is 'nhwc'!
GraphPreparing : 100%|██████████████████████████████████████████| 102/102 [00:00<00:00, 2176.06it/s]
SessionPreparing : 100%|█████████████████████████████████████████| 102/102 [00:00<00:00, 568.67it/s]
--> PostProcess
-----TOP 5-----
[494] score=0.98 class="n03017168 chime, bell, gong"
[653] score=0.00 class="n03764736 milk can"
[505] score=0.00 class="n03063689 coffeepot"
[469] score=0.00 class="n02939185 caldron, cauldron"
[747] score=0.00 class="n04023962 punching bag, punch bag, punching ball, punchball"
done

The inference result on the board is as follows (also correct):

For my own model, the simulated inference result is as follows:

--> Config model
done
--> Loading model
done
--> Building model
W __init__: rknn-toolkit2 version: 1.6.0+81f21f4d
W load_onnx: If you don't need to crop the model, don't set 'inputs'/'input_size_list'/'outputs'!
W load_onnx: It is recommended onnx opset 19, but your onnx model opset is 17!
W load_onnx: Model converted from pytorch, 'opset_version' should be set 19 in torch.onnx.export for successful convert!
Loading : 100%|█████████████████████████████████████████████████| 46/46 [00:00<00:00, 218502.81it/s]
W load_onnx: The config.mean_values is None, zeros will be set for input 0!
W load_onnx: The config.std_values is None, ones will be set for input 0!
W build: Can not find 'idx' to insert, default insert to 0!
GraphPreparing : 100%|████████████████████████████████████████████| 41/41 [00:00<00:00, 1451.59it/s]
Quantizating 1/4: 100%|█████████████████████████████████████████████| 41/41 [00:02<00:00, 19.78it/s]
Quantizating 2/4: 100%|█████████████████████████████████████████████| 41/41 [00:01<00:00, 20.96it/s]
Quantizating 3/4: 100%|█████████████████████████████████████████████| 41/41 [00:01<00:00, 23.68it/s]
Quantizating 4/4: 100%|█████████████████████████████████████████████| 41/41 [00:01<00:00, 28.42it/s]
W build: The default input dtype of 'input' is changed from 'float32' to 'int8' in rknn model for performance!
Please take care of this change when deploy rknn model with Runtime API!
W build: The default output dtype of 'output' is changed from 'float32' to 'int8' in rknn model for performance!
Please take care of this change when deploy rknn model with Runtime API!
done
--> Export rknn model
done
>>> ../model/0.jpg
--> Init runtime environment
done
--> Running model
W init_runtime: Target is None, use simulator!
W inference: The 'data_format' is not set, and its default value is 'nhwc'!
GraphPreparing : 100%|████████████████████████████████████████████| 43/43 [00:00<00:00, 1807.82it/s]
SessionPreparing : 100%|██████████████████████████████████████████| 43/43 [00:00<00:00, 1179.42it/s]
-----TOP 2-----
[0] score=0.99 class="others"
[1] score=0.01 class="dog"
done
>>> ../model/1.jpg
--> Init runtime environment
W init_runtime: Target is None, use simulator!
W inference: The 'data_format' is not set, and its default value is 'nhwc'!
GraphPreparing : 0%| | 0/43 [00:00<?, ?it/s]done
--> Running model
GraphPreparing : 100%|████████████████████████████████████████████| 43/43 [00:00<00:00, 1231.53it/s]
SessionPreparing : 100%|██████████████████████████████████████████| 43/43 [00:00<00:00, 1292.83it/s]
-----TOP 2-----
[1] score=0.91 class="dog"
[0] score=0.09 class="others"
done

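Here I ran simulated inference on two images. Both results are correct, and the predicted classes differ. But when I run my model on the board, the results are wrong: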

# /data/RKNN/rknn_deepsleepv2_demo /root/model.rknn /data/RKNN/samples/0.jpg
init_mobilenet_model
model input num: 1, output num: 1
input tensors:
index=0, name=input, n_dims=4, dims=[1, 400, 800, 3], n_elems=960000, size=960000, fmt=NHWC, type=INT8, qnt_type=AFFINE, zp=-128, scale=1.000000
output tensors:
index=0, name=output, n_dims=2, dims=[1, 2, 0, 0], n_elems=2, size=2, fmt=UNDEFINED, type=INT8, qnt_type=AFFINE, zp=-6, scale=0.029430
input_attrs[0].size_with_stride=960000
output_attrs.size_with_stride=2
model is NHWC input fmt
model input height=400, width=800, channel=3
read_image
origin size=800x400 crop size=800x400
input image: 800 x 400, subsampling: 4:2:0, colorspace: YCbCr, orientation: 1
dma_buf_alloc
inference_mobilenet_model
src width=800 height=400 fmt=0x1 virAddr=0x0xa5f4e000 fd=9
dst width=800 height=400 fmt=0x1 virAddr=0x(nil) fd=3
color=0x0
rga_api version 1.10.0_[2]
rknn_run
Predictions:
Class: 1, Score: 0.9994
Class: 0, Score: 0.0006
The prediction of 1 has been save into /tmp/output.
release_mobilenet_model
#
# /data/RKNN/rknn_deepsleepv2_demo /root/model.rknn /data/RKNN/samples/black.jpg
init_mobilenet_model
model input num: 1, output num: 1
input tensors:
index=0, name=input, n_dims=4, dims=[1, 400, 800, 3], n_elems=960000, size=960000, fmt=NHWC, type=INT8, qnt_type=AFFINE, zp=-128, scale=1.000000
output tensors:
index=0, name=output, n_dims=2, dims=[1, 2, 0, 0], n_elems=2, size=2, fmt=UNDEFINED, type=INT8, qnt_type=AFFINE, zp=-6, scale=0.029430
input_attrs[0].size_with_stride=960000
output_attrs.size_with_stride=2
model is NHWC input fmt
model input height=400, width=800, channel=3
read_image
origin size=800x800 crop size=800x800
input image: 800 x 800, subsampling: 4:2:0, colorspace: YCbCr, orientation: 1
dma_buf_alloc
inference_mobilenet_model
src width=800 height=800 fmt=0x1 virAddr=0x0xa5e30000 fd=9
dst width=800 height=400 fmt=0x1 virAddr=0x(nil) fd=3
color=0x0
rga_api version 1.10.0_[2]
rknn_run
Predictions:
Class: 1, Score: 0.9994
Class: 0, Score: 0.0006
The prediction of 1 has been save into /tmp/output.
release_mobilenet_model

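I even tested an all-black and an all-white image, but the result is always the same: Class: 1, Score: 0.9994; Class: 0, Score: 0.0006.
In the simulated inference for my demo, the mean and std values are both None. For on-board inference, the input image is uint8 by default, processed the same way as the mobilenet input. No other special processing is applied. Still, I think the input can be ruled out as the cause, because I tested more than a hundred images, including all-0 and all-255 images, and on-board inference gave the same result for every one of them. Simulated inference, however, is correct (using this script: https://github.com/airockchip/rknn_mode ... bilenet.py), which also seems to rule out the RKNN model itself. This problem is really puzzling.
To summarize: with the official demo, both simulated and on-board inference are correct; with my demo, simulated inference is correct, but on-board inference gives the same result no matter what the input is. Note: I use the same compiled rknn_mobilenet_demo in both cases, so the only difference between my run and the official demo is the model.rknn path passed in.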
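
For reference, the tensor attributes printed in the board log (input zp=-128, scale=1.000000; output zp=-6, scale=0.029430) describe an affine int8 mapping. The sketch below shows that mapping in plain Python. It assumes the usual affine convention real = (q - zp) * scale and assumes the demo applies a softmax to the dequantized outputs; neither is confirmed by the log itself, and the helper names are made up for illustration.

```python
import math

# Values copied from the board log; everything else here is illustrative.
IN_ZP, IN_SCALE = -128, 1.0
OUT_ZP, OUT_SCALE = -6, 0.029430


def quantize(value, zp, scale):
    """Affine-quantize a real value to int8: q = round(value / scale) + zp."""
    q = round(value / scale) + zp
    return max(-128, min(127, q))


def dequantize(q, zp, scale):
    """Inverse mapping under the assumed convention: real = (q - zp) * scale."""
    return (q - zp) * scale


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_gap_in_steps(p_top, scale):
    """For a 2-class softmax, how many int8 steps separate the two raw outputs."""
    return math.log(p_top / (1.0 - p_top)) / scale


# With zp=-128 and scale=1.0, uint8 pixels map one-to-one onto int8.
print([quantize(p, IN_ZP, IN_SCALE) for p in (0, 128, 255)])  # [-128, 0, 127]

# Gap implied by the reported score 0.9994 (under the softmax assumption).
print(logit_gap_in_steps(0.9994, OUT_SCALE))
```

Under these assumptions, a score of 0.9994 corresponds to a gap of roughly 250 int8 steps between the two raw outputs, which is close to the full int8 span of 255. Dumping the raw int8 output values on the board would show whether they are actually pinned near the ends of that range.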

This is the last step of deploying the whole project. I would greatly appreciate any advice. Thank you very much!

2024-04-29 12:10

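Hello. When every input produces the same output, the cause is usually in the input. If the model structure has not been modified, it is very unlikely that a wrong normalization parameter inside the model is causing abnormal output values.
The rknn api on the RV1106 uses zero-copy mode. In the demo, the input data is passed to the model by convert_image inside inference_mobilenet_model. The example should be using DMA + RGA to accelerate image processing, which places requirements on the color format, and I think the problem may be there. You could try copying an 800*400 BMP image (with the file header stripped) directly to the virtual address of the model input (app_ctx.input_mems[0]->virt_addr), skipping the convert_image step, and check whether the identical-output problem goes away.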
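
One way to prepare such a header-less buffer on the host is sketched below. It is only a sketch under stated assumptions: it handles uncompressed 24-bit BMP only, and it assumes the model wants top-down rows in RGB order. 24-bit BMP files are commonly stored bottom-up, in BGR order, with rows padded to 4 bytes, so simply cutting off the header does not necessarily give that layout. The function name bmp_to_raw_rgb is hypothetical.

```python
import struct


def bmp_to_raw_rgb(bmp: bytes) -> bytes:
    """Return tightly packed, top-down RGB bytes from an uncompressed 24-bit BMP."""
    if bmp[:2] != b"BM":
        raise ValueError("not a BMP file")
    # Standard BMP layout: pixel-data offset at byte 10, width/height at 18/22,
    # bits per pixel at 28, compression method at 30 (all little-endian).
    (data_offset,) = struct.unpack_from("<I", bmp, 10)
    width, height = struct.unpack_from("<ii", bmp, 18)
    (bpp,) = struct.unpack_from("<H", bmp, 28)
    (compression,) = struct.unpack_from("<I", bmp, 30)
    if bpp != 24 or compression != 0:
        raise ValueError("only uncompressed 24-bit BMP is handled")

    row_size = (width * 3 + 3) & ~3  # each stored row is padded to 4 bytes
    rows = []
    for y in range(abs(height)):
        start = data_offset + y * row_size
        row = bmp[start:start + width * 3]  # drop the padding bytes
        rgb = bytearray(row)
        # Swap the B and R channels: BMP stores pixels as B, G, R.
        rgb[0::3], rgb[2::3] = row[2::3], row[0::3]
        rows.append(bytes(rgb))
    if height > 0:
        rows.reverse()  # positive height means rows are stored bottom-up
    return b"".join(rows)
```

For an 800*400 image the result should be 800 * 400 * 3 = 960000 bytes, which matches the size=960000 reported for the input tensor in the log. If the model was trained on BGR input, the channel swap would have to be dropped.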

2024-04-30 17:23

Crocodile wrote: ↑2024-04-29 12:10
[...] You could try copying an 800*400 BMP image (with the file header stripped) directly to the virtual address of the model input (app_ctx.input_mems[0]->virt_addr), skipping the convert_image step, and check whether the identical-output problem goes away.

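Thank you very much for your reply! I tried both convert_image and convert_image_cpu, and the run log shows that the image data was loaded successfully (see the log in my earlier post):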

src width=800 height=400 fmt=0x1 virAddr=0x0xa5f4e000 fd=9
dst width=800 height=400 fmt=0x1 virAddr=0x(nil) fd=3
color=0x0
rga_api version 1.10.0_[2]

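The same inference_mobilenet_model is used in both cases. Running
./inference_mobilenet_model model_mobilenet.rknn bell.jpg
gives the correct result, whereas running
./inference_mobilenet_model model_my.rknn bell.jpg
or
./inference_mobilenet_model model_my.rknn all_black.jpg
gives identical output.
This looks like a problem with the RKNN model? But simulated inference with that RKNN model is correct.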

2024-05-06 1:32

ezreal wrote: ↑2024-04-30 17:23
This looks like a problem with the RKNN model? But simulated inference with that RKNN model is correct.

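Hello. If you are sure the input can be ruled out, I cannot draw further conclusions from the model information you have provided. You may want to consider the less likely case I mentioned earlier: some layers of your model may have weights that are too small, with training having gone mainly into the biases, so that the precision loss during model conversion leaves those weights with no effect on inference. You may need to try lowering the learning rate, changing the activation function, or adding normalization to address this.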
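
The failure mode described here can be illustrated with a toy example. The sketch below is not the poster's model: the weights and bias are invented, and only the output scale 0.029430 is borrowed from the board log for illustration. It shows how a layer whose output is dominated by its bias can produce the same int8 value for every input once the output is quantized.

```python
# Hypothetical toy layer: tiny weights, large bias (values made up for illustration).
WEIGHTS = [1e-5, -2e-5, 1.5e-5]
BIAS = 3.0
OUT_SCALE = 0.029430  # borrowed from the board log purely as an example step size


def layer(x):
    """Float output of a single linear unit: w . x + b."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS


def to_int8(value, scale, zp=0):
    """Affine int8 quantization of a real value, clamped to [-128, 127]."""
    q = round(value / scale) + zp
    return max(-128, min(127, q))


black = [0, 0, 0]
white = [255, 255, 255]
mixed = [255, 0, 255]

for name, x in (("black", black), ("white", white), ("mixed", mixed)):
    y = layer(x)
    # The float outputs differ, but by less than one quantization step.
    print(name, y, to_int8(y, OUT_SCALE))
```

In this toy case the float outputs differ between inputs, but the differences are smaller than one quantization step, so all three inputs land on the same int8 value. Comparing per-layer float and quantized outputs for the real model would show whether something similar is happening there.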