Edge Deployment of a YOLO11 Fire Detection Model on the RDK X5

Thepoly

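1. Project Overview: Edge Deployment of a YOLO11 Fire Detection Model on the RDK X5

Deploying object detection models on edge devices is a common industrial requirement. This article walks through the full process of deploying a custom-trained YOLO11 fire detection model to the D-Robotics RDK X5 development board. The RDK X5 carries a Horizon Bayes-e BPU rated at 10 TOPS of INT8 compute, which suits lightweight object detection models well.

The main challenges in this project are:

Converting the model from the PyTorch training environment to a quantized format the BPU supports
Preserving model accuracy while maximizing inference speed
Handling real-time video streams on the edge device

The workflow spans three platforms: the Windows training environment, the Docker quantization environment, and the RDK X5 on-board environment. The hand-offs between them need care. The sections below go through each stage, its key techniques, and the pitfalls to watch for.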

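2. Environment Setup and Toolchain Configuration

2.1 Training Environment (Windows/Linux)

For model training, create a dedicated Python environment with conda:

conda create -n yolo11 python=3.10 -y
conda activate yolo11

# Install PyTorch (pick the index URL that matches your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# Install the Ultralytics library
pip install ultralytics

# Install ONNX tooling
pip install onnx onnxruntime

Note: use Ultralytics ≥ 8.3.0. YOLO11 was introduced in that release, so earlier versions do not include it.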

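2.2 Docker Quantization Environment

Horizon provides a dedicated OE (Open Explorer) Docker image for model quantization:

# Pull the latest Docker image
docker pull openexplorer/ai_toolchain:latest

# Start a container and mount the working directory
docker run -it --rm \
    -v /your/local/path:/fire_quant \
    openexplorer/ai_toolchain:latest \
    /bin/bash

Key tools:

hb_mapper checker: checks whether the ONNX model is compatible with the BPU
hb_mapper makertbin: performs the actual quantization
The /usr/local/ directory contains the complete BPU toolchain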

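2.3 RDK X5 On-Board Environment

The RDK X5 factory image already ships with the required runtime:

# Verify that hobot_dnn is installed
python3 -c "from hobot_dnn import pyeasy_dnn; print('OK')"

# Show BPU information
cat /sys/class/bpu/bpu0/info

No extra Python packages are needed on the board, but the following tools are worth having:

rsync: fast file synchronization
tmux: managing long-running tasks
vim: editing configuration files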

3. Model Training and Tuning

3.1 Dataset Preparation and Configuration

The fire detection dataset should use the standard YOLO format, with this directory layout:

dataset/
├── images/
│   ├── train/
│   │   ├── img001.jpg
│   │   └── ...
│   └── val/
│       ├── img001.jpg
│       └── ...
└── labels/
    ├── train/
    │   ├── img001.txt  # format: class_id cx cy w h (normalized coordinates)
    │   └── ...
    └── val/
        ├── img001.txt
        └── ...

The matching dataset configuration file, forestfire.yaml:

path: /path/to/dataset
train: images/train
val: images/val

nc: 1  # number of classes (fire detection has a single class)
names: ['fire']  # class names
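
Malformed label files tend to cause silent training problems, so it can help to check them before training. The sketch below validates one label line against the `class_id cx cy w h` format described above. The helper name `parse_label_line` is hypothetical and not part of Ultralytics.

```python
def parse_label_line(line, num_classes=1):
    """Parse one YOLO label line: 'class_id cx cy w h' with normalized coordinates.

    Returns (class_id, cx, cy, w, h) or raises ValueError describing the problem.
    """
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    class_id = int(parts[0])
    if not 0 <= class_id < num_classes:
        raise ValueError(f"class_id {class_id} outside [0, {num_classes})")
    cx, cy, w, h = (float(p) for p in parts[1:])
    for name, value in (("cx", cx), ("cy", cy), ("w", w), ("h", h)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is not normalized to [0, 1]")
    if w == 0.0 or h == 0.0:
        raise ValueError("box has zero width or height")
    return class_id, cx, cy, w, h


print(parse_label_line("0 0.5 0.5 0.2 0.3"))
```

A line given in pixel coordinates, such as `0 320 240 50 60`, is rejected, which catches the most common conversion mistake.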

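3.2 Training Script and Hyperparameters

A baseline training script:

from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # official pretrained weights

model.train(
    data="forestfire.yaml",
    imgsz=640,
    epochs=100,        # key parameter: train for at least 100 epochs
    batch=60,          # adjust to fit GPU memory
    close_mosaic=10,   # disable mosaic augmentation for the last 10 epochs
    workers=4,         # number of data-loading workers
    device='0',        # use GPU 0
    optimizer='SGD',   # on small datasets SGD often works better than Adam
    lr0=0.01,          # initial learning rate
    lrf=0.1,           # final learning rate = lr0 * lrf
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3,   # learning-rate warmup
    warmup_momentum=0.8,
    box=7.5,           # box loss gain
    cls=0.5,           # cls loss gain
    dfl=1.5,           # dfl loss gain
    hsv_h=0.015,       # hue augmentation
    hsv_s=0.7,         # saturation augmentation
    hsv_v=0.4,         # value (brightness) augmentation
    degrees=0.0,       # rotation range
    translate=0.1,     # translation range
    scale=0.5,         # scaling range
    shear=0.0,         # shear range
    perspective=0.0,   # perspective transform
    flipud=0.0,        # vertical flip probability
    fliplr=0.5,        # horizontal flip probability
    mosaic=1.0,        # mosaic probability
    mixup=0.0,         # mixup probability
    copy_paste=0.0,    # copy-paste probability
    project='results',
    name='fire-detect',
)

Key training advice:

Epochs: train for at least 100 epochs; a model trained for only 10 epochs performs very poorly after quantization
Data augmentation: for fire detection, somewhat stronger color jitter (hsv_s/hsv_v) helps robustness
Loss weights: adjusting the box/dfl loss gains can improve localization accuracy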
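
To make `lr0` and `lrf` concrete: assuming the linear decay schedule that Ultralytics applies by default (that is, with `cos_lr` left off), the learning rate falls from `lr0` to `lr0 * lrf` over training. The sketch below ignores warmup and only illustrates the arithmetic; `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(epoch, epochs=100, lr0=0.01, lrf=0.1):
    """Learning rate at a given epoch under a linear decay from lr0 to lr0 * lrf."""
    factor = max(1 - epoch / epochs, 0) * (1.0 - lrf) + lrf
    return lr0 * factor


for epoch in (0, 50, 100):
    print(epoch, round(linear_lr(epoch), 6))
```

With the values from the script, the rate starts at 0.01, passes 0.0055 at the midpoint, and ends at 0.001.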

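3.3 Validation and Reading the Metrics

After training, evaluate the model on the validation set:

from ultralytics import YOLO

model = YOLO("results/fire-detect/weights/best.pt")
results = model.val()
print(f"mAP50: {results.box.map50:.3f}")  # mAP at IoU=0.5
print(f"mAP50-95: {results.box.map:.3f}") # mAP averaged over IoU=0.5:0.95

Quality bar before deployment:

mAP50 ≥ 0.65 (below this, post-quantization quality is hard to guarantee)
Validation recall ≥ 0.7
Every class has enough positive detections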

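4. Key Techniques for ONNX Export

4.1 Problems with the Standard Export

Calling Ultralytics' model.export() directly causes two serious problems:

Built-in DFL decoding: the model gets split into multiple subgraphs and inference speed drops sharply
Built-in Sigmoid activation: it quantizes poorly on the BPU and hurts classification accuracy

4.2 Custom Export

The correct approach is to replace the head's forward function so that only the convolutional part is kept:

import types

import torch
from ultralytics import YOLO

model = YOLO("results/fire-detect/weights/best.pt")
model.model.eval()  # export in inference mode
head = model.model.model[-1]

# Override forward to emit 6 feature maps (bbox and cls at 3 scales)
def new_forward(self, x):
    result = []
    for i in range(self.nl):  # nl = number of detection layers (usually 3)
        bbox = self.cv2[i](x[i])  # (B, 64, H, W)
        cls = self.cv3[i](x[i])   # (B, nc, H, W)
        result.append(bbox)
        result.append(cls)
    return result

head.forward = types.MethodType(new_forward, head)

# Export to ONNX
dummy = torch.randn(1, 3, 640, 640)
torch.onnx.export(
    model.model, dummy, "fire_detect.onnx",
    input_names=['images'],
    output_names=['bbox_P3', 'cls_P3', 'bbox_P4', 'cls_P4', 'bbox_P5', 'cls_P5'],
    opset_version=11,
    do_constant_folding=True,
)

Checkpoints:

The model must output 6 feature maps (bbox and cls alternating)
The layout must stay NCHW; do not convert to NHWC
The graph must not contain the DFL or Sigmoid operations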
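
As a quick reference for the first two checkpoints, the expected output shapes can be derived from the input size alone. This is a minimal sketch assuming strides of 8, 16, and 32 and `reg_max=16` (which gives the 64 bbox channels noted in the export code); `expected_head_shapes` is a hypothetical helper.

```python
def expected_head_shapes(imgsz=640, num_classes=1, reg_max=16, strides=(8, 16, 32)):
    """Return the six NCHW output shapes, bbox and cls alternating per stride."""
    shapes = []
    for stride in strides:
        size = imgsz // stride
        shapes.append((1, 4 * reg_max, size, size))  # bbox: 4 sides x reg_max bins
        shapes.append((1, num_classes, size, size))  # cls: one logit map per class
    return shapes


for shape in expected_head_shapes():
    print(shape)
```

Comparing these tuples with the shapes reported by the exported ONNX file is a cheap way to catch an accidental NHWC transpose or a missing output.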

4.3 Verifying the ONNX Model

After export, verify the model structure:

import onnx

model = onnx.load("fire_detect.onnx")
# Count Softmax nodes (there should be exactly 1, from the attention module)
softmax_nodes = [n.name for n in model.graph.node if 'Softmax' in n.op_type]
print(f"Softmax nodes: {len(softmax_nodes)}")

Ideally the model contains a single Softmax node, the one from the attention module, and no DFL-related Softmax.

5. The Quantization Workflow

5.1 The Quantization Config File

Key parameters in fire_detect_config.yaml:

model_parameters:
  onnx_model: './fire_detect.onnx'
  march: 'bayes-e'  # BPU architecture
  output_model_file_prefix: 'fire_detect'
  node_info:
    "/model.10/m/m.0/attn/Softmax":  # force this Softmax to run on the BPU
      'ON': 'BPU'
      'InputType': 'int16'
      'OutputType': 'int16'

input_parameters:
  input_name: 'images'
  input_type_rt: 'nv12'   # input format on the board
  input_type_train: 'rgb' # input format used in training
  input_layout_train: 'NCHW'
  input_shape: '1x3x640x640'
  norm_type: 'data_scale'
  scale_value: '0.003921568627451'  # 1/255

calibration_parameters:
  cal_data_dir: './calibration_f32'
  cal_data_type: 'float32'
  calibration_type: 'max'  # 'kl' is also available

compiler_parameters:
  compile_mode: 'latency'  # optimize for latency
  optimize_level: 'O3'     # highest optimization level

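5.2 Preparing Calibration Data

Calibration data must meet these requirements:

RGB channel order (not BGR)
float32 data type
NCHW layout
Pixel values in the 0-255 range (no normalization)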
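
The last requirement follows from the config: with `norm_type: 'data_scale'`, the converted model multiplies its input by `scale_value` itself. The sketch below shows the arithmetic, assuming that scaling is the only preprocessing applied inside the model; `model_sees` is a hypothetical helper for illustration.

```python
SCALE_VALUE = 0.003921568627451  # 1/255, as in fire_detect_config.yaml


def model_sees(pixel, pre_normalized=False):
    """Value reaching the first layer for one pixel after the built-in scaling."""
    value = pixel / 255.0 if pre_normalized else float(pixel)
    return value * SCALE_VALUE


print(model_sees(255))                       # close to 1.0, as in training
print(model_sees(255, pre_normalized=True))  # close to 0.0039, nearly black
```

If the calibration script also divides by 255, the brightest pixel reaches the network at roughly 0.004 instead of 1.0, so the calibrated activation ranges no longer match real inputs.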

Calibration data generation script:

import glob
import os

import cv2
import numpy as np

def prepare_image(img_path, target_size=640):
    img = cv2.imread(img_path)
    h, w = img.shape[:2]
    
    # Letterbox resize
    scale = min(target_size/h, target_size/w)
    nh, nw = int(h*scale), int(w*scale)
    resized = cv2.resize(img, (nw, nh))
    
    # Pad to target_size x target_size
    canvas = np.full((target_size, target_size, 3), 114, dtype=np.uint8)
    top = (target_size - nh) // 2
    left = (target_size - nw) // 2
    canvas[top:top+nh, left:left+nw] = resized
    
    # BGR→RGB→float32 (do not normalize!)
    rgb = cv2.cvtColor(canvas, cv2.COLOR_BGR2RGB)
    rgb_f32 = rgb.astype(np.float32)  # stays in the 0-255 range
    
    # HWC→CHW (byte-identical to NCHW with N=1 once written to disk)
    chw = rgb_f32.transpose(2, 0, 1)
    return chw

# Generate 100 calibration samples (adjust the glob pattern to your dataset)
os.makedirs("calibration_f32", exist_ok=True)
image_paths = sorted(glob.glob("dataset/images/train/*.jpg"))
for idx, img_path in enumerate(image_paths[:100]):
    data = prepare_image(img_path)
    data.tofile(f"calibration_f32/{idx:06d}.f32")
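
Because `tofile` writes raw values with no header, a wrong shape or dtype only shows up as a wrong file size. The following sketch checks a calibration directory against the expected size of 3 x 640 x 640 float32 values; both helper names are hypothetical, and the temporary files exist only so the example runs without real data.

```python
import os
import tempfile


def expected_calibration_bytes(channels=3, height=640, width=640, bytes_per_value=4):
    """Size in bytes of one raw float32 CHW calibration sample."""
    return channels * height * width * bytes_per_value


def find_bad_calibration_files(directory, expected_bytes):
    """Return the names of files whose size differs from expected_bytes."""
    bad = []
    for name in sorted(os.listdir(directory)):
        if os.path.getsize(os.path.join(directory, name)) != expected_bytes:
            bad.append(name)
    return bad


print(expected_calibration_bytes())  # 3 * 640 * 640 * 4 = 4915200

# Tiny self-check with fake files so the sketch runs without real calibration data
with tempfile.TemporaryDirectory() as tmp:
    for name, size in (("000000.f32", 16), ("000001.f32", 12)):
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(b"\0" * size)
    print(find_bad_calibration_files(tmp, expected_bytes=16))
```

On real data the call would be `find_bad_calibration_files("calibration_f32", expected_calibration_bytes())`; a file a quarter of the expected size usually means uint8 was written instead of float32.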

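5.3 Running Quantization

Run these inside the Docker environment:

# Check model compatibility
hb_mapper checker --model-type onnx --march bayes-e --model fire_detect.onnx

# Run quantization
hb_mapper makertbin --model-type onnx --config fire_detect_config.yaml

After quantization finishes, check:

The .bin file under model_output/
The Cosine Similarity reported in the log (it should be ≥ 0.99)
The number of subgraphs (ideally 1 or 2)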

6. On-Board Deployment and Performance Tuning

6.1 Loading and Verifying the Model

from hobot_dnn import pyeasy_dnn as dnn

models = dnn.load('fire_detect.bin')
model = models[0]

print("Inputs:")
for inp in model.inputs:
    print(f"  {inp.properties}")

print("\nOutputs:")
for out in model.outputs:
    print(f"  {out.properties}")

Expected output shapes:

Output[0]: shape=(1, 64, 80, 80)  # bbox_P3
Output[1]: shape=(1, 1, 80, 80)   # cls_P3
Output[2]: shape=(1, 64, 40, 40)  # bbox_P4
Output[3]: shape=(1, 1, 40, 40)   # cls_P4
Output[4]: shape=(1, 64, 20, 20)  # bbox_P5
Output[5]: shape=(1, 1, 20, 20)   # cls_P5

6.2 Real-Time Inference

The processing flow:

Image preprocessing (BGR→NV12)
Model inference
Output parsing (DFL decoding)
Post-processing (NMS)
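
The DFL decoding step is the least obvious of the four. For each side of a box the head emits `reg_max` logits; a softmax turns them into a distribution over bin indices, and the expected index is the distance from the cell center in grid cells. The pure-Python sketch below works through one side with made-up logits, assuming `reg_max=16`; `dfl_expectation` is a hypothetical helper.

```python
import math


def dfl_expectation(logits):
    """Softmax over the bin logits, then the expected bin index (in grid cells)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return sum(i * e / total for i, e in enumerate(exps))


# 16 bins; equal logits on bins 2 and 3 dominate, so the distance is about 2.5 cells
logits = [-10.0] * 16
logits[2] = logits[3] = 5.0
distance_cells = dfl_expectation(logits)
print(distance_cells, distance_cells * 8)  # multiplying by the stride gives pixels on P3
```

Multiplying by the stride (8, 16, or 32) converts the distance to pixels in the 640x640 letterboxed image.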

import cv2
import numpy as np

def bgr_to_nv12(image, target_size=640):
    # Letterbox resize
    h, w = image.shape[:2]
    scale = min(target_size/h, target_size/w)
    nh, nw = int(h*scale), int(w*scale)
    resized = cv2.resize(image, (nw, nh))
    
    # Pad to target_size x target_size
    canvas = np.full((target_size, target_size, 3), 114, dtype=np.uint8)
    top = (target_size - nh) // 2
    left = (target_size - nw) // 2
    canvas[top:top+nh, left:left+nw] = resized
    
    # BGR→I420 (planar Y, U, V), then interleave U and V to obtain NV12
    area = target_size * target_size
    yuv = cv2.cvtColor(canvas, cv2.COLOR_BGR2YUV_I420).reshape(-1)
    y = yuv[:area]
    u = yuv[area:area + area // 4]
    v = yuv[area + area // 4:]
    uv = np.stack([u, v], axis=-1).reshape(-1)
    nv12 = np.concatenate([y, uv])
    return nv12, scale, left, top

def dfl_decode(bbox_feat, reg_max=16):
    # (1, 4*reg_max, H, W) → (4, H, W): expected distance per side, in grid cells
    h, w = bbox_feat.shape[2:]
    logits = bbox_feat.reshape(4, reg_max, h, w)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    prob = np.exp(logits)
    prob /= prob.sum(axis=1, keepdims=True)
    bins = np.arange(reg_max, dtype=np.float32).reshape(1, reg_max, 1, 1)
    return (prob * bins).sum(axis=1)

def parse_outputs(outputs, scale, pad_left, pad_top, conf_thresh=0.3):
    # Iterate over the three detection heads (P3/P4/P5)
    for stride, out_bbox, out_cls in zip([8, 16, 32], outputs[::2], outputs[1::2]):
        # Read the bbox and cls feature maps (NCHW, batch size 1)
        bbox = np.asarray(out_bbox.buffer, dtype=np.float32)  # (1, 64, H, W)
        cls = np.asarray(out_cls.buffer, dtype=np.float32)    # (1, 1, H, W)
        
        # Sigmoid class scores; single class, so drop the batch and class axes
        scores = 1 / (1 + np.exp(-cls[0, 0]))
        
        # DFL decode: left/top/right/bottom distances, each of shape (H, W)
        left, top, right, bottom = dfl_decode(bbox)
        
        # Grid coordinates
        h, w = scores.shape
        gy, gx = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
        
        # Absolute coordinates in the original image (undo the letterbox)
        x_center = (gx + 0.5) * stride
        y_center = (gy + 0.5) * stride
        x1 = (x_center - left * stride - pad_left) / scale
        y1 = (y_center - top * stride - pad_top) / scale
        x2 = (x_center + right * stride - pad_left) / scale
        y2 = (y_center + bottom * stride - pad_top) / scale
        
        # Drop low-score detections
        keep = scores > conf_thresh
        boxes = np.stack([x1[keep], y1[keep], x2[keep], y2[keep]], axis=-1)
        yield boxes, scores[keep]
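
The last step of the flow, NMS, is not shown in the code above. Below is a minimal greedy NMS sketch in pure Python for a single class. The IoU threshold of 0.45 is an assumed value rather than one taken from this project, and the function names are illustrative. On the board, OpenCV's `cv2.dnn.NMSBoxes` is a faster alternative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: visit boxes by descending score, keep those not overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # indices of the boxes that survive: [0, 2]
```

The boxes and scores yielded per head would first be concatenated across P3, P4, and P5, then passed through NMS once.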

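6.3 Performance Tuning

Input handling:
- Pass NV12 data with a zero-copy path
- Avoid unnecessary memory copies
Post-processing:
- Use a fast sigmoid approximation such as 0.5 * x / (1 + abs(x)) + 0.5, or compare raw logits against the logit of the confidence threshold so that the exact sigmoid only runs on surviving cells
- Filter out low-score boxes as early as possible
Multithreading:
- Use a producer-consumer pattern to separate frame capture from inference
- Pipeline preprocessing, inference, and post-processing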
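
The producer-consumer idea can be sketched with the standard library alone. In this minimal example, `frames` stands in for a camera source and `infer` for the BPU call; both are placeholders, and `run_pipeline` is a hypothetical name. The bounded queue is what keeps capture from running ahead of inference.

```python
import queue
import threading


def run_pipeline(frames, infer, max_pending=2):
    """Run capture and inference in separate threads joined by a bounded queue."""
    pending = queue.Queue(maxsize=max_pending)
    results = []
    stop = object()  # sentinel marking the end of the stream

    def producer():
        for frame in frames:
            pending.put(frame)  # blocks while the consumer is behind
        pending.put(stop)

    def consumer():
        while True:
            frame = pending.get()
            if frame is stop:
                break
            results.append(infer(frame))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_pipeline(range(5), infer=lambda frame: frame * 2))
```

This version never drops frames, which suits file playback. For a live camera it may be preferable to discard stale frames when the queue is full, so that latency stays bounded.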

Measured performance (RDK X5):

640x640 input: ~45 FPS
Latency: ~22 ms (end to end)

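7. Common Problems and Fixes

7.1 The Quantized Model Outputs No Boxes

Possible causes:

Calibration data was normalized twice (the script divides by 255 and the BPU divides by 255 again)
Too few training epochs (at least 100 are needed)
Wrong ONNX output layout (it should be NCHW)

How to investigate:

import numpy as np

# Check the output range of the model on the board
outputs = model.forward(nv12)
for out in outputs[1::2]:  # cls outputs
    buf = np.array(out.buffer)
    print(f"cls max: {buf.max()}, sigmoid: {1/(1+np.exp(-buf.max()))}")

For a frame that contains fire, the sigmoid of the maximum cls output should normally exceed 0.3.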
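
The same check can be read directly off the raw values by converting the threshold instead of the outputs. The sketch below computes the raw score that corresponds to a sigmoid of 0.3; the helper names are illustrative.

```python
import math


def logit(p):
    """Inverse of the sigmoid: the raw score whose sigmoid equals p."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


raw_threshold = logit(0.3)
print(raw_threshold)           # about -0.847
print(sigmoid(raw_threshold))  # back to about 0.3
```

So a printed `cls max` above roughly -0.85 already clears a 0.3 confidence threshold. If every head reports maxima far below that, for example around -10, the model is not responding to the input at all, and the causes listed above are the first things to rule out.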

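7.2 Slow Inference

Possible causes:

The model was split into multiple subgraphs
The Softmax operation was placed on the CPU

Fixes:

Force the Softmax to run on the BPU in the yaml
Use compile_mode: latency
Use a smaller model (for example YOLO11n instead of YOLO11s)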

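7.3 Boxes in the Wrong Position

Possible causes:

The ONNX export added an unnecessary permute operation
The post-processing code does not match the model's output layout
The letterbox scaling parameters were computed incorrectly

How to verify:

# Check the model's output shapes
for out in model.outputs:
    print(out.properties.shape)
# They should match the official models (e.g. the yolo examples in rdk_model_zoo)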
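
Letterbox mistakes are easy to isolate with a round-trip test: map a point from the original image onto the 640x640 canvas and back, and confirm it returns to where it started. This sketch uses the same scale and padding arithmetic as the preprocessing in this article; the function names are hypothetical.

```python
def letterbox_params(width, height, target=640):
    """Scale and padding used to fit a width x height image into a square canvas."""
    scale = min(target / height, target / width)
    new_w, new_h = int(width * scale), int(height * scale)
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return scale, pad_left, pad_top


def to_canvas(x, y, scale, pad_left, pad_top):
    """Original-image coordinates to letterboxed canvas coordinates."""
    return x * scale + pad_left, y * scale + pad_top


def to_original(x, y, scale, pad_left, pad_top):
    """Letterboxed canvas coordinates back to original-image coordinates."""
    return (x - pad_left) / scale, (y - pad_top) / scale


scale, pad_left, pad_top = letterbox_params(1280, 720)
print(scale, pad_left, pad_top)                       # 0.5 0 140
cx, cy = to_canvas(640, 360, scale, pad_left, pad_top)
print((cx, cy), to_original(cx, cy, scale, pad_left, pad_top))
```

For a 1280x720 frame the image center lands on the canvas center (320, 320) and maps back to (640, 360). Boxes that are offset vertically by a constant amount usually mean `pad_top` was skipped or applied after the division by `scale`.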

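8. Further Optimization

Model level:
- Knowledge distillation (a large model guides the training of a small one)
- Quantization-aware training (QAT)
- Channel pruning (fewer parameters)
Deployment level:
- Multi-model pipelines (detection followed by classification)
- Dynamic resolution (adjust the input size to the scene)
- Model sharding (splitting the model across multiple BPU cores)
Application level:
- Estimating flame area
- Analyzing flame motion trajectories
- Coordinated detection across multiple cameras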

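In practice, start with a simple model and add complexity step by step. Fire detection is a safety-critical application, so set the confidence threshold deliberately: raising it (for example to 0.5) reduces false alarms, but it also raises the risk of missed fires, so check the result against validation recall.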
