Cite this article: ZHENG Yuhang, CAO Chuqing. Continuous Frame Depth Estimation Based on Multi-scale Feature Mixed Attention Mechanism[J]. Journal of Chongqing Technology and Business University (Natural Science Edition), 2024, 41(4): 104-111.
Continuous Frame Depth Estimation Based on Multi-scale Feature Mixed Attention Mechanism
ZHENG Yuhang1; CAO Chuqing1,2
1. School of Computer and Information, Anhui University of Engineering, Wuhu 241000, Anhui, China; 2. Yangtze River Delta HIT Robot Technology Research Institute, Wuhu 241000, Anhui, China
Abstract:
Objective Depth estimation, which recovers the distance from the camera to the photographed object, is the means of obtaining depth information in monocular visual SLAM. To address the insufficient accuracy and large errors of unsupervised monocular depth estimation algorithms, a continuous-frame depth estimation network based on a hybrid attention mechanism with multi-scale feature fusion is proposed. Methods Two encoder-decoder structures, one for depth estimation and one for pose estimation, produce the depth information and the 6-degree-of-freedom pose respectively; the depth and pose are used to reconstruct the image, and the loss computed against the original image supervises the output depth. The depth-estimation encoder-decoder forms a U-shaped network; the pose-estimation network shares the same encoder as the depth-estimation network and outputs the pose through its own decoder. In the encoder, the hybrid attention CBAM module is combined with a ResNet backbone to extract feature maps at four different scales. To sharpen the contour details of the estimated depth, learnable weight coefficients are assigned within each extracted scale to extract local and global features, which are then fused with the original features. Results The network was trained on the KITTI dataset and evaluated for error and accuracy, followed by testing. Compared with the classical monocular method Monodepth2, the absolute relative error, root mean square error, and logarithmic root mean square error are reduced by 0.034, 0.129, and 0.002 respectively; self-made test images demonstrate the generalization of the network. Conclusion Extracting multi-scale features with a ResNet combined with a hybrid attention mechanism, together with multi-scale fusion of the extracted features, improves the depth estimation and the contour details.
Keywords: monocular vision; continuous frame depth estimation; hybrid attention mechanism; multi-scale feature fusion
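The learnable-weight multi-scale fusion step described above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (softmax-normalized scalar weights per scale, nearest-neighbour upsampling, and a residual combination with the finest-scale features); it is not the authors' implementation, and all function names and shapes are hypothetical.

```python
import numpy as np

def softmax(w):
    e = np.exp(w - np.max(w))
    return e / e.sum()

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=-2).repeat(factor, axis=-1)

def fuse_multiscale(feats, weights):
    """Fuse feature maps from fine-to-coarse scales with learnable weights.

    feats   : list of (C, H/2^i, W/2^i) arrays, feats[0] being the finest scale
    weights : one learnable scalar per scale, normalized here by softmax
    """
    w = softmax(np.asarray(weights, float))
    target_h = feats[0].shape[-2]
    fused = sum(wi * upsample_nearest(f, target_h // f.shape[-2])
                for wi, f in zip(w, feats))
    # Residual combination: multi-scale context added back to the original features.
    return feats[0] + fused
```

With equal (zero) logits the softmax assigns each scale a weight of 1/3, so constant maps of value 1, 2, and 4 fuse to 7/3 before the residual addition.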
Continuous Frame Depth Estimation Based on Multi-scale Feature Mixed Attention Mechanism
ZHENG Yuhang1; CAO Chuqing1,2
1. School of Computer and Information, Anhui University of Engineering, Wuhu 241000, Anhui, China; 2. Yangtze River Delta HIT Robot Technology Research Institute, Wuhu 241000, Anhui, China
Abstract:
Objective Estimating depth, the distance between the photographed object and the camera, is the method for obtaining depth information in monocular visual SLAM. As unsupervised monocular depth estimation algorithms suffer from insufficient accuracy and large errors, a continuous frame depth estimation network based on a hybrid attention mechanism with multi-scale feature fusion was proposed. Methods The depth information and the 6-degree-of-freedom pose were obtained by two encoder-decoder structures, for depth estimation and pose estimation respectively. The depth and pose information were used for image reconstruction, with the loss computed against the original image to supervise the output depth. The depth-estimation encoder-decoder formed a U-shaped network; the pose-estimation network shared the same encoder as the depth-estimation network, and the pose information was output through the pose-estimation decoder. Feature maps at four different scales were extracted in the encoder using the hybrid attention CBAM module combined with a ResNet network. To enhance the contour details of the estimated depth, the extracted features at each scale were assigned learnable weight coefficients to extract local and global features, which were then fused with the original features. Results Error and accuracy were evaluated on the KITTI dataset, and testing was then performed. Compared with the classical monocular method Monodepth2, the absolute relative error, root mean square error, and logarithmic root mean square error were reduced by 0.034, 0.129, and 0.002 respectively, and self-made test images demonstrated the generalizability of the network.
Conclusion Multi-scale features are extracted using a ResNet network combined with a hybrid attention mechanism, and multi-scale fusion of the extracted features enhances the depth estimation and improves the contour details.
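The hybrid attention mechanism named in the Methods is the general CBAM design (sequential channel and spatial attention gates). The sketch below illustrates that mechanism in plain NumPy; the array shapes, reduction ratio, kernel size, and naive convolution loop are our illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """CBAM channel attention for a (C, H, W) feature map.

    w1: (C//r, C) and w2: (C, C//r) form the shared two-layer MLP
    applied to both the average-pooled and max-pooled descriptors.
    """
    avg = x.mean(axis=(1, 2))                       # (C,) global average pooling
    mx = x.max(axis=(1, 2))                         # (C,) global max pooling
    att = sigmoid(w2 @ np.maximum(w1 @ avg, 0.0) +
                  w2 @ np.maximum(w1 @ mx, 0.0))    # per-channel gate in (0, 1)
    return x * att[:, None, None]

def spatial_attention(x, kernel):
    """CBAM spatial attention; kernel is a (2, k, k) convolution filter."""
    # Channel-wise average and max give a (2, H, W) spatial descriptor.
    desc = np.stack([x.mean(axis=0), x.max(axis=0)])
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(desc, ((0, 0), (p, p), (p, p)))
    h, w = x.shape[1:]
    att = np.empty((h, w))
    for i in range(h):                              # naive same-padding convolution
        for j in range(w):
            att[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return x * sigmoid(att)[None, :, :]

def cbam(x, w1, w2, kernel):
    """Channel attention followed by spatial attention, as in CBAM."""
    return spatial_attention(channel_attention(x, w1, w2), kernel)
```

Because both gates lie in (0, 1), the module only rescales the feature map; its output keeps the input shape and never exceeds the input in magnitude.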
Key words: monocular vision; continuous frame depth estimation; hybrid attention mechanism; multi-scale feature fusion
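The Results quote three standard depth-evaluation metrics used on KITTI: absolute relative error, root mean square error, and logarithmic root mean square error. A sketch of how they are conventionally computed (the function name and the sparse-ground-truth masking convention are our assumptions, not taken from the paper):

```python
import numpy as np

def depth_errors(gt, pred):
    """Absolute relative error, RMSE, and log-RMSE over valid GT pixels."""
    gt = np.asarray(gt, float)
    pred = np.asarray(pred, float)
    mask = gt > 0                     # LiDAR ground truth on KITTI is sparse
    gt, pred = gt[mask], pred[mask]
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, rmse, rmse_log
```

For example, a prediction of 1 m against a ground truth of 2 m gives an absolute relative error of 0.5, an RMSE of 1.0, and a log-RMSE of ln 2.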
Journal of Chongqing Technology and Business University (Natural Science Edition). All rights reserved.
Address: Academic Journals Office, Chongqing Technology and Business University, 19 Xuefu Avenue, Nan'an District, Chongqing 400067, China
Tel: 023-62769495