Dual-domain deformable feature fusion for multi-modal 3D object detection
Shihao Wang, Tao Deng
Abstract

Recent advances in 3D object detection using light detection and ranging (LiDAR)-camera fusion have enhanced autonomous driving perception. However, aligning LiDAR and image data during multi-modal fusion remains a significant challenge. We propose a novel multi-modal feature alignment and fusion architecture to effectively align and fuse voxel and image data. The architecture comprises four key modules: a Z-axis attention module that aggregates voxel features along the vertical axis via self-attention; a voxel-domain deformable encoder that encodes voxel features with deformable attention to improve context understanding; a dual-domain deformable feature alignment module that uses deformable attention to adaptively align voxel and image features, addressing resolution mismatches; and a gated fusion module that dynamically fuses the aligned features through a gating mechanism. A multi-layer design further improves feature detail retention and dual-domain fusion performance. Experimental results show that our method increases average precision by 2.41% at the "hard" difficulty level for cars on the KITTI test set. On the KITTI validation set, mean average precision improves by 1.06% for cars, 6.88% for pedestrians, and 1.83% for cyclists.
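The abstract describes the Z-axis attention and gated fusion modules only at a high level. The sketch below is a minimal PyTorch reading of those two ideas, not the authors' implementation: the module names (ZAxisAttention, GatedFusion), the mean aggregation over the vertical axis, and the sigmoid gate computed from concatenated features are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class ZAxisAttention(nn.Module):
    """Aggregate voxel features along the vertical (Z) axis with self-attention.

    Illustrative sketch: treats the Z bins of each bird's-eye-view (BEV) cell
    as a token sequence, applies self-attention, and averages over Z.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, voxel: torch.Tensor) -> torch.Tensor:
        # voxel: (B, C, Z, Y, X) -> one sequence of Z tokens per BEV cell.
        B, C, Z, Y, X = voxel.shape
        tokens = voxel.permute(0, 3, 4, 2, 1).reshape(B * Y * X, Z, C)
        attended, _ = self.attn(tokens, tokens, tokens)
        # Collapse the vertical axis to produce a BEV feature map.
        return attended.mean(dim=1).reshape(B, Y, X, C).permute(0, 3, 1, 2)


class GatedFusion(nn.Module):
    """Fuse aligned voxel and image feature maps with a learned gate."""

    def __init__(self, channels: int):
        super().__init__()
        # Gate predicted from both modalities, squashed to (0, 1).
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, voxel_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # Both inputs: (B, C, H, W), assumed already spatially aligned.
        g = self.gate(torch.cat([voxel_feat, image_feat], dim=1))
        # Per-location convex combination of the two modalities.
        return g * voxel_feat + (1.0 - g) * image_feat


# Toy usage with made-up shapes.
z_attn = ZAxisAttention(channels=64)
fusion = GatedFusion(channels=64)
bev = z_attn(torch.randn(2, 64, 8, 128, 128))       # (2, 64, 128, 128)
fused = fusion(bev, torch.randn(2, 64, 128, 128))   # (2, 64, 128, 128)
```

Under this reading, the gate forms a per-location convex combination of the two modalities, so the network can lean on image features where LiDAR returns are sparse and on voxel features elsewhere.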

© 2024 SPIE and IS&T
Shihao Wang and Tao Deng "Dual-domain deformable feature fusion for multi-modal 3D object detection," Journal of Electronic Imaging 33(6), 063022 (19 November 2024). https://doi.org/10.1117/1.JEI.33.6.063022
Received: 15 August 2024; Accepted: 30 October 2024; Published: 19 November 2024