Long-Tailed 3D Detection via Multi-Modal Fusion

Yechi Ma, Neehar Peri, Achal Dave, Wei Hua, Deva Ramanan, Shu Kong

Published: 2023/12/18

Abstract

Contemporary autonomous vehicle (AV) benchmarks have advanced techniques for training 3D detectors. While class labels naturally follow a long-tailed distribution in the real world, existing benchmarks focus only on a few common classes (e.g., pedestrian and car) and neglect many rare but crucial classes (e.g., emergency vehicle and stroller). However, AVs must reliably detect both common and rare classes for safe operation in the open world. We address this challenge by formally studying the problem of Long-Tailed 3D Detection (LT3D), which evaluates all annotated classes, including those in the tail. We address LT3D with hierarchical losses that promote feature sharing across classes, and introduce diagnostic metrics that award partial credit to "reasonable" mistakes with respect to the semantic hierarchy. Further, we point out that rare-class accuracy is particularly improved via multi-modal late fusion (MMLF) of independently trained uni-modal LiDAR and RGB detectors. Such an MMLF framework allows us to leverage large-scale uni-modal datasets (with more examples for rare classes) to train better uni-modal detectors. Finally, we examine three critical components of our simple MMLF approach from first principles: whether to train 2D or 3D RGB detectors for fusion, whether to match RGB and LiDAR detections in 3D or on the projected 2D image plane, and how to fuse matched detections. Extensive experiments reveal that 2D RGB detectors achieve better recognition accuracy for rare classes than 3D RGB detectors, that matching on the 2D image plane mitigates depth estimation errors, and that score calibration and probabilistic fusion notably improve final performance. Our MMLF significantly outperforms prior work for LT3D, particularly improving on the six rarest classes from 12.8 to 20.0 mAP! Our code and models are available on our project page.
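
The late-fusion recipe described in the abstract (match LiDAR and RGB detections on the 2D image plane, then fuse the scores of matched pairs) can be illustrated with a minimal sketch. This is not the authors' implementation. All names here (`Box2D`, `iou_2d`, `fuse_scores`, `late_fuse`) are hypothetical, the LiDAR boxes are assumed to be already projected to the image plane, the scores are assumed to be already calibrated, the matching is a simple greedy IoU match, and the fusion rule is a generic Bayesian combination under a conditional-independence assumption with a uniform prior. The paper's actual matching and fusion details may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box2D:
    """Axis-aligned image-plane box with a confidence score (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float  # assumed to be a calibrated probability in [0, 1]


def iou_2d(a: Box2D, b: Box2D) -> float:
    """Intersection-over-union of two axis-aligned 2D boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def fuse_scores(p_lidar: float, p_rgb: float) -> float:
    """Combine two scores, assuming conditionally independent detectors
    and a uniform prior (an illustrative choice, not the paper's rule)."""
    num = p_lidar * p_rgb
    den = num + (1.0 - p_lidar) * (1.0 - p_rgb)
    return num / den if den > 0 else 0.5


def late_fuse(
    lidar_boxes: List[Box2D],
    rgb_boxes: List[Box2D],
    iou_thresh: float = 0.5,  # assumed threshold, chosen for illustration
) -> List[Tuple[int, float]]:
    """Greedily match each projected LiDAR box to the unused RGB box with the
    highest IoU (at least iou_thresh). Returns (lidar_index, score) pairs;
    unmatched LiDAR boxes keep their original score."""
    results: List[Tuple[int, float]] = []
    used = set()
    for i, lb in enumerate(lidar_boxes):
        best_j, best_iou = None, iou_thresh
        for j, rb in enumerate(rgb_boxes):
            if j in used:
                continue
            overlap = iou_2d(lb, rb)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            results.append((i, lb.score))
        else:
            used.add(best_j)
            results.append((i, fuse_scores(lb.score, rgb_boxes[best_j].score)))
    return results
```

Under this fusion rule, two detectors that agree with confidence above 0.5 produce a fused score higher than either input, while a score of exactly 0.5 from one detector leaves the other unchanged. Matching in the image plane sidesteps the need to estimate depth for the RGB detections, which is the motivation the abstract gives for 2D matching.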
