Direct 3D Detection of Vehicles in Monocular Images with a CNN Based 3D Decoder

Michael Weber1, Michael Fürst1, Johann Marius Zöllner2

  • 1FZI Research Center for Information Technology
  • 2FZI Forschungszentrum Informatik

Details

11:00 - 12:30 | Mon 10 Jun | Room 5 | MoAM_P1.3

Session: Poster 1: AV + Vision

Abstract

In autonomous driving, the detection of objects such as surrounding vehicles from monocular RGB images is usually performed by 2D bounding box detectors. The resulting 2D detections can serve as a first coarse 3D position estimate, but for a precise localization, additional sensor data has to be taken into account. For further use in sensor fusion systems and environment maps, it is preferable to detect objects, their orientation, and their dimensions directly in 3D coordinates. To address this 3D object detection task, we propose a direct 3D bounding box estimator realized as a CNN decoder module that can be connected to most 2D object detectors, such as SSD, OverFeat, YOLO, and RetinaNet, or directly to CNN feature extractors such as VGG and ResNet. The 3D parameters of the objects, such as dimensions and orientation, are predicted directly by the CNN module. To successfully train this complex MultiNet architecture, a combination and modification of current loss functions is proposed. The fastest of the proposed network module combinations detects objects in 3D camera coordinates at a frame rate of 28 fps.
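
The abstract describes a decoder head that regresses 3D box parameters directly on top of shared CNN features. Below is a minimal sketch of how such a head could look, assuming a grid-based (SSD/YOLO-style) output layout and a PyTorch implementation; the channel counts, output parametrization (center, dimensions, sin/cos yaw), and class names are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch (not the authors' code): a grid-style head that regresses
# 3D box parameters directly from shared CNN features. All layer sizes, channel
# counts, and the output parametrization are illustrative assumptions.
import torch
import torch.nn as nn


class Direct3DDecoder(nn.Module):
    """Predicts, per feature-map cell: objectness (1), 3D center in camera
    coordinates (3), dimensions w/h/l (3), and yaw encoded as sin/cos (2)."""

    def __init__(self, in_channels: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1 + 3 + 3 + 2, kernel_size=1),
        )

    def forward(self, features: torch.Tensor) -> dict:
        out = self.head(features)                      # [B, 9, H, W]
        return {
            "objectness": torch.sigmoid(out[:, 0:1]),  # detection confidence
            "center":     out[:, 1:4],                 # x, y, z in camera frame
            "dimensions": out[:, 4:7].exp(),           # w, h, l (kept positive)
            "yaw_sincos": torch.tanh(out[:, 7:9]),     # orientation encoding
        }


# Attach to any CNN feature extractor producing [B, 512, H, W] feature maps,
# e.g. a truncated VGG/ResNet backbone (stubbed here with random features).
decoder = Direct3DDecoder(in_channels=512)
features = torch.randn(1, 512, 12, 39)
predictions = decoder(features)
print({k: v.shape for k, v in predictions.items()})
```

In practice, such a head would be trained with a separate loss term per output (e.g. a classification loss for objectness and regression losses for center, dimensions, and orientation), which is in line with the combination of loss functions the abstract mentions, though the exact terms and weighting used in the paper are not specified here.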