Metrically-Scaled Monocular SLAM Using Learned Scale Factors

W. Nicholas Greene1, Nicholas Roy2

  • 1Massachusetts Institute of Technology
  • 2Massachusetts Institute of Technology

Details

09:15 - 09:30 | Mon 1 Jun | Room T2 | MoA02.1

Session: SLAM I

17:45 - 18:00 | Mon 1 Jun | Room T1 | MoD01.5

Session: Awards IV – Unmanned Aerial Vehicles, Robot Vision

Abstract

We propose an efficient method for monocular simultaneous localization and mapping (SLAM) that is capable of estimating metrically-scaled motion without additional sensors or hardware acceleration by integrating metric depth predictions from a neural network into a geometric SLAM factor graph. Unlike learned end-to-end SLAM systems, ours does not ignore the relative geometry directly observable in the images. Unlike existing learned depth estimation approaches, ours leverages the insight that when used to estimate scale, learned depth predictions need only be coarse in image space. This allows us to shrink our network to the point that performing inference on a standard CPU becomes computationally tractable. We make several improvements to our network architecture and training procedure to address the lack of depth observability when using coarse images, which allows us to estimate spatially coarse, but depth-accurate predictions in only 30 ms per frame without GPU acceleration. At runtime we incorporate the learned metric data as unary scale factors in a Sim(3) pose graph. Our method is able to generate accurate, scaled poses without additional sensors, hardware accelerators, or special maneuvers and does not ignore or corrupt the observable epipolar geometry. We show compelling results on the KITTI benchmark dataset in addition to real-world experiments with a handheld camera.
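The abstract describes folding the network's coarse metric depth predictions into the geometric back end as unary scale factors on Sim(3) keyframe poses. The sketch below is only an illustration of that general idea under assumed details (a per-keyframe scale taken as a ratio of medians between learned and SLAM depths, and a unary residual on the log-scale component of the keyframe pose); the function names and estimator are hypothetical and not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of a unary scale factor:
# compare up-to-scale SLAM depths with coarse metric depth predictions
# to get a per-keyframe scale estimate, then pull the log-scale component
# of that keyframe's Sim(3) pose toward it with a unary residual.
import numpy as np

def estimate_metric_scale(slam_depths, learned_depths):
    """Robust per-keyframe scale: ratio of medians between learned metric
    depths and up-to-scale SLAM depths, over pixels valid in both maps."""
    valid = (slam_depths > 0) & (learned_depths > 0)
    return np.median(learned_depths[valid]) / np.median(slam_depths[valid])

def unary_scale_residual(log_s_keyframe, log_s_measured, sqrt_info=1.0):
    """Unary factor on the log-scale of a Sim(3) keyframe pose."""
    return sqrt_info * (log_s_keyframe - log_s_measured)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_scale = 2.5                                     # metric scale missing from the monocular map
    slam_depths = rng.uniform(2.0, 20.0, size=(48, 64))  # up-to-scale depths at coarse resolution
    learned_depths = true_scale * slam_depths * rng.normal(1.0, 0.1, slam_depths.shape)

    s_meas = estimate_metric_scale(slam_depths, learned_depths)
    r = unary_scale_residual(np.log(1.0), np.log(s_meas))  # keyframe scale initialized to 1
    print(f"measured scale ~ {s_meas:.2f}, unary residual = {r:.3f}")
```

In such a formulation the relative (epipolar) geometry is still constrained by the usual inter-keyframe factors; the unary terms only anchor the otherwise unobservable global scale, which is consistent with the abstract's claim that the learned data need only be coarse in image space.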