Self-Supervised Learning of Dense Visual Descriptors

Tanner Schmidt¹, Richard Newcombe¹, Dieter Fox¹

  • ¹University of Washington

Details

10:30 - 10:35 | Tue 30 May | Room 4311/4312 | TUA3.8

Session: Computer Vision 1

Abstract

Robust estimation of correspondences between image pixels is an important problem in robotics, with applications in tracking, mapping, and recognition of objects, environments, and other agents. Correspondence estimation has long been the domain of hand-engineered features, but more recently deep learning techniques have provided powerful tools for learning features from raw data. The drawback of the latter approach is that a vast amount of (typically labelled) training data is required for learning. This paper advocates a new approach to learning dense image correspondences in which we harness the power of a strong 3D generative model to automatically label correspondences in video data. A fully-convolutional network is trained using a contrastive loss to produce viewpoint- and lighting-invariant features. As a proof of concept, we collected two datasets: the first depicts the upper torso and head of the same person in widely varied settings, and the second depicts an office as seen on multiple days with objects rearranged within it. Our datasets focus on re-visitation of the same objects and environments, and we show that by training the CNN only on local tracking data, our learned visual descriptor generalizes to identifying unlabelled correspondences across videos.
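The abstract describes training a fully-convolutional network with a contrastive loss over pixel correspondences labelled automatically by a 3D generative model. The following is a minimal sketch of one such pixelwise contrastive loss in PyTorch; the function name, tensor shapes, and margin value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pixelwise_contrastive_loss(desc_a, desc_b,
                               matches_a, matches_b,
                               nonmatches_a, nonmatches_b,
                               margin=0.5):
    """Contrastive loss over dense descriptor maps (hypothetical sketch).

    desc_a, desc_b:        (H*W, D) descriptor maps of two frames, flattened.
    matches_a, matches_b:  (N,) flat pixel indices of corresponding pixels.
    nonmatches_a/_b:       (M,) flat pixel indices of non-corresponding pixels.
    """
    # Match term: pull descriptors of corresponding pixels together.
    match_dist_sq = (desc_a[matches_a] - desc_b[matches_b]).pow(2).sum(dim=1)
    match_loss = match_dist_sq.mean()

    # Non-match term: push non-corresponding descriptors at least `margin` apart.
    nonmatch_dist = (desc_a[nonmatches_a] - desc_b[nonmatches_b]).norm(dim=1)
    nonmatch_loss = F.relu(margin - nonmatch_dist).pow(2).mean()

    return match_loss + nonmatch_loss
```

In such a setup, the matching pixel pairs would come from the 3D model's dense tracking across frames, while non-match pairs could be sampled at random from the remaining pixels.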