Andrej Janda, Master's Student (2022)
Maintaining an accurate representation of the environment is necessary for many tasks in robotics, such as navigation, obstacle avoidance, and scene understanding. The two most common scene representations, particularly for scene-understanding tasks, are images and point clouds. Images contain dense, feature-rich information but lack knowledge of distances and object sizes, and objects in images are prone to occlusion. Modelling the 3D world directly with point clouds circumvents many of the limitations inherent to images. However, point clouds present their own challenges: in particular, they are significantly harder to annotate than images. This difficulty has resulted in considerable effort and long labelling times for existing 3D datasets. A successful approach to reducing reliance on annotations is self-supervised learning, which leverages unsupervised training on a large unlabelled dataset to initialize the parameters of a model that is subsequently trained with supervised annotations on a downstream task. Previous work has focused on self-supervised pre-training with point cloud data exclusively, which neglects the information-rich images that are often available as part of 3D datasets.
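The two-stage workflow described above can be illustrated with a minimal, hypothetical sketch. Here the self-supervised pretext task is linear reconstruction (whose closed-form optimum is PCA), and the downstream task is a logistic-regression head fine-tuned on a small labelled set; both the toy data and the choice of pretext task are illustrative assumptions, not the method studied in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: self-supervised pre-training on a large unlabelled set.
# The pretext task here is linear reconstruction, whose closed-form
# optimum is PCA, so the encoder weights come straight from an SVD.
X_unlabelled = rng.normal(size=(500, 10))
Xc = X_unlabelled - X_unlabelled.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W_encoder = Vt[:4].T                          # (10, 4) pre-trained encoder

# Stage 2: supervised fine-tuning on a small labelled set, training a
# logistic-regression head on features from the pre-trained encoder.
X_labelled = rng.normal(size=(40, 10))
y = (X_labelled @ np.arange(10) > 0).astype(float)  # toy labels
Z = X_labelled @ W_encoder                    # (40, 4) encoded features

def log_loss(w):
    p = 1.0 / (1.0 + np.exp(-(Z @ w)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

w_head = np.zeros(4)
loss_before = log_loss(w_head)                # log(2) at initialization
for _ in range(500):                          # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-(Z @ w_head)))
    w_head -= 0.1 * Z.T @ (p - y) / len(y)
loss_after = log_loss(w_head)
```

The key point of the paradigm is that the expensive stage (pre-training) consumes only unlabelled data; annotations are needed only for the comparatively small fine-tuning stage.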
Andrej investigated a pre-training method that leverages images as an additional modality by learning self-supervised image features that can be used to pre-train a 3D model. An advantage of incorporating visual data into the pre-training pipeline is that only a single point cloud scan and the corresponding images are required during pre-training. Despite using single scans, Andrej's method performs competitively with approaches that use overlapping point cloud scans, and it yields more consistent performance gains than other related algorithms.
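A prerequisite for this kind of cross-modal pre-training is pairing each 3D point with an image feature. One common way to do this, sketched below under an assumed pinhole-camera model (the exact pipeline used in this work may differ), is to project the points into the image with the camera intrinsics and take the feature at the resulting pixel as that point's target; a 3D encoder would then be trained to match these targets.

```python
import numpy as np

def project_points(points, K):
    """Project camera-frame 3D points (N, 3) to pixel coords via intrinsics K (3, 3)."""
    uvw = (K @ points.T).T            # (N, 3) homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide -> (N, 2) pixels

# Toy data: points in front of the camera and a grid of image features.
rng = np.random.default_rng(0)
points = rng.uniform([-1, -1, 2], [1, 1, 4], size=(8, 3))   # z > 0
K = np.array([[100.0,   0.0, 64.0],
              [  0.0, 100.0, 64.0],
              [  0.0,   0.0,  1.0]])
image_feats = rng.normal(size=(128, 128, 16))               # (H, W, C) feature map

uv = project_points(points, K)
cols = np.clip(uv[:, 0].round().astype(int), 0, 127)
rows = np.clip(uv[:, 1].round().astype(int), 0, 127)
targets = image_feats[rows, cols]                           # (N, C) per-point targets
```

Because each point only needs the images captured alongside the same scan, this pairing works on a single scan, with no overlapping point clouds required, which matches the data requirement described above.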