All you need are a few pixels: semantic segmentation with PixelPick

Gyungin Shin
Weidi Xie
Samuel Albanie
[Paper]
[GitHub]


With only 10 labelled pixels per image, segmentation models trained with PixelPick achieve promising visual quality, which improves further to capture fine details (e.g. the cleanly segmented thin lamppost in the bottom-right image) as more labelled pixels are used.

Abstract

A central challenge for the task of semantic segmentation is the prohibitive cost of obtaining dense pixel-level annotations to supervise model training. In this work, we show that in order to achieve a good level of segmentation performance, all you need are a few well-chosen pixel labels. We make the following contributions: (i) We investigate the novel semantic segmentation setting in which labels are supplied only at sparse pixel locations, and show that deep neural networks can use a handful of such labels to good effect; (ii) We demonstrate how to exploit this phenomenon within an active learning framework, termed PixelPick, to radically reduce labelling cost, and propose an efficient “mouse-free” annotation strategy to implement our approach; (iii) We conduct extensive experiments to study the influence of annotation diversity under a fixed budget, model pretraining, model capacity and the sampling mechanism for picking pixels in this low annotation regime; (iv) We provide comparisons to the existing state of the art in semantic segmentation with active learning, and demonstrate comparable performance with up to two orders of magnitude fewer annotations on the CamVid, Cityscapes and PASCAL VOC 2012 benchmarks; (v) Finally, we evaluate the efficiency of our annotation pipeline and its sensitivity to annotator error to demonstrate its practicality. Our code, models and annotation tool will be made publicly available.
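
To make the sparse-label setting concrete, below is a minimal sketch of how training with only a handful of labelled pixels per image can be implemented, assuming unlabelled pixels are marked with an ignore index; the function name and constant are illustrative and not taken from the released code.

# A minimal sketch of training with sparse pixel supervision, assuming
# unlabelled pixels carry an ignore index so the loss only touches the
# few labelled locations. Names are illustrative.
import torch
import torch.nn.functional as F

IGNORE_INDEX = 255  # label value used for the (vast majority of) unlabelled pixels

def sparse_pixel_loss(logits: torch.Tensor, sparse_labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy computed over labelled pixels only.

    logits: (N, C, H, W) model predictions.
    sparse_labels: (N, H, W) class indices, IGNORE_INDEX everywhere unlabelled.
    """
    return F.cross_entropy(logits, sparse_labels, ignore_index=IGNORE_INDEX)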


How does it work?


Given a database of unlabelled pixels of interest (top-left), each image is fed to a segmentation model to produce pixel-wise class probabilities (top-middle), which are in turn passed to an acquisition function to estimate per-pixel uncertainties and select a batch of \(B\) pixels to be labelled (top-right). The queries are sent to annotators (bottom-right), and the resulting labels are added to the labelled pixel database, \(D_L\) (bottom-middle). Finally, the segmentation model is retrained on the expanded database (bottom-left), before the cycle repeats. To bootstrap the process and train the initial segmentation model, we randomly sample \(B\) pixels and send them to be annotated.
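
As an illustration of the acquisition step, the sketch below scores every unlabelled pixel of an image by the entropy of its predicted class distribution and returns the \(B\) most uncertain ones; the paper also studies alternative acquisition functions such as margin sampling and least confidence, and the function name here is an assumption rather than part of the released code.

# A minimal sketch of an entropy-based acquisition function, assuming
# per-image selection. `select_pixels` is an illustrative name.
import torch
import torch.nn.functional as F

def select_pixels(logits: torch.Tensor, labelled_mask: torch.Tensor, budget: int) -> torch.Tensor:
    """Pick the `budget` most uncertain unlabelled pixels of one image.

    logits: (C, H, W) raw class scores from the segmentation model.
    labelled_mask: (H, W) boolean tensor, True where a pixel is already labelled.
    Returns flat (H*W) indices of the selected pixels.
    """
    probs = F.softmax(logits, dim=0)                                   # (C, H, W) class probabilities
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=0)       # (H, W) per-pixel uncertainty
    entropy = entropy.masked_fill(labelled_mask, float("-inf"))        # never re-query labelled pixels
    return torch.topk(entropy.flatten(), k=budget).indices             # queries sent for annotation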


Results on CamVid and Cityscapes benchmarks


We show that deep neural networks can obtain remarkable performance with just a handful of labelled pixels per image, whose spatial coordinates are proposed by the model rather than by a human annotator. We compare our approach, PixelPick, with existing active learning and semi-supervised approaches on the CamVid (left) and Cityscapes (right) benchmarks.


PixelPick mouse-free annotation tool


Annotators are asked to classify the highlighted pixel (shown in red) by pressing the keyboard key assigned to the corresponding class for the dataset. The tool then highlights the next pixel proposal and the process repeats.
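
As a hypothetical illustration of this mouse-free loop, the sketch below maps single key presses to class labels for each queried pixel; the key map and the show_query/read_key helpers are assumptions made for the example and are not the released tool.

# A hypothetical sketch of the mouse-free labelling loop: each queried pixel
# is shown in turn and the annotator presses the key assigned to a class.
KEY_TO_CLASS = {"r": "road", "b": "building", "s": "sky", "c": "car"}  # e.g. a CamVid subset

def annotate(queries, show_query, read_key):
    """Collect one class label per queried pixel via single key presses."""
    labels = {}
    for image_id, (y, x) in queries:
        show_query(image_id, y, x)          # highlight the proposed pixel in red
        key = read_key()                    # wait for a single key press
        while key not in KEY_TO_CLASS:      # ignore keys without an assigned class
            key = read_key()
        labels[(image_id, y, x)] = KEY_TO_CLASS[key]
    return labels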

Paper and Supplementary Material

Gyungin Shin, Weidi Xie, Samuel Albanie
All you need are a few pixels: semantic segmentation with PixelPick

[Bibtex]


Acknowledgements

GS is supported by AI Factory, Inc. in Korea. WX and SA are supported by Visual AI (EP/T028572/1). The authors would like to thank Tom Gunter for suggestions. SA would also like to thank Z. Novak and S. Carlson for support. The design of this project page was borrowed and modified from the template made by Phillip Isola and Richard Zhang.