Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), Microsoft, and Cornell University have developed a model to discover and localise semantically meaningful categories within image corpora without any annotation. 

The ML model, STEGO (Self-supervised Transformer with Energy-based Graph Optimisation), produces features for each pixel that are semantically meaningful and compact enough to generate distinct clusters. As a result, the model can detect and delineate objects at a finer granularity than classification or object detection systems. 

STEGO architecture

STEGO performs semantic segmentation: the process of assigning a label to every pixel in an image. The model is built on top of the DINO algorithm, which learned about the world through 14 million images from the ImageNet database. STEGO refines the DINO algorithm through a learning process that emulates how human brains piece together information to make sense of the world.
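
For readers who want to poke at this starting point, the sketch below loads a pretrained DINO ViT backbone from the official facebookresearch/dino torch.hub entry point and extracts one feature vector per image patch. The image path and preprocessing values are placeholders, not part of the STEGO pipeline.

```python
import torch
from PIL import Image
from torchvision import transforms

# Load a self-supervised DINO ViT-Small backbone (pretrained on ImageNet).
# 'dino_vits8' is one of the variants published by facebookresearch/dino.
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits8")
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)

with torch.no_grad():
    # get_intermediate_layers returns per-token features; dropping the CLS
    # token leaves one feature vector per 8x8 image patch.
    tokens = backbone.get_intermediate_layers(img, n=1)[0]   # (1, 1 + 28*28, 384)
    patch_feats = tokens[:, 1:, :]                            # (1, 784, 384)
```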

Figure: STEGO architecture (Source: arxiv.org)

STEGO consists of a frozen backbone that provides a source of learning feedback and helps predict distilled features. The segmentation head is a simple feed-forward network with rectified linear unit (ReLU) activations. Because the backbone's weights are never updated, the model is efficient to train: for example, it can be trained on a single NVIDIA V100 GPU in under two hours.
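
A minimal PyTorch sketch of this setup, assuming 384-dimensional backbone features and an arbitrary code dimension (both illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Small feed-forward head mapping each backbone feature vector
    to a lower-dimensional segmentation code."""
    def __init__(self, in_dim: int = 384, code_dim: int = 70):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.ReLU(inplace=True),
            nn.Linear(in_dim, code_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches, in_dim) -> (batch, num_patches, code_dim)
        return self.mlp(feats)

# Freeze the backbone (e.g. the DINO model loaded above) so that only the
# lightweight head receives gradient updates.
for p in backbone.parameters():
    p.requires_grad = False

head = SegmentationHead()
optimizer = torch.optim.Adam(head.parameters(), lr=5e-4)
```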

The frozen backbone also extracts global image features by applying global average pooling (GAP) to its spatial features: GAP(f). A lookup table of each image's K-nearest neighbours (KNNs) is then built using cosine similarity in the backbone's feature space. Each training minibatch consists of a collection of random images x and nearest neighbours x_knn, sampled randomly from each image's top seven KNNs. The team also sampled random images x_rand by shuffling x and ensuring that no image matched itself.
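
A rough sketch of how such a KNN lookup table could be built, assuming pre-extracted backbone features; the function and parameter names are illustrative, not from the authors' code:

```python
import torch
import torch.nn.functional as F

def build_knn_table(patch_feats: torch.Tensor, k: int = 7) -> torch.Tensor:
    """patch_feats: (num_images, num_patches, dim) backbone features.
    Returns (num_images, k) indices of each image's nearest neighbours."""
    # Global average pooling over the spatial (patch) dimension: GAP(f).
    global_feats = patch_feats.mean(dim=1)            # (N, dim)
    global_feats = F.normalize(global_feats, dim=1)   # unit norm -> dot product = cosine similarity
    sim = global_feats @ global_feats.t()             # (N, N) cosine similarities
    sim.fill_diagonal_(-float("inf"))                 # exclude self-matches
    return sim.topk(k, dim=1).indices                 # each image's top-k neighbours

# During training, each image x is paired with one neighbour x_knn drawn at
# random from its top-7 list, plus a shuffled image x_rand.
```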

The resultant model’s full loss becomes:

L = λ_self · L_corr(x, x, b_self) + λ_knn · L_corr(x, x_knn, b_knn) + λ_rand · L_corr(x, x_rand, b_rand)

where the three terms couple each image with itself, with its K-nearest neighbours, and with random images, respectively.

Here, the λ’s and the b’s manage the balance of the learning signals and the ratio of positive to negative pressure. The b parameters tended to be dataset- and network-specific, but the system is kept roughly in balance between positive and negative pressure.
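
A simplified sketch of this loss in PyTorch, assuming a feature-correspondence term that compares the cosine-similarity structure of the backbone features with that of the segmentation codes; the weights, biases, and normalisation below are illustrative simplifications of the paper's formulation:

```python
import torch
import torch.nn.functional as F

def corr_loss(f1, f2, s1, s2, b):
    """Feature-correspondence loss between an image pair.
    f1, f2: backbone features (patches, dim); s1, s2: segmentation codes.
    Pushes codes to agree where backbone features are more similar than the
    bias b, and to disagree where they are less similar."""
    F_sim = F.normalize(f1, dim=-1) @ F.normalize(f2, dim=-1).t()  # backbone correspondences
    S_sim = F.normalize(s1, dim=-1) @ F.normalize(s2, dim=-1).t()  # code correspondences
    return -((F_sim - b) * S_sim.clamp(min=0)).mean()

def stego_loss(f, f_knn, f_rand, s, s_knn, s_rand,
               l_self=1.0, l_knn=1.0, l_rand=1.0,
               b_self=0.1, b_knn=0.2, b_rand=1.0):
    # L = lambda_self * L_corr(x, x)  +  lambda_knn * L_corr(x, x_knn)
    #     +  lambda_rand * L_corr(x, x_rand)
    return (l_self * corr_loss(f, f, s, s, b_self)
            + l_knn * corr_loss(f, f_knn, s, s_knn, b_knn)
            + l_rand * corr_loss(f, f_rand, s, s_rand, b_rand))
```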

The images in the ‘CocoStuff’ and ‘Cityscapes’ datasets are riddled with small objects that are hard to identify at a feature resolution of 40×40. To handle these small objects while maintaining fast training times, the images are five-cropped before the KNNs are learned. The technique allows the network to examine the images in greater detail and improves the quality of the KNNs; five-cropping improves both the Cityscapes results and the CocoStuff segmentations. The final components of the architecture are clustering and CRF refinement. STEGO’s segmentation features tend to form clear clusters due to its feature distillation process. To extract these clusters and compute concrete class assignments from STEGO’s continuous features, a cosine-distance-based minibatch K-means algorithm is applied. The cluster labels are then refined with a conditional random field (CRF) to further improve their spatial resolution.
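
The clustering step can be approximated with scikit-learn by L2-normalising the segmentation codes so that ordinary minibatch K-means behaves like cosine-distance clustering; the cluster count below is a placeholder, and the CRF refinement is omitted:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_codes(codes: np.ndarray, n_classes: int = 27) -> np.ndarray:
    """codes: (num_pixels, code_dim) continuous segmentation features.
    Returns a discrete class label per pixel."""
    # L2-normalise so that Euclidean K-means on the unit sphere is
    # equivalent to clustering by cosine distance.
    codes = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    kmeans = MiniBatchKMeans(n_clusters=n_classes, batch_size=4096)
    return kmeans.fit_predict(codes)
```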

Performance

STEGO offers better results on linear probe and unsupervised clustering metrics across various datasets than SOTA models such as PiCIE, Deep Cluster, and InMars. Despite the backbone not being fine-tuned on these two datasets, DINO’s self-supervised ImageNet weights are enough to resolve both settings simultaneously. STEGO also proves superior to simply clustering the features from unmodified DINO, MoCoV2, and ImageNet-supervised ResNet50 backbones.
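
The unsupervised clustering metric requires matching predicted clusters to ground-truth classes before accuracy can be computed; a standard way to do this (not code from the paper) is Hungarian matching over the confusion matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_matched_accuracy(preds: np.ndarray, labels: np.ndarray,
                               n_classes: int) -> float:
    """Best one-to-one assignment of predicted cluster ids to ground-truth
    class ids (both assumed to lie in [0, n_classes))."""
    confusion = np.zeros((n_classes, n_classes), dtype=np.int64)
    for p, t in zip(preds, labels):
        confusion[p, t] += 1
    row, col = linear_sum_assignment(confusion, maximize=True)
    return confusion[row, col].sum() / len(labels)
```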

Table: Performance comparison (Source: arxiv.org)

Conclusion

The researchers showed that modern self-supervised visual backbones can be refined to yield state-of-the-art unsupervised semantic segmentation. However, despite these improvements, STEGO still faces bottlenecks, such as labelling issues.

Using an unsupervised methodology, the researchers plan to take a large corpus of images and classify each pixel into an accurate and consistent ontology of objects.