It is quite common for machine learning models to perform well during training yet perform poorly once deployed. One important cause of this gap is underspecification, where models exhibit unexpected behaviour when deployed in real-world scenarios.

For instance, computer vision (CV) models can show surprising sensitivity to irrelevant features, while natural language processing (NLP) models can depend unpredictably on demographic correlations not directly indicated by the text. Some causes of these failures are well known, such as training ML models on badly curated data or training them to solve prediction problems that are structurally different from the application domain. However, even when these known problems are addressed, models can still behave inconsistently in deployment.

Shedding light on some of these key failure modes, Google researchers recently released a paper titled ‘Underspecification Presents Challenges for Credibility in Modern Machine Learning’, published in the Journal of Machine Learning Research. In this study, the researchers show that underspecification appears in a wide variety of practical ML systems and suggest strategies for mitigation.

What is underspecification? 

Underspecification refers to the gap between the requirements practitioners have in mind when they build a machine learning model and the requirements actually enforced by the ML pipeline, that is, the design and implementation of the model.

Even if the pipeline could return a model that meets all of these requirements, there is no guarantee that the returned model will satisfy any requirement beyond accurate prediction on held-out data. Its other properties may instead depend on arbitrary choices made in the implementation of the ML pipeline, such as random initialisation seeds, data ordering and hardware.

How to identify underspecification in the real world? 

To identify underspecification in real-world applications, the Google researchers constructed sets of models using nearly identical ML pipelines. The pipelines differed only in small changes that had no practical effect on standard validation performance, focusing on the random seed used to initialise training, which also determined data ordering.

“If important properties of the model can be influenced by these changes, it indicates that the pipeline doesn’t fully specify this real-world behaviour,” said the researchers, stating that in every domain where they conducted this experiment, they found that these minute changes induced substantial variation on axes that matter in real-world scenarios. 
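The basic recipe is straightforward to reproduce at a small scale. The sketch below is a minimal, self-contained illustration (not Google's code; the dataset, the model and the 'shift' are toy stand-ins): it trains several models that differ only in their random seed and compares how much their scores vary on i.i.d. validation data versus on a crudely shifted copy of that data.

```python
# Toy illustration of the seed-variation experiment: identical pipeline,
# identical data, only the random seed changes between runs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4000, n_features=20, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Crude stand-in for a distribution shift: add noise to the validation inputs.
X_shifted = X_val + np.random.default_rng(0).normal(0.0, 1.0, X_val.shape)

for seed in range(5):  # only the seed differs between these "pipelines"
    model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                          random_state=seed).fit(X_train, y_train)
    print(f"seed={seed}  validation={model.score(X_val, y_val):.3f}  "
          f"shifted={model.score(X_shifted, y_val):.3f}")
```

If the validation column is essentially flat while the shifted column moves around from seed to seed, the pipeline is underspecified with respect to that shift, which is the signature the researchers looked for at much larger scale.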

Computer Vision

The researchers said that a central challenge in computer vision (CV) is that deep models are often brittle under distribution shifts that humans do not find challenging. For example, image classification models that perform well on the ImageNet benchmark are known to perform badly on benchmarks like ImageNet-C, which apply common image corruptions, such as pixelation or motion blur, to the standard ImageNet test set.

The researchers showed that standard pipelines underspecify model sensitivity to these corruptions. Following the strategy described above, they generated fifty ResNet-50 image classification models using the same pipeline and the same data; the only difference between these models was the random seed used in training.

When evaluated on the standard ImageNet test set, these models achieved practically equivalent performance. But when evaluated on the different test sets of the ImageNet-C benchmark (i.e. on corrupted data), performance on some tests varied by orders of magnitude more than on standard validation. This pattern held for larger-scale models (BiT-L) pre-trained on a much larger dataset (the 300-million-image JFT-300M dataset): by varying the random seed at the fine-tuning stage of training, the researchers produced a similar pattern of variation (as shown below).

The Risk Of Underspecification In ML (Source: Google)
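To make the setup concrete, an evaluation loop for such a seed ensemble might look like the following sketch. The checkpoint filenames and the test_loaders dictionary (one DataLoader for the clean ImageNet validation set and one per ImageNet-C corruption) are assumptions for illustration only, not the paper's actual code.

```python
# Hedged sketch: score every seed's checkpoint on clean and corrupted test
# sets, then report the spread across seeds for each test set.
import torch
from torchvision.models import resnet50

@torch.no_grad()
def accuracy(model, loader, device="cuda"):
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Fifty checkpoints trained identically except for the random seed (hypothetical filenames).
checkpoint_paths = [f"resnet50_seed{i}.pt" for i in range(50)]

# test_loaders is assumed to be defined elsewhere, mapping names such as
# "imagenet_val" or "imagenet_c/pixelate" to the corresponding DataLoaders.
results = {name: [] for name in test_loaders}
for path in checkpoint_paths:
    model = resnet50(weights=None)
    model.load_state_dict(torch.load(path, map_location="cpu"))
    for name, loader in test_loaders.items():
        results[name].append(accuracy(model, loader))

for name, accs in results.items():
    print(f"{name}: min={min(accs):.4f}  max={max(accs):.4f}  spread={max(accs) - min(accs):.4f}")
```

The quantity of interest is the spread column: it should be negligible on the clean test set, while a much larger spread on the corrupted test sets is the signature of underspecification the paper reports.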

In addition to this, the researchers showed that underspecification can have practical implications in special-purpose computer vision models built for medical imaging applications:

  1. An ophthalmology pipeline for building models that detect diabetic retinopathy and referable diabetic macular edema from retinal fundus images 
  2. A dermatology pipeline for building models that recognise common dermatological conditions from photographs of skin 

For the ophthalmology pipeline, the researchers tested how models trained with different random seeds performed when applied to images taken from a new camera. The stress test for the dermatology pipeline was similar, but examined performance on patients with different estimated skin types (i.e. a non-dermatologist evaluation of skin tone and response to sunlight).

The Risk Of Underspecification In ML (Source: Google)

In both cases, the researchers found that standard validations were insufficient to fully specify the trained models' performance on these axes; standard held-out testing alone is not enough to ensure acceptable model behaviour in medical applications.

This underscores the need for expanded testing protocols for ML systems intended for use in the medical domain. In the medical literature, such validations are termed ‘external validation’ and have traditionally been part of reporting guidelines like STARD and TRIPOD; they are being emphasised further in updates such as STARD-AI and TRIPOD-AI.

Other Applications 

Besides computer vision, other cases that Google researchers examined include: 

  • NLP: Across a variety of NLP tasks, underspecification affected how models derived from BERT processed sentences. For instance, depending on the random seed, a pipeline could produce a model that depends more or less on correlations involving gender when making predictions. 
  • Acute kidney injury (AKI) prediction: Here, the researchers said underspecification affects reliance on operational versus physiological signals in AKI prediction models based on electronic health records (EHR). 
  • Polygenic risk scores (PRS): The researchers showed that underspecification affects the ability of PRS models, which predict clinical outcomes from patient genomic data, to generalise across different patient populations. 

In each of these cases, the Google researchers said, important model properties were left ill-defined by the standard training pipeline, making them sensitive to seemingly innocuous choices.

Wrapping up 

Solving underspecification is a cumbersome process. It requires fully specifying and testing requirements for a model beyond standard predictive performance. Doing so demands engaging with the context in which the model will be used, understanding how the training data was collected, and incorporating domain expertise when the available data falls short.

Google researchers said that these areas of ML system design are often overlooked in ML research today. An important step, therefore, is to specify stress-testing protocols for any applied ML pipeline intended for real-world use. Once these criteria are codified in measurable metrics, several algorithmic strategies may improve them, including data augmentation, pretraining, and the incorporation of causal structure.
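As a toy illustration of what codifying such criteria in measurable metrics could look like, the sketch below (the slice names, numbers and thresholds are invented for the example, not taken from the paper) treats a stress test as an explicit acceptance criterion over a set of seed-trained models, requiring both a minimum worst-case score and a bounded spread across seeds on every test slice.

```python
# Hedged example of a codified stress-test criterion: every test slice must
# clear a worst-case accuracy floor, and the seed-to-seed spread must stay small.
def passes_stress_test(per_slice_scores, min_worst_case=0.85, max_spread=0.03):
    """per_slice_scores maps each test slice to the accuracy of every seed's model."""
    ok = True
    for slice_name, scores in per_slice_scores.items():
        worst, spread = min(scores), max(scores) - min(scores)
        print(f"{slice_name:>15}: worst={worst:.3f}  spread={spread:.3f}")
        if worst < min_worst_case or spread > max_spread:
            ok = False
    return ok

# Invented numbers: near-identical i.i.d. accuracy, but one stress slice
# varies far more across seeds and drags the worst case down.
report = {
    "iid_validation": [0.921, 0.922, 0.920, 0.923],
    "new_camera":     [0.880, 0.840, 0.910, 0.790],
}
print("acceptable" if passes_stress_test(report) else "needs further work")
```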

The team said that ideal stress-testing and improvement processes would usually require iteration, because both the requirements of ML systems and the world in which they are used are constantly changing.