Top 7 Checkpoints To Consider During Machine Learning Production
A major challenge for any company starting out in data-driven markets is deploying machine learning pipelines at full scale for its products.
To get the most out of AI, it is necessary to build service-specific tools and frameworks in addition to the existing models. The best strategy varies from product to product, but the rubrics of machine learning stay the same.

To democratise the use of machine learning, Google has condensed its years of research into a paper titled “A Rubric for ML Production Readiness”, which lists its findings in the form of 28 specific tests that have shown promising results. Here are a few of those best practices that teams can adopt in their work culture:
Offline Proxy Metrics
A machine learning system is trained to optimise loss metrics, such as log-loss or squared error. A strong understanding of the relationship between these offline proxy metrics and the actual impact metrics is needed to ensure that a better-scoring model will result in a better production system. The offline/online metric relationship can be measured in one or more small-scale A/B experiments using an intentionally degraded model.
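As a rough illustration of checking that relationship, the sketch below invents offline log-loss and online click-through figures for a production model plus a few intentionally degraded variants, and simply checks how strongly the two move together; the numbers and variable names are made up for the example.

```python
# A hedged sketch: check how well an offline proxy metric (log-loss) tracks an
# online impact metric (here, click-through rate) across a production model and
# intentionally degraded variants. The numbers below are invented for illustration.
import numpy as np

# One entry per experiment arm: the production model plus degraded variants.
offline_log_loss = np.array([0.42, 0.47, 0.55, 0.63])   # lower is better
online_ctr = np.array([0.031, 0.029, 0.026, 0.022])     # higher is better

# A strong negative correlation suggests the offline proxy is a trustworthy
# stand-in for the live metric; a weak one means offline wins may not transfer.
corr = np.corrcoef(offline_log_loss, online_ctr)[0, 1]
print(f"offline/online correlation: {corr:.2f}")
```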
Checking For Model Staleness
Many production ML systems encounter rapidly changing non-stationary data. Examples include content recommendation systems and financial ML applications. For such systems, if the pipeline fails to train and deploy sufficiently up-to-date models, the model can be called stale. Understanding how model staleness affects the quality of predictions is necessary to determine how frequently to update the model.
If predictions are based on a model trained yesterday versus last week versus last year, what is the impact on the live metrics of interest? Most models need to be updated eventually to account for changes in the external world; a careful assessment is important to decide how often to perform the updates.
One way of testing the impact of staleness is with a small A/B experiment with older models. Testing a range of ages can provide an age-versus-quality curve to help understand what amount of staleness is tolerable.
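A minimal sketch of building such an age-versus-quality curve is shown below, using synthetic drifting data generated on the fly; a real system would instead load stored model snapshots of different ages and score them on current traffic.

```python
# A hedged sketch of an age-versus-quality curve on synthetic, drifting data:
# train one model per historical cutoff, then score every model on "today's" data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_days, n_per_day, n_features = 30, 200, 5

# Simulate non-stationary data: the true decision boundary drifts a little each day.
days = []
for day in range(n_days):
    X = rng.normal(size=(n_per_day, n_features))
    w = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) + 0.05 * day * np.array([0.0, 0.0, 1.0, 0.0, 0.0])
    y = (X @ w + 0.3 * rng.normal(size=n_per_day) > 0).astype(int)
    days.append((X, y))

X_eval, y_eval = days[-1]   # evaluate everything on the most recent day's data

for age in (1, 7, 21):      # model trained with data up to `age` days ago
    X_train = np.vstack([X for X, _ in days[: n_days - age]])
    y_train = np.concatenate([y for _, y in days[: n_days - age]])
    model = LogisticRegression().fit(X_train, y_train)
    auc = roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])
    print(f"model age {age:2d} days -> AUC on today's data: {auc:.3f}")
```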
Testing for ML Infrastructure
Model training is often not reproducible in practice, especially when working with non-convex methods such as deep learning or even random forests. This can manifest as a change in aggregate metrics across an entire dataset, or, even if the aggregate performance appears the same from run to run, as changes on individual examples.
Random number generation is an obvious source of non-determinism, which can be alleviated with seeding. But even with proper seeding, initialisation order can be underspecified, so that different portions of the model are initialised at different times on different runs, leading to non-determinism.
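A minimal seeding sketch in Python, assuming a NumPy/PyTorch stack; other frameworks, GPU kernels and data-loader workers may need their own seeds depending on the setup.

```python
# Pin the obvious sources of randomness before a test or training run.
import random

import numpy as np
import torch

SEED = 42
random.seed(SEED)        # Python's built-in RNG
np.random.seed(SEED)     # NumPy's global RNG
torch.manual_seed(SEED)  # PyTorch RNGs
```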
Unit tests should run quickly and require no external dependencies, but model training is often a very slow process that involves pulling in lots of data from many sources.
It’s useful to distinguish two kinds of model tests: tests of API usage and tests of algorithmic correctness.
A simple unit test that generates random input data and trains the model for a single step of gradient descent is quite powerful for detecting a host of common library mistakes, resulting in a much faster development cycle. Another useful assertion is that a model can restore from a checkpoint after a mid-training job crash.
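A hedged sketch of such API-usage tests, assuming a PyTorch model and pytest-style tests; the tiny model and file names are illustrative stand-ins rather than anything prescribed by the paper.

```python
# Minimal API-usage tests: one gradient step updates weights, and a model can
# be restored from a checkpoint.
import os
import tempfile

import torch
import torch.nn as nn


def build_model(n_features: int = 8) -> nn.Module:
    # Tiny stand-in model; a real test would import the production model class.
    return nn.Sequential(nn.Linear(n_features, 4), nn.ReLU(), nn.Linear(4, 1))


def test_single_gradient_step_updates_weights():
    torch.manual_seed(0)
    model = build_model()
    x = torch.randn(16, 8)   # random input data
    y = torch.randn(16, 1)   # random targets
    optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.MSELoss()

    before = [p.clone() for p in model.parameters()]
    optimiser.zero_grad()
    loss_fn(model(x), y).backward()
    optimiser.step()         # one step of gradient descent

    # At least one trainable parameter should have moved; this catches many
    # common wiring and shape mistakes cheaply.
    assert any(not torch.equal(b, p) for b, p in zip(before, model.parameters()))


def test_model_restores_from_checkpoint():
    torch.manual_seed(0)
    model = build_model()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ckpt.pt")
        torch.save(model.state_dict(), path)   # simulate a mid-training checkpoint

        restored = build_model()
        restored.load_state_dict(torch.load(path))
        for p, q in zip(model.parameters(), restored.parameters()):
            assert torch.equal(p, q)
```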
Testing the correctness of a novel implementation of an ML algorithm is more difficult, but still necessary: it is not sufficient that the code produces a model with high-quality predictions; it must do so for the expected reasons.
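One hedged way to test for “the expected reasons” is to run the implementation on synthetic data with known true parameters and assert that those parameters are recovered, rather than only checking that predictions look good. The gradient-descent linear regression below is a toy illustration of that pattern, not a specific production algorithm.

```python
# Algorithmic-correctness test: recover known weights from synthetic data.
import numpy as np


def fit_linear_regression_gd(X, y, lr=0.1, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w


def test_recovers_known_weights():
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -3.0, 0.5])
    X = rng.normal(size=(1000, 3))
    y = X @ true_w + 0.01 * rng.normal(size=1000)   # tiny noise

    w_hat = fit_linear_regression_gd(X, y)
    np.testing.assert_allclose(w_hat, true_w, atol=0.05)
```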
Testing For ML Pipeline Integration
Although a single engineering team may be focused on a small part of the process, each stage can introduce errors that may affect subsequent stages, possibly even several stages away. That means there must be a fully automated test that runs regularly and exercises the entire pipeline, validating that data and code can successfully move through each stage and that the resulting model performs well.
The integration test should run both continuously and with new releases of models or servers, in order to catch problems well before they reach production. Faster-running integration tests with a subset of the training data or a simpler model can give quicker feedback to developers, backed by less frequent, long-running versions with a setup that more closely mirrors production.
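A minimal sketch of such a fast-feedback smoke test is shown below; tiny synthetic data and scikit-learn stand-ins play the roles of the real pipeline stages, and the sanity threshold is an illustrative assumption.

```python
# End-to-end smoke test: data flows through every stage and the result learns.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def generate_sample_data(n=500, seed=0):
    # Stand-in for sampling a small slice of the real training data.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X, y


def test_pipeline_end_to_end_on_small_sample():
    X, y = generate_sample_data()                                 # data stage
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0
    )                                                             # split stage
    model = LogisticRegression().fit(X_tr, y_tr)                  # training stage
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])    # evaluation stage

    # Loose sanity bar: the pipeline produced a model that actually learned.
    assert auc > 0.8
```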
Internal Hyperparameter Tuning Service
Having an internal tuning service would not only improve prediction quality but would also help in uncovering hidden insights from the data.
Methods such as grid search, or a more sophisticated hyperparameter search strategy, are recommended.
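As a hedged illustration, here is a plain grid search with scikit-learn's GridSearchCV; the task, model and parameter grid are placeholders for whatever an internal tuning service would actually search over.

```python
# Grid search over a small, illustrative parameter grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV AUC:", round(search.best_score_, 3))
```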
Ensuring A Quick Roll-back
Being able to quickly revert to a previously known-good state is as crucial with ML models as with any other aspect of a serving system. Because rolling back is an emergency procedure, operators should practise doing it.
To mitigate new-model risk more generally, new models can be turned up gradually: the old and new models run concurrently, with the new model initially seeing only a small fraction of traffic that is increased as it is observed to behave sanely.
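A hedged sketch of that gradual ramp-up follows; the model callables and ramp schedule are illustrative and do not correspond to any particular serving framework's API.

```python
# Gradual canary rollout with a rollback path.
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class CanaryRouter:
    old_model: Callable[[dict], float]
    new_model: Callable[[dict], float]
    new_traffic_fraction: float = 0.01   # start with a small slice of traffic

    def predict(self, request: dict) -> float:
        # Route a random fraction of requests to the new model.
        if random.random() < self.new_traffic_fraction:
            return self.new_model(request)
        return self.old_model(request)

    def ramp_up(self, step: float = 0.05) -> None:
        # Increase the new model's share only after its live metrics look sane.
        self.new_traffic_fraction = min(1.0, self.new_traffic_fraction + step)

    def roll_back(self) -> None:
        # Emergency path: send all traffic back to the known-good model.
        self.new_traffic_fraction = 0.0


# Illustrative usage with dummy scoring functions.
router = CanaryRouter(old_model=lambda r: 0.5, new_model=lambda r: 0.7)
print(router.predict({"feature": 1.0}))
```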
Notifying Dependency Changes
It can be difficult to effectively monitor the internal behaviour of a learned model for correctness, but the input data should be more transparent. Consequently, analysing and comparing data sets is the first line of defence for detecting problems where the world is changing in ways that can confuse an ML system.
Partial outages, version upgrades, and other changes in the source system can radically change the feature’s meaning and thus confuse the model’s training or inference, without necessarily producing values that are strange enough to trigger other monitoring.
In practice, alerting thresholds need careful tuning to balance false positive and false negative rates so that the alerts remain useful and actionable.
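As a minimal illustration, the sketch below compares a feature's training-time and serving-time distributions with a two-sample Kolmogorov-Smirnov test and raises an alert past a threshold; the data and the threshold value are invented for the example and would need per-feature tuning in practice.

```python
# Input-data drift check for a single feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)   # reference distribution
serving_feature = rng.normal(loc=0.3, scale=1.0, size=10_000)    # today's incoming data

statistic, p_value = ks_2samp(training_feature, serving_feature)

# Alert when the serving distribution drifts too far from the training distribution.
DRIFT_THRESHOLD = 0.1   # tune per feature to balance false positives and negatives
if statistic > DRIFT_THRESHOLD:
    print(f"ALERT: feature drift detected (KS statistic = {statistic:.3f})")
```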
For example, at LinkedIn the blueprint of a machine learning model consists more or less of the same procedures: data collection, processing, training and testing the models, and so on.
The most crucial data for LinkedIn comes from user activity around jobs, such as jobs saved and connections made.
At LinkedIn, the ML team builds a domain-specific language (DSL) and then uses Jupyter notebooks to integrate the selected features and tune parameters.
Most of the model training occurs offline, where the ML teams train and retrain the models every few hours using Hadoop. LinkedIn's own Pro-ML training service is updated with newer model types for hyperparameter tuning, and it leverages Azkaban and Spark to ensure that there is no missing input data.
Every machine learning lifecycle starts with a business problem. It can be about increasing sales, making predictions, cutting costs, or whatever it is that brings profits to the organisation. Once a business problem is classified as an artificial intelligence or machine learning problem, questions arise such as whether the ML effort will be profitable, whether the output maintains ML fairness, and other such potential challenges.
Answering these questions is crucial if an organisation is to meet the growing demands of its customers.




