An Overview of Model Drift in Machine Learning

Sabyasachi Ghosh
6 min read · Nov 1, 2022


Data drift, concept drift & model retraining: why is it needed?

Photo by Danny Sleeuwenhoek on Unsplash

Change is Constant:

The world is dynamic, and data is constantly changing, be it in volume, quality, integrity, or scale. Machine learning models trained on today's data may not hold good for tomorrow.

Machine learning models are trained on historical data. A model's performance is only as good as the data it was trained on, and a model becomes obsolete when that underlying data changes and the model loses its predictive power. Further, a change in the environment can alter the relationship between the model's variables. There is therefore a need to monitor models regularly in real time.

Inaccurate models can be costly for businesses. Over time, even highly accurate models are prone to decay as the incoming data shifts away from the original training set. This phenomenon is called model drift.

Procedure to Detect Model Drift:

Instance prediction:

With a single incoming data point, data drift cannot be detected: one data point cannot represent a population, so there is nothing meaningful to summarize or to compare against the training distribution.

Batch Prediction:

We can save each request (data point) to a database at a specified interval, say every second, and then fetch the data after a few hours. This gives us a batch containing a substantial number of data points, which puts us in a position to detect data drift.
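The batching step described above can be sketched with a minimal in-memory buffer (a hypothetical helper, not from the article; a real system would persist requests to a database as described):

```python
from collections import deque

class PredictionBuffer:
    """Accumulate incoming data points until a batch is large enough
    for drift detection to be statistically meaningful."""

    def __init__(self, batch_size=1000):
        self.batch_size = batch_size
        self.buffer = deque()

    def add(self, data_point):
        """Store one incoming request/data point."""
        self.buffer.append(data_point)

    def pop_batch(self):
        """Return a full batch for drift analysis, or None if not ready yet."""
        if len(self.buffer) < self.batch_size:
            return None
        return [self.buffer.popleft() for _ in range(self.batch_size)]

buf = PredictionBuffer(batch_size=3)
for x in [0.1, 0.5, 0.9]:
    buf.add(x)
print(buf.pop_batch())  # a batch of 3 points, ready for analysis
```

The batch size is a trade-off: larger batches give more reliable statistical tests but delay the alert.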

The first analysis we should run is an assessment of our data's distribution. The training dataset was a sample from a particular moment in time, so it is critical to compare the distribution of the training set with the incoming data to understand what shift has occurred. If the difference is large, we can alert the responsible team and proceed with model retraining.
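A first-pass distribution comparison can be as simple as summary statistics on each feature (a sketch with synthetic data; the 0.5 alert threshold is illustrative, not a standard value):

```python
import numpy as np

def summarize(sample):
    """Basic distribution summary for a 1-D feature."""
    return {"mean": float(np.mean(sample)),
            "std": float(np.std(sample)),
            "p05": float(np.percentile(sample, 5)),
            "p95": float(np.percentile(sample, 95))}

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5000)     # training-time sample
incoming = rng.normal(loc=0.8, scale=1.0, size=5000)  # shifted production data

shift = abs(summarize(incoming)["mean"] - summarize(train)["mean"])
if shift > 0.5:  # illustrative alert threshold
    print("ALERT: feature mean shifted by", round(shift, 2))
```

Summary statistics catch gross shifts cheaply; the statistical tests discussed later in the article are more sensitive.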

In real time, data drift can appear in several forms, as shown below. It is very important to judge the situation and act at the right time.

Data Trends That Indicate Model Drift:

Different types of drift can be encountered in real-life scenarios: abrupt, gradual, incremental, recurrent, or blip. Which scenario applies critically determines whether we should reject or accept the incoming data, or retrain the model.
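These drift shapes can be illustrated with a toy synthetic signal whose mean follows each pattern over 1,000 time steps (a hypothetical sketch, not from the article):

```python
import numpy as np

n = 1000
t = np.arange(n)
rng = np.random.default_rng(3)
noise = rng.normal(0, 0.1, n)  # small observation noise around each pattern

abrupt      = np.where(t < 500, 0.0, 1.0) + noise                 # sudden jump to the new concept
gradual     = np.where(rng.random(n) < t / n, 1.0, 0.0) + noise   # old and new concepts mix; new one grows more likely
incremental = np.clip((t - 300) / 400, 0, 1) + noise              # slow, steady shift between concepts
recurrent   = np.where((t // 250) % 2 == 1, 1.0, 0.0) + noise     # the old concept periodically returns
blip        = np.where((t > 480) & (t < 520), 1.0, 0.0) + noise   # brief anomaly, then back to normal
```

A blip is usually an outlier episode to ignore, while abrupt or incremental drift typically warrants retraining; recurrent drift may call for season-specific models.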

Image credit: KDnuggets

Types of Drift in Machine Learning:

Credit: fiddler.ai

Model drift is broadly classified into two types:

  1. Concept Drift: Concept drift occurs when the patterns the model learned no longer hold, i.e., the relationship between the model's input and output changes.

Below is an example of abrupt concept drift. Before lockdown, the predicted and actual sales of loungewear matched, but with the coronavirus lockdown the sales of loungewear suddenly skyrocketed, because people were spending a lot of time at home and wearing less formal clothing. In this case the model may need retraining with additional features to re-establish the relationship between input and output variables.

Credit: image from UbiOps

2. Data Drift: The type of model drift where the underlying distributions of the input features have changed over time. This can happen for many reasons, such as seasonal behavior or a change in the underlying population.

Below is an example of data drift. The training set consisted mostly of women, so the model learned that it was mostly women who spend more.

After some time, the webshop is visited more often by men than by women. Since the model had limited examples of men to learn from in the training set, it is harder for it to predict what men will spend on average. The distribution of data points the model was trained on has changed.

Credit: image from UbiOps

a) Target Drift: a shift in the output distribution; P(Y) changes.

b) Feature Drift: a shift in the input distribution; P(X) changes.
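Both kinds of shift can be checked with the same two-sample test: apply it column-wise to the inputs for feature drift in P(X), and to the outputs for target drift in P(Y). A sketch with synthetic data (the 0.05 threshold and the KS test are as used later in the article):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Reference (training-time) data: two features and a target
X_ref = rng.normal(size=(2000, 2))
y_ref = rng.normal(size=2000)
# New production data: feature 1 and the target have shifted
X_new = np.column_stack([rng.normal(size=2000),
                         rng.normal(loc=2.0, size=2000)])
y_new = rng.normal(loc=1.5, size=2000)

# Feature drift: test each input column separately (P(X))
for i in range(X_ref.shape[1]):
    p = ks_2samp(X_ref[:, i], X_new[:, i]).pvalue
    print(f"feature {i}: drift={'yes' if p < 0.05 else 'no'} (p={p:.3g})")

# Target drift: test the output distribution (P(Y))
p_y = ks_2samp(y_ref, y_new).pvalue
print(f"target: drift={'yes' if p_y < 0.05 else 'no'} (p={p_y:.3g})")
```

Testing features individually also tells you *which* inputs drifted, which helps with root-cause analysis.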

How to Detect Model Drift:

a) Supervised learning : Data with labels

If you have labeled data, model drift can be identified with performance monitoring and supervised learning methods. We can start with standard metrics such as accuracy, precision, false positive rate, and area under the ROC curve (AUC).
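The four metrics named above can be computed directly from true labels and predicted scores. A plain-NumPy sketch (illustrative function name; AUC uses the rank-sum formulation and assumes no tied scores):

```python
import numpy as np

def performance_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, precision, false-positive rate, and AUC for a binary classifier."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    # AUC via the rank-sum (Mann-Whitney U) formulation; assumes no tied scores
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    n_pos, n_neg = np.sum(y_true == 1), np.sum(y_true == 0)
    auc = (np.sum(ranks[y_true == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return {"accuracy": accuracy, "precision": precision, "fpr": fpr, "auc": auc}

y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2])
print(performance_metrics(y_true, y_score))
```

Tracking these metrics on each labeled batch and alerting when they fall below the values seen at validation time is the simplest supervised drift monitor.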

b) Unsupervised learning : Data without labels

It is critical to compare the distribution of the training set with the new data to understand what shift has occurred. A variety of distance metrics and nonparametric tests can be used to measure this, including the Kullback-Leibler divergence, Jensen-Shannon divergence, and Kolmogorov-Smirnov test.
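The first two of these are available in SciPy (`scipy.stats.entropy` computes the KL divergence when given two distributions, and `scipy.spatial.distance.jensenshannon` returns the square root of the JS divergence); the histogram discretization below is an assumption of this sketch, since both measures need discrete probability vectors:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy  # entropy(p, q) gives the KL divergence KL(p || q)

rng = np.random.default_rng(1)
train = rng.normal(0, 1, 10000)
new = rng.normal(1, 1, 10000)  # the mean has shifted

# Histogram both samples on shared bins to get discrete distributions
bins = np.histogram_bin_edges(np.concatenate([train, new]), bins=30)
p, _ = np.histogram(train, bins=bins)
q, _ = np.histogram(new, bins=bins)
p, q = p / p.sum(), q / q.sum()

eps = 1e-10  # avoid log(0) in KL for empty bins
print("KL divergence:", entropy(p + eps, q + eps))
print("JS distance  :", jensenshannon(p, q))
```

Unlike KL, the Jensen-Shannon measure is symmetric and bounded, which makes it easier to threshold for alerting.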

Root Causes of ML Model Drift:

There are two main causes, matching the two drift types above: a change in the relationship between the model's inputs and outputs (concept drift), and a change in the distribution of the input data itself (data drift), driven by factors such as seasonal behavior or a shift in the underlying population.

Tools used for detecting Data Drift:

1. Simple statistical test:

ks_2samp from the SciPy library: the two-sample Kolmogorov-Smirnov test compares two samples and tests whether they could have been drawn from the same distribution.

If the p-value from the KS test is lower than the threshold of 0.05, we can reject the null hypothesis and conclude that the two samples were not drawn from the same distribution.

from scipy.stats import ks_2samp
import numpy as np

training_data = np.arange(10)         # data on which the model was trained
new_batch_data = np.arange(100, 200)  # data newly acquired after drift
result = ks_2samp(training_data, new_batch_data)
print(result)

The result was :

KstestResult(statistic=1.0, pvalue=4.2646072253826637e-14)

As the p-value from the KS test is lower than the threshold of 0.05, we reject the null hypothesis and conclude that the data were not drawn from the same distribution.

Further example :

If the training data and the new batch come from different distributions, the p-value is very low (~0), as depicted below. This indicates data drift, and the model might need retraining:

KstestResult(statistic=0.68548, pvalue=0.0)

If the training data and the new batch come from the same distribution, the p-value is high (> 0.05), as depicted below. This indicates no data drift in the incoming batch:

KstestResult(statistic=0.002599999999999991, pvalue=0.8870344462926139)
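Both scenarios can be reproduced qualitatively with synthetic data (a sketch; the exact statistics and p-values will differ from the ones shown above):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
train = rng.normal(0, 1, 5000)

# Scenario 1: incoming batch from a shifted distribution -> drift
drifted = rng.normal(2, 1, 5000)
print(ks_2samp(train, drifted))  # tiny p-value: reject H0, drift detected

# Scenario 2: incoming batch from the same distribution -> no drift
same = rng.normal(0, 1, 5000)
print(ks_2samp(train, same))     # p-value typically > 0.05: fail to reject H0
```

Note that with very large batches even tiny, practically irrelevant shifts become statistically significant, so the 0.05 threshold should be paired with a judgment of effect size.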

2. Evidently:

Link to the tool :

https://www.evidentlyai.com/blog/ml-monitoring-data-drift-how-to-handle

Evidently is an open-source Python library for data scientists and ML engineers. It helps evaluate, test, and monitor the performance of ML models from validation to production.

An example of the report is as shown below:

Credit: evidentlyai.com

References:

  1. Evidently AI: https://www.evidentlyai.com/blog/machine-learning-monitoring-data-and-concept-drift#what-else-can-go-wrong
  2. UbiOps: https://ubiops.com/an-introduction-to-model-drift-in-machine-learning/
  3. Analytics Vidhya: https://www.analyticsvidhya.com/blog/2021/10/mlops-and-the-importance-of-data-drift-detection/
  4. Analytics India Magazine: https://analyticsindiamag.com/what-are-the-ways-to-automate-model-drift/
  5. Fiddler AI: https://www.fiddler.ai/blog/drift-in-machine-learning-how-to-identify-issues-before-you-have-a-problem
