Monitoring Data Drift using NannyML

joydeepml2020
Aug 2, 2023

The life cycle of any ML project or ML product is given in the Figure. Most of our focus in an ML project hovers around understanding the business problem, identifying the data sources and getting the data, and then building data pipelines, data pre-processing, feature engineering, model building and model deployment. The success of any ML project or product, however, really depends on how the models perform in the applied business context. As data scientists, we tend to ignore model monitoring, even though it is a critical component in the design of an ML system. ML product development is iterative, so monitoring the models and re-training them when needed is critical for the success of ML products. In this article, we will explore an open-source library named NannyML.

This library can be part of the pipeline and can perform the following tasks:

- Performance Estimation and Calculation

Using performance estimation algorithms, users can track classification metrics such as accuracy and AUC, or regression metrics such as MSE and RMSE, for predictions on live data when the ground truth is not available.

- Business Value Estimation and Calculation

NannyML estimates business value metrics using the CBPE (Confidence-based Performance Estimation) algorithm, which provides a way to tie the performance of a model to business-oriented outcomes. As of today, it is best suited for classification tasks.

- Data Quality

It helps test data quality. It allows us to identify missing values and unseen values in categorical columns. These changes are plotted over time to help the user better understand the data.

- Detecting data drift

NannyML can help identify, and trigger alerts for, both multivariate and univariate feature drift. For multivariate drift it uses a PCA-based data reconstruction technique. For univariate drift detection it supports the following methods:

- Jensen-Shannon

- Hellinger

- Wasserstein

- L-infinity

- Kolmogorov-Smirnov

- Chi-2

- Custom Thresholds

This feature raises an alert when a model's performance exceeds the upper threshold or drops below the lower threshold. For added flexibility, the developer can use either a constant threshold or a standard-deviation-based threshold.
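To make the two threshold styles concrete, here is a minimal pure-Python sketch. It is not NannyML's implementation: the function names `constant_threshold`, `std_threshold` and `alerts` are hypothetical, the accuracy values are made up for illustration, and the sketch assumes that the standard-deviation variant places its limits at the reference mean plus or minus a multiple of the reference standard deviation.

```python
from statistics import mean, pstdev

def constant_threshold(lower, upper):
    """Return fixed (lower, upper) limits that do not depend on the data."""
    return lower, upper

def std_threshold(reference_values, multiplier=3.0):
    """Return limits at mean +/- multiplier * std of the reference metric values."""
    centre = mean(reference_values)
    spread = pstdev(reference_values)
    return centre - multiplier * spread, centre + multiplier * spread

def alerts(values, limits):
    """Flag every value that falls outside the (lower, upper) limits."""
    lower, upper = limits
    return [v < lower or v > upper for v in values]

# Hypothetical per-chunk accuracy values (illustrative numbers, not real results).
reference_accuracy = [0.90, 0.91, 0.89, 0.90, 0.92, 0.90]
analysis_accuracy = [0.91, 0.90, 0.84, 0.80]

const_flags = alerts(analysis_accuracy, constant_threshold(0.85, 0.95))
std_flags = alerts(analysis_accuracy, std_threshold(reference_accuracy))
print(const_flags)
print(std_flags)
```

A constant threshold suits metrics with a known acceptable range, while a standard-deviation-based one adapts to how much the metric naturally fluctuated in the reference period.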

We are going to discuss data drift in more detail in this article. NannyML provides a simple and elegant way to detect data drift. Data drift indicates a change in the distribution of the data. NannyML provides methods to detect both univariate and multivariate data drift. Let us briefly look at the univariate methods first. They fall into two classes based on the variable type. For continuous variables, the following methods are used:

- Kolmogorov-Smirnov (K-S) test: a two-sample, non-parametric statistical test used to measure the similarity of two continuous distributions. The test statistic is the maximum absolute difference between the two empirical cumulative distribution functions (CDFs), so it also acts as a distance measure between the two distributions. The K-S statistic is not differentiable.

- Jensen-Shannon distance: a metric for the similarity between two distributions, based on the Kullback-Leibler (KL) divergence. The Jensen-Shannon divergence is a symmetrized and bounded modification of the KL divergence, and the Jensen-Shannon distance is its square root.

- Wasserstein distance: also known as the earth mover's distance, it measures the difference between two probability distributions. In one dimension it is the integral of the absolute difference between the two CDFs, that is, the area between the CDFs.

- Hellinger distance: measures the overlap between two distributions. It is 0 when they are identical and 1 when they do not overlap at all.
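The two CDF-based measures can be sketched in a few lines of pure Python. This is an illustration of the definitions, not NannyML's code: the helper names `ecdf`, `ks_statistic` and `wasserstein_distance` and the toy samples are assumptions made for this example, and the K-S part computes only the test statistic, not a p-value.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function that evaluates the empirical CDF of `sample`."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n

def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    cdf_a, cdf_b = ecdf(a), ecdf(b)
    # Empirical CDFs only change at sample points, so checking those is enough.
    points = sorted(set(a) | set(b))
    return max(abs(cdf_a(x) - cdf_b(x)) for x in points)

def wasserstein_distance(a, b):
    """Area between the two empirical CDFs (1-D earth mover's distance)."""
    cdf_a, cdf_b = ecdf(a), ecdf(b)
    points = sorted(set(a) | set(b))
    area = 0.0
    for left, right in zip(points, points[1:]):
        # Both CDFs are constant on [left, right), so each slice is a rectangle.
        area += abs(cdf_a(left) - cdf_b(left)) * (right - left)
    return area

# Toy samples: the second one has the same shape, moved right by 1.
reference = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]

ks_value = ks_statistic(reference, shifted)
w_value = wasserstein_distance(reference, shifted)
print(ks_value, w_value)
```

The K-S statistic only reports the single largest gap, while the Wasserstein distance accumulates the gap over the whole range, so it grows with how far the distribution has moved.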

The methods above are used to detect data drift in continuous variables. For univariate drift in categorical features, the Chi-square test, Jensen-Shannon distance, Hellinger distance and L-infinity distance are used.
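For categorical features, the distance-based methods compare the relative frequency of each category in the two data sets. The sketch below shows the Jensen-Shannon, Hellinger and L-infinity distances under that assumption. The function names and the toy data are hypothetical, base-2 logarithms are assumed so that the Jensen-Shannon distance lies between 0 and 1, and the Chi-square test is left out because it needs a p-value computation.

```python
from collections import Counter
from math import log2, sqrt

def relative_frequencies(values, categories):
    """Share of each category in `values`, in the order given by `categories`."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]

def kl_divergence(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture of the two
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return sqrt(max(divergence, 0.0))  # guard against tiny negative rounding

def hellinger_distance(p, q):
    """0 for identical distributions, 1 for distributions with no overlap."""
    return sqrt(sum((sqrt(pi) - sqrt(qi)) ** 2 for pi, qi in zip(p, q)) / 2)

def l_infinity_distance(p, q):
    """Largest change in the share of any single category."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

# Toy categorical feature: category "a" grows from 50% to 80% of the rows.
reference = ["a"] * 50 + ["b"] * 50
analysis = ["a"] * 80 + ["b"] * 20
categories = sorted(set(reference) | set(analysis))

p = relative_frequencies(reference, categories)
q = relative_frequencies(analysis, categories)
print(jensen_shannon_distance(p, q))
print(hellinger_distance(p, q))
print(l_infinity_distance(p, q))
```

The same functions can be applied to a continuous feature after binning it into a histogram, although the choice of bins then affects the result.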

Because the univariate drift detection methods are separated by variable type (continuous and categorical), they can be easily incorporated into production pipelines.

NannyML also helps in detecting multivariate data drift. Before diving into multivariate data drift, let us understand why we need multivariate data drift in the first place.

Univariate techniques work well when the distribution of a single feature shifts. In reality, however, machine learning models have multiple features, and a shift in the relationship between features, such as a change in correlation, cannot be detected by univariate methods.

Such changes in the internal structure of higher-dimensional data cannot be detected by univariate methods, and hence we need multivariate drift detection techniques. NannyML uses PCA-based reconstruction error to detect drift in the global structure of the data.
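The idea behind reconstruction error can be illustrated with a small two-feature sketch in pure Python. It is a simplified stand-in, not NannyML's implementation: it assumes only two numeric features, keeps a single principal component, and uses a closed-form angle for the leading eigenvector of the 2x2 covariance matrix. The function names and the generated data are hypothetical.

```python
import math
import random
from statistics import mean, pstdev

def scale(row, stats):
    """Standardize a (x, y) row with the reference means and standard deviations."""
    mx, sx, my, sy = stats
    return ((row[0] - mx) / sx, (row[1] - my) / sy)

def fit_reference(rows):
    """Learn the scaling and the first principal direction from reference data."""
    xs, ys = zip(*rows)
    stats = (mean(xs), pstdev(xs), mean(ys), pstdev(ys))
    scaled = [scale(r, stats) for r in rows]
    # Covariance entries of the standardized data (its means are zero).
    a = mean([u * u for u, _ in scaled])
    b = mean([u * v for u, v in scaled])
    c = mean([v * v for _, v in scaled])
    # Angle of the leading eigenvector of [[a, b], [b, c]].
    theta = 0.5 * math.atan2(2 * b, a - c)
    return stats, (math.cos(theta), math.sin(theta))

def reconstruction_error(rows, stats, direction):
    """Mean distance between each scaled row and its 1-component reconstruction."""
    dx, dy = direction
    errors = []
    for row in rows:
        u, v = scale(row, stats)
        score = u * dx + v * dy            # compress to one component
        ru, rv = score * dx, score * dy    # decompress back to two features
        errors.append(math.hypot(u - ru, v - rv))
    return mean(errors)

rng = random.Random(0)

def sample(sign, n=500):
    """Hypothetical data: y follows sign * x plus a little noise."""
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        rows.append((x, sign * x + rng.gauss(0, 0.2)))
    return rows

reference = sample(+1)
same_structure = sample(+1)
flipped = sample(-1)   # each feature looks the same, but the correlation is reversed

stats, direction = fit_reference(reference)
err_reference = reconstruction_error(reference, stats, direction)
err_same = reconstruction_error(same_structure, stats, direction)
err_flipped = reconstruction_error(flipped, stats, direction)
print(err_reference, err_same, err_flipped)
```

In the flipped sample each feature keeps roughly the same individual distribution, so a univariate check on either column would see little change, yet the reconstruction error rises sharply because the relationship learned from the reference data no longer holds.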

In the next article, we will take one simple (made-up) example to understand how to implement methods for detecting univariate and multivariate data drift.

Ref: https://nannyml.readthedocs.io/en/stable/tutorials.html