Adoption of machine learning (ML) methods across all industries has drastically increased over the last few years. Starting from a handful of ML models, companies now find themselves supporting hundreds of models in production. Operating these models requires the development of comprehensive capabilities for batch and real-time serving, data management, uptime, scalability and many other aspects.
Developing a unified observability framework that enables monitoring of production model deployments, makes it easy to add new models, detects data issues and feature drift, and ensures the stability, correctness, and trustworthiness of ML solutions is a key challenge in MLOps. In this blog post, we provide guidance on how such a framework can be designed and implemented in AWS using cloud-native services and open-source libraries.
Managing machine learning models in production has never been easy: it involves monitoring model runtime, A/B testing scenarios, model uptime, critical issues, and data accuracy, among other meaningful parameters.
The challenges are not limited to ML model management; they extend to data engineering, platform engineering, and monitoring and managing the microservice layer that triggers the ML model inference layer. Observability comes into play as the practice of connecting DataOps, MLOps and DevOps together to build a comprehensive monitoring framework that tracks how changes in the data can affect models and introduce model drift.
Speaking of observability, the figure below depicts the comparable complexity of building a monitoring system for microservices, data pipelines, and ML models:
Despite some similarities at first glance, the elements of each layer are quite different upon closer inspection. Model observability requires deep data science knowledge, since it must be managed as a scientific task. Moreover, it draws on game theory, deep learning, and regulatory compliance, making the task even more challenging.
Model observability can be viewed as quite similar to classic model monitoring: tracking model accuracy, inputs, and outputs. However, once performance metrics are calculated, a critical task remains: identifying the reasons for performance degradation through deep insights gained from explainability methods. Classic model monitoring is thus only one aspect of observability, albeit a major one. Performance metrics alone might be biased and can lead to skewed outcomes for particular groups of people, and they usually require additional investigation, not only in the model code base but also in the data. The reasons for skewed outcomes can be linked to significant changes in features or in actual user behavior. For tabular data, we can easily detect outliers using cloud-managed services. For unstructured data like images and text, it becomes a much harder task, as displayed in the figure below:
The major indicators of potential issues are violations in the metrics, called “drifts”. The concept of drift is fundamental to the model observability process, and it is model-specific: models working with tabular data, images, videos, and text each have their own specificities.
Methods for validating model inputs are extremely well-developed for tabular data, but the range of techniques available for image, video, and text data is much more limited. In this article, we focus only on the input tabular data. However, the observability methods that validate the model outputs (such as accuracy drift) can be applied to any classification or regression models regardless of their input type.
The basic data engineering rule says: garbage in, garbage out. This applies particularly to machine learning, since a model is only as good as the data it was trained on. But even after training is completed, data is still used in production through online or batch feature stores, which serve customer profiles, the previous year's transaction history, or the number of speeding fines. Data is a core part of the whole model lifecycle, and its quality plays a vital role in how good the models are. In this article, we'll discuss data drift as changes in data distribution, assuming basic data quality issues have been solved in the DataOps stage.
Data drift is a change in the data distribution. For example, if the target audience of an eCommerce site grows from one country to a whole region, recommendations tied to geographical signals, like “customers who bought this might also be interested in buying an umbrella”, might be irrelevant to those who live in northern countries. Yet another example is when the data preprocessing that creates embeddings for image data used in classification or recommendations evolves over time, and brand new smart devices are incorrectly classified.
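One widely used way to quantify this kind of tabular drift is the Population Stability Index (PSI), which compares binned feature distributions between a baseline window and production traffic. The sketch below is a minimal pure-Python illustration; the bin count and the conventional 0.1/0.25 thresholds are common rules of thumb, not values prescribed by the services discussed here:

```python
import math
import random

def psi(reference, production, bins=10):
    """Population Stability Index between two numeric samples.

    Bin edges are taken from reference quantiles; a small epsilon avoids
    log-of-zero in empty bins. Rule of thumb: PSI < 0.1 = stable,
    0.1-0.25 = moderate drift, > 0.25 = significant drift.
    """
    ref = sorted(reference)
    edges = [ref[int(len(ref) * i / bins)] for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(1 for e in edges if x >= e)] += 1
        eps = 1e-4
        return [max(c / len(sample), eps) for c in counts]

    p, q = proportions(reference), proportions(production)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

random.seed(42)
baseline = [random.gauss(0, 1) for _ in range(5000)]
same = [random.gauss(0, 1) for _ in range(5000)]
shifted = [random.gauss(1.0, 1) for _ in range(5000)]  # mean shifted by 1 sigma

print(round(psi(baseline, same), 4))     # near zero: no drift
print(round(psi(baseline, shifted), 4))  # well above 0.25: drift detected
```

The same scheme extends to categorical features by replacing quantile bins with category frequencies.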
Industry recommendations for tracking data drift are as follows:
A well-known truth is that data changes over time. Changes in a person's data can cause schema or data type changes; for instance, a data type change or a negative transaction amount can cause issues in downstream systems. Another common case for practitioners is a cardinality change, when a sudden shift in category distribution occurs. For example, a bachelor-cohort sale can affect recommendations for other customer cohorts.
General recommendations for tracking data quality are: continuous historical checks of data assets as they evolve from raw to golden datasets and ML model features, tracking of cardinality distribution and schema evolution, and basic numeric/out-of-range checks.
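These recommendations can be sketched as a lightweight batch validator. The example below is a hypothetical illustration of schema, range, and cardinality checks in pure Python; production systems would typically delegate this to a tool like Deequ or a managed data quality service:

```python
def check_batch(records, schema, max_cardinality):
    """Run basic data quality checks on a batch of records.

    schema: column -> (expected_type, (min, max) bounds or None)
    max_cardinality: column -> maximum allowed number of distinct values
    Returns a list of human-readable violations.
    """
    violations = []
    seen = {col: set() for col in max_cardinality}
    for i, row in enumerate(records):
        for col, (typ, bounds) in schema.items():
            if col not in row:
                violations.append(f"row {i}: missing column '{col}'")
                continue
            value = row[col]
            if not isinstance(value, typ):
                violations.append(f"row {i}: '{col}' has type {type(value).__name__}")
            elif bounds and not (bounds[0] <= value <= bounds[1]):
                violations.append(f"row {i}: '{col}'={value} out of range {bounds}")
        for col in seen:
            if col in row:
                seen[col].add(row[col])
    for col, limit in max_cardinality.items():
        if len(seen[col]) > limit:
            violations.append(f"'{col}': cardinality {len(seen[col])} exceeds {limit}")
    return violations

# Illustrative transactions: one negative amount, one wrong type
schema = {"amount": (float, (0.0, 1e6)), "country": (str, None)}
rows = [
    {"amount": 120.5, "country": "US"},
    {"amount": -3.0, "country": "DE"},   # negative amount: out of range
    {"amount": "n/a", "country": "FR"},  # wrong data type
]
for v in check_batch(rows, schema, {"country": 2}):
    print(v)
```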
Bias is a prejudice in favor of or against a person, group, feature, or behavior that is commonly considered unfair. Now that machine learning algorithms make decisions on behalf of humans, whether it's issuing a credit score, calculating insurance, or providing recommendations, ML models may behave unfairly toward certain groups.
Machine learning model fairness is an algorithmic approach to correct algorithmic bias in decision-making processes. For regulated industries like financial services or insurance, it’s critical for models to avoid biased predictions: credit offerings should not depend on gender, age, or other aspects that might be discriminative.
Each case of fairness adoption is unique, depending on the regulatory norms, business goals, and technical capabilities. We’ll briefly mention a few of them:
Explainability is a method of explaining model behavior in human-understandable terms. For linear regression or gradient boosting algorithms this might not be complicated, but it is challenging for neural networks and other black-box models. Even though such models can't be explained fully, there are model-agnostic methods like SHAP (SHapley Additive exPlanations) that plot dependencies and discover meaningful patterns between data attributes and model outputs. For instance, a credit score model that depends heavily on age might underestimate younger age groups and overestimate middle age groups. Explainability helps identify features that dominate the decision-making process so they can be eliminated during the development phase. In the example above, age would be highlighted as a primary feature used in model scoring, prompting the data science team to check the correctness of the implementation.
Yet another scenario, very common in our practice, is when a new version of an ML model has worse accuracy than the previous one. This usually happens when the previous version of the model used one feature set and the next one uses another. Explainability helps identify such cases and ensure that the feature set change is legitimate. During training, explainability helps build confidence in the features chosen for the model, ensuring that the model is unbiased and uses accurate features for scoring. There are various techniques, such as SHAP, kernel SHAP, and LIME, where SHAP aims to provide global explainability and LIME attempts to provide local explainability.
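To make the idea concrete: for a plain linear model, Shapley values have an exact closed form, so the dominant feature can be identified without any sampling or approximation. The sketch below uses hypothetical credit-score weights purely for illustration; for real black-box models you would reach for a library such as shap:

```python
def linear_shap(weights, baseline_means, x):
    """Exact SHAP values for a linear model f(x) = bias + sum(w_i * x_i).

    For linear models the Shapley value of feature i reduces to
    w_i * (x_i - E[x_i]), so each feature's contribution is exact.
    """
    return {name: w * (x[name] - baseline_means[name]) for name, w in weights.items()}

# Hypothetical credit-score model where age dominates the decision
weights = {"age": 4.0, "income": 0.5, "tenure": 0.8}
means = {"age": 40.0, "income": 60.0, "tenure": 5.0}      # population averages
applicant = {"age": 25.0, "income": 62.0, "tenure": 6.0}  # a young applicant

contributions = linear_shap(weights, means, applicant)
top = max(contributions, key=lambda k: abs(contributions[k]))
print(contributions)
print("dominant feature:", top)  # age, as in the credit-score example above
```

Seeing one feature dwarf all the others in this breakdown is exactly the signal that should prompt the data science team to re-examine the implementation.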
Model performance analysis has never been an easy task: many implementations require monitoring vast numbers of metrics. However, over the last few years, a new framework for performance monitoring has been developed. Performance measurement starts with establishing ground truth. Ground truth in machine learning refers to the reality you want to model with your supervised machine learning algorithm; it is also known as the target for training or validating the model with a labeled dataset. The ideal scenario is depicted in the figure below:
Predictions matched against ground truth can be compared from a performance standpoint, and common metrics such as precision, recall, accuracy, F1, and MAE tracked on a weekly or bi-weekly cadence to check whether performance is degrading.
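Once predictions are matched to ground truth labels, these classification metrics reduce to simple counting. A minimal pure-Python sketch for the binary case (the sample labels are illustrative):

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy from matched ground truth labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# One monitoring window of matched predictions vs. ground truth
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
week = classification_metrics(y_true, y_pred)
print(week)
```

Tracking these numbers per window and comparing them against a baseline is what turns one-off evaluation into monitoring.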
Ground truth metrics, or labels, are in many cases collected manually, and sometimes are very hard to get. When ground truth is not available, a general recommendation is to gather performance metrics at some cadence and review them periodically to catch unexpected performance drops.
Solving this task requires both data engineering and MLOps engineering. Therefore, we consider using batch or streaming jobs within AWS Sagemaker. AWS Sagemaker provides services such as AWS Model Monitor and AWS Clarify that are fully integrated with AWS Sagemaker endpoints for online inference, and with AWS Sagemaker batch transformation jobs for offline inference.
First, every evaluated metric has baseline values and constraints used to identify model degradation over time. To address this task, AWS Sagemaker offers baselining jobs that produce JSON-based artifacts as one of the inputs for observability jobs. Besides that, as outlined above, ground truth labels are collected to calculate bias and performance metrics. This data, along with production data from the machine learning endpoints, enables evaluation of the actual behavior of the model in production and timely detection of bias and performance degradation. Finally, it's important to note that the only difference between real-time endpoints and batch transformation jobs is the method of capturing production data to the S3 bucket.
The AWS reference architecture is depicted in Figure 5:
AWS provides four types of monitoring jobs as part of the AWS Model Monitor and AWS Clarify services. Each job requires production data in the form of requests to the deployed AWS Sagemaker Endpoints and their predictions. AWS Sagemaker Endpoints offer built-in functionality to capture and store this data, sending serialized, aggregated requests and outcomes to S3 with metadata such as the timestamp and an inference identifier. For unstructured data like text or images, you have the option to keep the content whole by setting a special flag in the endpoint configuration.
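As a rough sketch of how this data capture is enabled when deploying an endpoint with the sagemaker Python SDK (the bucket URI, sampling percentage, and deployment parameters below are placeholders, not values from this article):

```python
# Sketch only: assumes the sagemaker Python SDK, AWS credentials,
# and a deployable model object; all names and URIs are placeholders.
from sagemaker.model_monitor import DataCaptureConfig

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,  # capture every request/response pair
    destination_s3_uri="s3://my-observability-bucket/endpoint-data-capture",
)

# Passed at deployment time, e.g.:
# predictor = model.deploy(
#     initial_instance_count=1,
#     instance_type="ml.m5.xlarge",
#     data_capture_config=data_capture_config,
# )
```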
Having configured AWS monitoring jobs, which run on top of Amazon Elastic Container Service, you can schedule them with a minimum interval of one hour. Let us now inspect each of the job types in detail.
This job checks data quality metrics and statistics against baseline values to detect data distribution changes. To accomplish this, the job uses the open-source library Deequ, built on top of the Apache Spark framework, which enables effective processing of big data.
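Setting up such a job with the sagemaker Python SDK looks roughly like the sketch below: a baselining job first suggests statistics and constraints from the training data, and an hourly schedule then checks captured endpoint traffic against them. The role ARN, S3 URIs, and names are placeholders:

```python
# Sketch only: assumes the sagemaker Python SDK, an IAM role, and an
# already-deployed endpoint with data capture enabled.
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Baselining job: computes statistics and suggested constraints (Deequ-based)
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/baseline.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitoring/baseline",
)

# Hourly data quality job against captured endpoint traffic
monitor.create_monitoring_schedule(
    monitor_schedule_name="data-quality-hourly",
    endpoint_input="my-endpoint",
    output_s3_uri="s3://my-bucket/monitoring/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```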
Even so, performance issues can still occur when calculating statistics and evaluating data distributions. For this case, the AWS Sagemaker monitoring job uses the KLL sketch approach, which computes approximate values for the data distribution.
Usage: The Data Drift Job is suitable for all kinds of tasks with tabular data to measure input and output data distribution, and perform basic quality checks. It can also be used with existing pre-training data quality checks when the Bring Your Own Containers (BYOC) approach is followed.
This job calculates pre-training and post-training fairness metrics for sensitive and important features and their particular values, called “facets”.
First, as part of exploratory data analysis (EDA), we evaluate facet balance. Equal representation of facets can be calculated across the training data irrespective of labels, on the subset of the data with positive labels only, or over each label separately.
The existence of bias among feature facets results in facets being assigned to either advantaged or disadvantaged groups. Detecting bias at this stage signals that bias will also exist in model predictions. The available metrics for bias detection, with detailed descriptions, are shown in Figure 6:
Usage of pre-training metrics: The bias metrics calculation is suitable for regression and classification tasks with tabular data. This AWS Sagemaker job can be included in the existing data evaluation process before training.
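Two of the simplest pre-training metrics can be computed by hand: class imbalance (CI), which measures how unevenly a facet is represented, and difference in proportions of labels (DPL), which compares positive-label rates across facet groups. A small pure-Python sketch with made-up loan approval data:

```python
def class_imbalance(facet):
    """CI = (n_a - n_d) / (n_a + n_d): how unevenly the facet is represented."""
    n_a = sum(1 for f in facet if f == "a")
    n_d = len(facet) - n_a
    return (n_a - n_d) / (n_a + n_d)

def diff_positive_labels(facet, labels):
    """DPL = q_a - q_d: difference in positive-label proportions per facet."""
    pos_a = [l for f, l in zip(facet, labels) if f == "a"]
    pos_d = [l for f, l in zip(facet, labels) if f == "d"]
    return sum(pos_a) / len(pos_a) - sum(pos_d) / len(pos_d)

# Made-up data: "a" = advantaged facet value, "d" = disadvantaged;
# labels: 1 = loan approved, 0 = rejected
facet  = ["a"] * 8 + ["d"] * 2
labels = [1, 1, 1, 1, 1, 1, 0, 0, 1, 0]

print(class_imbalance(facet))               # 0.6  -> facet "d" is underrepresented
print(diff_positive_labels(facet, labels))  # 0.25 -> "a" is favored in the labels
```

Values near zero indicate balance; the Amazon AI Fairness and Explainability Whitepaper documents the full metric definitions and their interpretation.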
Having trained a model, we can detect bias in production data: features, predicted labels, and observed labels. In this case, AWS Clarify suggests picking from a rich set of available metrics, as shown in Figure 7:
Usage of post-training metrics: After the EDA stage, the most valuable and sensitive feature facets, where discrimination can affect model performance, business profit, or regulatory compliance, are selected. You can then schedule an AWS Clarify job to monitor bias metrics and trigger an alarm if any violations are identified. The service allows you to set up value thresholds based on business requirements. Moreover, these calculations can be easily integrated with the existing performance monitoring process because they require the same production data. You can inspect each metric in detail in the Amazon AI Fairness and Explainability Whitepaper.
As a crucial part of explainability, this job calculates KernelSHAP values for live production data. It reports the top ten features sorted by their contribution to the predictions, and helps explain violations in the bias or performance metrics.
Heat maps can also be built for computer vision tasks based on the KernelSHAP method.
Speaking of NLP models, AWS Clarify supports advanced functionality to analyze the contribution of various sections in the text at different levels of granularity, such as token, phrase, sentence or paragraph.
This task includes calculating performance metrics to evaluate model quality. A predefined set of metrics exists for each ML problem type, such as regression, binary classification, and multiclass classification. The job requires gathering ground truth labels.
The jobs produce PDF reports and show the results in AWS Sagemaker Governance for developer and operations teams. As part of the development cycle, developers can also watch the results in AWS Sagemaker Studio dashboards, grouped by project, to operate at scale. An example of a dashboard is shown in Figure 8:
Optionally, the metrics and their thresholds are sent to AWS Cloudwatch to implement alarms for operations teams.
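A threshold alarm on such a metric can be created with boto3, roughly as sketched below; the namespace, metric name, threshold, and SNS topic ARN are placeholders to adapt to your own monitoring setup:

```python
# Sketch only: assumes boto3 with valid AWS credentials, and a metric
# already published to CloudWatch by the monitoring jobs.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="model-accuracy-degradation",
    Namespace="MLObservability",          # placeholder custom namespace
    MetricName="accuracy",
    Statistic="Average",
    Period=3600,                          # matches the hourly monitoring cadence
    EvaluationPeriods=1,
    Threshold=0.8,                        # business-defined accuracy floor
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    TreatMissingData="notBreaching",
)
```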
AWS Model Monitor and AWS Clarify provide comprehensive model observability functionality. These services are well integrated into the AWS Sagemaker ecosystem and are a natural, modern part of the machine learning flow.
Gathering ground truth labels is a challenging task; moreover, the labels can be unavailable or delayed. Despite that, model performance can still be evaluated using novel approaches. This is possible because model performance is affected by several independent factors, such as data quality and concept drift. Therefore, in the absence of concept drift, performance degradation can be detected without gathering targets. Several such methods are implemented in the open-source Python library nannyml; the most effective are Confidence-Based Performance Estimation (CBPE) for classification and Direct Loss Estimation (DLE) for regression.
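The intuition behind confidence-based estimation fits in a few lines: if predicted probabilities are well calibrated, the expected accuracy of a window of predictions equals the mean confidence of the predicted classes, so degradation shows up without any labels. This is a deliberately simplified pure-Python sketch of the idea, not the nannyml API:

```python
def estimated_accuracy(pred_probas):
    """Confidence-based accuracy estimate (simplified CBPE idea).

    Assuming calibrated probabilities, the chance that a single prediction
    is correct equals the confidence of the predicted class, so expected
    accuracy is the mean confidence -- no ground truth labels required.
    """
    return sum(max(p, 1 - p) for p in pred_probas) / len(pred_probas)

# Reference window: confident, well-separated classifier scores
reference = [0.95, 0.9, 0.05, 0.1, 0.97, 0.08]
# Production window after drift: scores crowd the decision boundary
production = [0.55, 0.48, 0.6, 0.45, 0.52, 0.58]

print(round(estimated_accuracy(reference), 3))   # high estimated accuracy
print(round(estimated_accuracy(production), 3))  # estimate drops: likely degradation
```

nannyml's CBPE estimates full confusion-matrix metrics this way, chunk by chunk, and calibrates the probabilities first; the sketch above only captures the core assumption.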
The nannyml library can also calculate performance with ground truth labels.
The original reference architecture can be enriched with advanced performance monitoring if part of the AWS Model Metrics Drift Job is replaced with a custom AWS Lambda component that runs the nannyml library. This option is shown in Figure 9:
Grid Dynamics (GD) provides a Docker image with the nannyml library that runs on AWS Lambda. In this setup, data is streamed from S3 buckets, and the results are sent to AWS Cloudwatch, AWS Governance, and AWS Sagemaker Studio.
Besides the nannyml library and the Sagemaker Model Monitor service, another popular open-source library, Evidently.AI, can be used to evaluate model performance for regression and classification tasks. This library provides integrations with various MLOps tools. Choosing an appropriate library for a specific case depends on several factors, such as ground truth availability in production, monitoring frequency, and integrations with existing tools. Inspect the detailed diagram in Figure 10 to simplify the choice of library for your specific case:
The architectural and technological landscape of modern enterprises varies greatly from company to company, but often has one key factor in common: almost all of them are in the cloud. Business requirements, processes, and goals differ across the board as well; however, we have gathered the most common scenarios into a high-level adoption plan for introducing observability in different business environments. The plan is flexible and can be adjusted to specific needs:
First of all, before starting the ML observability journey, DataOps and MLOps should be part of the organization's culture: orchestrated data pipelines, a curated data lake, a data catalog, and data lineage tracking. On the ML side, model development should be fully automated in terms of tracking experiments, creating artifacts, deployment, and scaling. With that in place, further adoption can commence.
Modern enterprises traditionally run a full spectrum of ML models: recommendations, offers and discounts, next best action, attrition prevention, and so on. These models should be grouped into monitoring buckets such as bias drift, concept drift (e.g., monitoring performance degradation), data and model score drift, or feature attribution drift.
Once the groups are defined, the next step is to classify which data require monitoring by following the same exercise as above. Starting from dataset creation with Apache Spark, Flink, or plain SQL queries on top of a DWH, or modern services like AWS Athena, Glue, or EMR, the data engineering team creates the required assets. The data science team then leverages cloud services like AWS Sagemaker, GCP VertexAI, or Azure ML as a sandbox to run experiments, create ML models, and define observability scenarios. Further MLOps processes include creating the deployment pipeline and implementing observability scenarios in cloud monitoring tools or external tools like Dataiku. Runtime support can be implemented on top of cloud services like Azure ML, GCP VertexAI, or Kubernetes.
Finally, the end-to-end business use case will include: features required for monitoring, models and specific attributes that require tracking, and lastly, creation of observability rules as shown in the reference architecture in Figure 5.
Once these processes are complete, the next step is to fine-tune parameters, watch runtime statistics, and adjust metrics. Typically, it takes several months to fine-tune models in production.
The concept of model observability emerged just a few short years ago, but has since developed into sophisticated instrumentation for a vast variety of models and approaches. Our general recommendation is to start the model observability journey with data drift and explainability, which are not extremely complicated to implement but can provide a wealth of insights and improvements to the overall process. Grid Dynamics provides a starter kit to accelerate the process of:
This starter kit helps reduce time-to-market, from the very first observability check implementation right up to shipping it to production.
Get in touch with us to start a conversation about implementing observability in your ML environment.