
Leveraging Databricks and Kubernetes for Efficient, Scalable Machine Learning
Feb 8
5 min read

Overview
I collaborated with a multi-cloud data consulting firm on this project, leveraging existing Azure Databricks infrastructure to build and manage a machine learning (ML) model and to serve low-latency, interactive predictions.
This post explores using Azure Databricks for ML model management and Azure Kubernetes Service (AKS) for deployment on Microsoft Azure, with step-by-step guidance for creating a proof of concept (PoC) using open-source technologies that can be adapted to other workloads.
Considerations
Before implementing this solution, consider the following factors:
This solution requires a high level of customization and is best suited for teams with deep expertise in deploying and managing Kubernetes workloads. If your team lacks this expertise, consider a managed service such as Azure Machine Learning for model deployment.
Your analytics use case may require services or features not included in this design.
The Machine Learning DevOps Guide from Microsoft offers valuable best practices and insights on adopting enterprise-scale machine learning operations (MLOps).
High-level Design

This high-level design leverages Azure Databricks and Azure Kubernetes Service (AKS) to build an MLOps platform that supports online and batch inference deployment patterns. The solution efficiently manages the entire machine learning lifecycle and integrates essential MLOps principles for developing, deploying, and monitoring models at scale.
Additionally, the solution addresses each stage of the machine learning lifecycle:
Data Preparation - involves sourcing, cleaning, and transforming data for analysis. The data can be stored in a data lake or warehouse and moved to a feature store after curation.
Model Development - includes key components of the model development process, such as experiment tracking and model registration using MLflow.
Model Deployment - involves implementing a continuous integration/continuous delivery (CI/CD) pipeline to build and deploy solutions for online inference workloads. Machine learning models are containerized as API services and deployed to an Azure Kubernetes cluster, with Azure API Management providing external access.
Model Monitoring - includes monitoring API performance and data drift by analyzing log telemetry through Azure Monitor.
Note that this high-level diagram doesn’t include security features, like firewalls or virtual networks, that large organizations need when adopting cloud services. Additionally, MLOps involves changes in people, processes, and technology, which may impact the services, features, or workflows your organization chooses, and are not covered in this design. The Microsoft Machine Learning DevOps Guide offers best practices to consider.
Proof of Concept Workflow

The diagram above illustrates an end-to-end proof-of-concept (PoC) demonstrating how an MLflow model can be trained on Azure Databricks, packaged as a web service, deployed to Azure Kubernetes Service via CI/CD, and monitored within Microsoft Azure.
This PoC focuses on the online inference deployment pattern and features a simplified architecture compared to the high-level design shown earlier.
Required Services
Azure Databricks workspace - to build ML models, track experiments, and manage registered models.
Azure Kubernetes Service (AKS) - to deploy containers that expose the model as a web service to end users (one cluster each for the staging and production environments).
Azure Container Registry (ACR) - to manage and store Docker containers.
Azure Repos - to store the project's code and enable automation that builds and deploys artifacts.
Azure Log Analytics Workspace (optional) - to query log telemetry in Azure Monitor.
Some services require further configuration as part of this PoC:
Azure Kubernetes Service - enable Container insights to collect metrics and logs from containers running on AKS; these are used to monitor API performance and analyze logs.
Azure Databricks - enable the Files in Repos feature and create a cluster for data scientists, machine learning engineers, and data analysts to use when developing models.
Azure DevOps - create two environments (under Pipelines), one for Staging and one for Production, to be targeted by the deployment pipeline.
This PoC deploys all resources into a single resource group by default. For production, it's recommended to use multiple resource groups across subscriptions for better security and governance (see Azure Enterprise-Scale Landing Zones), with services deployed via infrastructure as code (IaC).
Project Tree
The following folders and files play a key role in packaging and deploying the model API service:

Model Development
For this PoC, the model development process has been encapsulated in a single notebook, "TestNotebook". This model uses the IBM HR Analytics Employee Attrition & Performance dataset from Kaggle.

This notebook develops and registers an MLflow model for deployment, consisting of:
An ML model to predict the likelihood of employee attrition
A statistical model to determine data drift in features (optional)
A statistical model to determine outliers in features (optional)
In practice, model development requires more effort than shown in this notebook and often involves multiple notebooks. While key MLOps aspects like explainability, performance profiling, and pipelines are not included in this PoC, essential components like experiment tracking, model registration, and versioning are covered.
After running this notebook, the MLflow model will be registered, and training metrics will be captured in the MLflow model registry.
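For illustration, here is a minimal sketch of what the experiment-tracking and registration steps might look like in a Databricks notebook. The file path, feature handling, and registered model name are assumptions, not the PoC's actual code:

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative only: the file path, feature handling, and registered model
# name below are assumptions, not taken from the PoC notebook.
df = pd.read_csv("/dbfs/FileStore/ibm_hr_attrition.csv")
X = pd.get_dummies(df.drop(columns=["Attrition"]))
y = (df["Attrition"] == "Yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    # Track the evaluation metric alongside the run.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("test_auc", auc)

    # Log the model and register it in the MLflow Model Registry so the
    # CI/CD pipeline can later pull a specific version for packaging.
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="employee-attrition",  # assumed name
    )
```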

Model Configuration
A JSON configuration file defines which model versions from the MLflow registry should be deployed in the API. Team members can update this file to select models for deployment and commit it to the Git repository when ready.
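The exact schema isn't shown here, but a minimal sketch of the idea might look like the snippet below, which resolves the configured versions against the MLflow registry. The config path, field names, and model name are assumptions:

```python
import json

from mlflow.tracking import MlflowClient

# Hypothetical config shape - the actual file in the repo may differ:
# {
#   "models": [
#     {"name": "employee-attrition", "version": "3"}
#   ]
# }
with open("configuration/model_config.json") as f:  # assumed path
    config = json.load(f)

client = MlflowClient()
for entry in config["models"]:
    # Look up each configured model version in the registry before deployment.
    mv = client.get_model_version(name=entry["name"], version=entry["version"])
    print(mv.name, mv.version, mv.current_stage, mv.source)
```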

Model Deployment
The PoC uses an Azure Pipeline to automatically build and deploy artifacts to AKS. The pipeline is triggered when commits are pushed to, or a pull request is opened against, the main branch.

The CI/CD pipeline consists of two jobs:
Build - This job downloads model artifacts from the Databricks MLflow registry, packages the MLflow model with the FastAPI web service and dependencies into a Docker container image (model inference API), and stores it in ACR.
Staging - Triggered automatically after the Build job, this job deploys the container image to the AKS cluster in the staging environment, updating the model's state to "Staging" in the MLflow model registry.
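As a rough sketch, the scripted parts of these two jobs might look like the following, with the model name, version, artifact destination, and authentication setup as assumed values:

```python
import mlflow
from mlflow.tracking import MlflowClient

# The pipeline authenticates against the Databricks-hosted registry,
# e.g. via DATABRICKS_HOST / DATABRICKS_TOKEN environment variables.
mlflow.set_tracking_uri("databricks")

model_name, model_version = "employee-attrition", "3"  # assumed values from the JSON config

# Build job: pull the registered model's artifacts so they can be copied
# into the Docker image alongside the FastAPI service.
local_path = mlflow.artifacts.download_artifacts(
    artifact_uri=f"models:/{model_name}/{model_version}",
    dst_path="service/app/model",  # assumed location inside the image
)
print(f"Model artifacts downloaded to {local_path}")

# Staging job: after the image is rolled out to the staging AKS cluster,
# reflect the deployment in the registry.
MlflowClient().transition_model_version_stage(
    name=model_name,
    version=model_version,
    stage="Staging",
)
```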
For production scenarios, a CI/CD pipeline should include additional components such as unit tests, integration tests, code quality scans, security scans, and performance tests.
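For example, a minimal unit test against the prediction service could use FastAPI's test client; the module path, payload fields, and response shape below are assumptions:

```python
from fastapi.testclient import TestClient

from service.app.main import app  # assumed module path for the FastAPI app

client = TestClient(app)


def test_predict_returns_a_score_per_record():
    # Payload fields are illustrative; a real test would use a known-good sample
    # and assert on the expected prediction values.
    payload = {"records": [{"Age": 41, "MonthlyIncome": 5000, "OverTime": "Yes"}]}
    response = client.post("/predict", json=payload)
    assert response.status_code == 200
    assert len(response.json()["predictions"]) == 1
```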


A Kubernetes manifest file, defined in manifests/api.yaml, specifies the desired state of the model inference API that Kubernetes will manage.

Prediction Service
The PoC prediction service exposes a single predict endpoint. When it is called, the request payload is parsed and the records are scored by the MLflow model. This service is defined in the service/app directory.
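The service code isn't reproduced here, but a minimal sketch of such a FastAPI app, assuming the MLflow model is loaded with mlflow.pyfunc and requests carry a list of records, could look like this:

```python
# Illustrative sketch of service/app/main.py, not the PoC's actual code.
from typing import Any, Dict, List

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Model inference API")

# Assumed location of the MLflow model copied into the container during the Build job.
model = mlflow.pyfunc.load_model("model")


class PredictionRequest(BaseModel):
    records: List[Dict[str, Any]]


@app.post("/predict")
def predict(request: PredictionRequest):
    # Convert the incoming records to a DataFrame and score them with the model.
    frame = pd.DataFrame(request.records)
    predictions = model.predict(frame)
    # Coerce to native Python types so the response serializes cleanly to JSON.
    return {"predictions": pd.Series(predictions).tolist()}
```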

Once the model inference API is deployed, the Swagger UI can be accessed via the IP address of the AKS ingress controller, as shown below:


The Swagger UI enables both development teams and end users to visualize and interact with the API's resources without needing access to the underlying implementation.
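The endpoint can also be called directly, for example with Python's requests library; the ingress IP and payload fields below are placeholders:

```python
import requests

# Replace with the external IP address of the AKS ingress controller.
url = "http://<ingress-ip>/predict"

# Illustrative payload; field names depend on the features the model was trained on.
payload = {"records": [{"Age": 41, "MonthlyIncome": 5000, "OverTime": "Yes"}]}

response = requests.post(url, json=payload, timeout=10)
print(response.json())
```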
About

Benjamin ("Benj") Tabares Jr. is an experienced data practitioner with a strong track record of successfully delivering short- and long-term projects in data engineering, business intelligence, and machine learning. Passionate about solving complex customer challenges, Benj leverages data and technology to create impactful solutions. He collaborates closely with clients and stakeholders to deliver scalable data solutions that unlock business value and drive meaningful insights from data.