Leveraging Databricks and Kubernetes for Efficient, Scalable Machine Learning

Feb 8

5 min read


Overview

I collaborated with a multi-cloud data consulting firm on this project, leveraging existing Azure Databricks infrastructure to build and manage a machine learning (ML) model and serve low-latency, interactive predictions.

This post explores using Azure Databricks for ML model management and Azure Kubernetes Service (AKS) for deployment on Microsoft Azure, with step-by-step guidance for creating a proof-of-concept (PoC) using open-source technologies that can be adapted to other workloads.


Considerations

Before implementing this solution, consider the following factors:


  • This solution requires a high level of customization and is best suited for teams with deep expertise in deploying and managing Kubernetes workloads. If your team lacks this expertise, consider using a service like Azure Machine Learning for model deployment.

  • Your analytics use case may require services or features not included in this design.

The Machine Learning DevOps Guide from Microsoft offers valuable best practices and insights on adopting enterprise-scale machine learning operations (MLOps).

 

High-level Design

High-level design

This high-level design leverages Azure Databricks and Azure Kubernetes Service (AKS) to build an MLOps platform that supports online and batch inference deployment patterns. The solution efficiently manages the entire machine learning lifecycle and integrates essential MLOps principles for developing, deploying, and monitoring models at scale.


Additionally, the solution addresses each stage of the machine learning lifecycle:

  • Data Preparation - involves sourcing, cleaning, and transforming data for analysis. The data can be stored in a data lake or warehouse and moved to a feature store after curation.

  • Model Development - includes key components of the model development process, such as experiment tracking and model registration using MLflow.

  • Model Deployment - involves implementing a continuous integration/continuous delivery (CI/CD) pipeline to build and deploy solutions for online inference workloads. Machine learning models are containerized as API services and deployed to an Azure Kubernetes cluster, with Azure API Management providing external access.

  • Model Monitoring - includes monitoring API performance and data drift by analyzing log telemetry through Azure Monitor.


Note that this high-level diagram doesn’t include security features, like firewalls or virtual networks, that large organizations need when adopting cloud services. Additionally, MLOps involves changes in people, processes, and technology, which may impact the services, features, or workflows your organization chooses, and are not covered in this design. The Microsoft Machine Learning DevOps Guide offers best practices to consider.

 

Proof of Concept Workflow

PoC workflow

The diagram above illustrates an end-to-end proof-of-concept (PoC) demonstrating how an MLflow model can be trained on Azure Databricks, packaged as a web service, deployed to Azure Kubernetes Service via CI/CD, and monitored within Microsoft Azure.

This PoC focuses on the online inference deployment pattern and features a simplified architecture compared to the high-level design shown earlier.

Required Services

  1. Azure Databricks workspace - to build ML models, track experiments, and manage ML models.

  2. Azure Kubernetes Service (AKS) - to deploy containers exposing the model as a web service to end users (one cluster for each of the staging and production environments).

  3. Azure Container Registry (ACR) - to manage and store Docker containers.

  4. Azure Repos - to store code for the project and enable automation by building and deploying artifacts.

  5. Azure Log Analytics Workspace (optional) - to query log telemetry in Azure Monitor.


Some services should be further configured as part of this PoC:

  • Azure Kubernetes Service - container insights should be enabled to collect metrics and logs from containers running on AKS; this telemetry is used to monitor API performance and analyze logs.

  • Azure Databricks - the files-in-repos feature should be enabled, and a cluster should be created for Data Scientists, Machine Learning Engineers, and Data Analysts to use for developing models.

  • Azure Repos - two environments should be created, one each for Staging and Production.

This PoC deploys all resources into a single resource group by default. For production, it’s recommended to use multiple resource groups across subscriptions for better security and governance (see Azure Enterprise-Scale Landing Zones), with services deployed via infrastructure as code (IaC).

 

Project Tree

The following folders and files play a key role in packaging and deploying the model API service:
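
The full repository layout isn’t reproduced here, but an illustrative structure along the following lines covers the pieces referenced in this post; only service/app and manifests/api.yaml are taken from the PoC, and the remaining names are assumptions.

manifests/
    api.yaml               # Kubernetes manifest for the model inference API
service/
    app/                   # FastAPI web service wrapping the MLflow model
notebooks/                 # assumed: Databricks training notebook(s)
configuration/             # assumed: JSON model deployment configuration
Dockerfile                 # assumed: builds the model inference API image
azure-pipelines.yml        # assumed: CI/CD pipeline definition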

 

Model Development

For this PoC, the model development process has been encapsulated in a single notebook, "TestNotebook". This model uses the IBM HR Analytics Employee Attrition & Performance dataset from Kaggle.

Training notebook in Azure Databricks

This notebook develops and registers an MLflow model for deployment, consisting of:

  • An ML model to predict the likelihood of employee attrition

  • A statistical model to determine data drift in features (optional)

  • A statistical model to determine outliers in features (optional)


In practice, model development requires more effort than shown in this notebook and often involves multiple notebooks. While key MLOps aspects like explainability, performance profiling, and pipelines are not included in this PoC, essential components like experiment tracking, model registration, and versioning are covered.

After running this notebook, the MLflow model will be registered, and training metrics will be captured in the MLflow model registry.
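
To make the tracking and registration steps concrete, here is a minimal sketch of what such a notebook can do; the dataset path, feature handling, and the registered model name "employee-attrition" are illustrative assumptions, not the actual notebook contents.

import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Load the curated attrition dataset (path is illustrative).
df = pd.read_csv("/dbfs/FileStore/hr_attrition.csv")
X = pd.get_dummies(df.drop(columns=["Attrition"]))
y = (df["Attrition"] == "Yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    # Track a training metric in the MLflow experiment.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("test_auc", auc)

    # Log the model and register it in the MLflow model registry.
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="employee-attrition",  # assumed model name
    )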

Registered model in Azure Databricks

Model Configuration

A JSON configuration file defines which model versions from the MLflow registry should be deployed in the API. Team members can update this file to select models for deployment and commit it to the Git repository when ready.

Configuration file structure
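
As an illustration of how the API can consume this file, the sketch below assumes a simple schema that lists each model's registry name and version; the file path and key names are assumptions, not the actual configuration format.

import json

import mlflow.pyfunc

# Illustrative configuration structure:
# {
#   "models": [
#     {"name": "employee-attrition", "version": "3"}
#   ]
# }
with open("configuration/model_config.json") as f:  # assumed path
    config = json.load(f)

# Load each selected model version from the MLflow model registry.
# Assumes the MLflow tracking/registry URI points at the Databricks workspace.
models = {
    m["name"]: mlflow.pyfunc.load_model(f"models:/{m['name']}/{m['version']}")
    for m in config["models"]
}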

 

Model Deployment

The PoC uses an Azure Pipeline to automatically build and deploy artifacts to AKS. The pipeline is triggered when commits are pushed to the main branch or when a pull request targets it.

CI/CD workflow

The CI/CD pipeline consists of two jobs:

  • Build - This job downloads model artifacts from the Databricks MLflow registry, packages the MLflow model with the FastAPI web service and dependencies into a Docker container image (model inference API), and stores it in ACR.

  • Staging - Triggered automatically after the Build job, this job deploys the container image to the AKS cluster in the staging environment, updating the model's state to "Staging" in the MLflow model registry.

For production scenarios, a CI/CD pipeline should include additional components such as unit tests, integration tests, code quality scans, security scans, and performance tests.
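
As an example of the first Build step, downloading the selected model's artifacts from the Databricks-hosted registry takes only a few lines of Python; the model name and destination path are assumptions, and authentication is expected via the standard DATABRICKS_HOST and DATABRICKS_TOKEN environment variables.

import mlflow
from mlflow.tracking import MlflowClient

# Point MLflow at the Databricks workspace that hosts the model registry.
mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks")

model_name = "employee-attrition"  # assumed registered model name
client = MlflowClient()
version = client.get_latest_versions(model_name, stages=["None"])[0].version

# Download the model artifacts so they can be baked into the Docker image.
local_path = mlflow.artifacts.download_artifacts(
    artifact_uri=f"models:/{model_name}/{version}",
    dst_path="service/app/model",  # assumed location in the image build context
)
print(f"Downloaded {model_name} version {version} to {local_path}")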

CI/CD Build job pipeline
CI/CD Staging job pipeline

A Kubernetes manifest file, defined in manifests/api.yaml, specifies the desired state of the model inference API that Kubernetes will manage.

Kubernetes manifest file

Prediction Service

The PoC prediction service exposes a single predict endpoint. When called, the request payload is parsed and the records are scored by the MLflow model. This service is defined in the service/app directory.

predict endpoint
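
The service code isn’t reproduced in this post, but a minimal FastAPI sketch of a predict endpoint along these lines (the record schema and the model path inside the container are assumptions) would look like this:

from typing import Dict, List

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI

app = FastAPI(title="Model inference API")

# Load the packaged MLflow model once at startup (path inside the container is assumed).
model = mlflow.pyfunc.load_model("model")

@app.post("/predict")
def predict(records: List[Dict]) -> List[float]:
    # Convert the incoming records to a DataFrame and score them with the MLflow model.
    data = pd.DataFrame(records)
    predictions = model.predict(data)
    return [float(p) for p in predictions]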

Once the model inference API is deployed, the Swagger UI can be accessed via the IP address of the AKS ingress controller, as shown below:

AKS ingress controller

Swagger UI
Swagger UI enables both development teams and end users to visualize and interact with the API's resources without needing any of the implementation logic in place.
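
Once the service is reachable through the ingress controller, a quick client-side check can be made with a few lines of Python; the ingress IP and the feature payload below are placeholders.

import requests

# Replace with the external IP of the AKS ingress controller.
url = "http://<ingress-ip>/predict"

records = [
    {"Age": 41, "MonthlyIncome": 5993, "OverTime": "Yes"},  # illustrative feature subset
]

response = requests.post(url, json=records, timeout=30)
response.raise_for_status()
print(response.json())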

About

Benjamin ("Benj") Tabares Jr. is an experienced data practitioner with a strong track record of successfully delivering short- and long-term projects in data engineering, business intelligence, and machine learning. Passionate about solving complex customer challenges, Benj leverages data and technology to create impactful solutions. He collaborates closely with clients and stakeholders to deliver scalable data solutions that unlock business value and drive meaningful insights from data.

