MLOps Architectural Models: An Advanced Guide to MLOps in Practice
Learn about MLOps architectural challenges and ways to manage complexity through this end-to-end guide for MLOps app development with a financial institution use case.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Enterprise AI: The Emerging Landscape of Knowledge Engineering.
AI continues to transform businesses, but this leads to enterprises facing new challenges in terms of digital transformation and organizational changes. Based on a 2023 Forbes report, those challenges can be summarized as follows:
- Companies whose tech stacks are built around analytical/batch workloads need to start adapting to real-time data processing (Forbes). This change affects not only the way the data is collected, but it also leads to the need for new data processing and data analytics architectural models.
- AI regulations need to be considered as part of AI/ML architectural models. According to Forbes, "Gartner predicts that by 2025, regulations will force companies to focus on AI ethics, transparency, and privacy." Hence, those platforms will need to comply with upcoming standards.
- Specialized AI teams must be built, and they should be capable of not only building and maintaining AI platforms but also collaborating with other teams to support models' lifecycles through those platforms.
The answer to these new challenges seems to be MLOps, or machine learning operations. MLOps builds on DevOps and DataOps to facilitate the delivery of machine learning (ML) applications and to better manage the complexity of ML systems. The goal of this article is to provide a systematic overview of MLOps architectural challenges and demonstrate ways to manage that complexity.
MLOps Application: Setting Up the Use Case
For this article, our example use case is a financial institution that has been conducting macroeconomic forecasting and investment risk management for years. Currently, the forecasting process is based on partially manual loading and postprocessing of external macroeconomic data, followed by statistical modeling using various tools and scripts based on personal preferences.
However, according to the institution's management, this process is not acceptable due to recently announced banking regulations and security requirements. In addition, the delivery of calculated results is too slow and too costly compared to competitors in the market. Investment in a new digital solution requires a good understanding of the complexity and the expected cost. It should start with gathering requirements and subsequently building a minimum viable product.
Requirements Gathering
For solution architects, the design process starts with a specification of problems that the new architecture needs to solve — for example:
- Manual data collection is slow, error-prone, and requires a lot of effort
- Real-time data processing is not part of the current data loading approach
- There is no data versioning and, hence, reproducibility is not supported over time
- The model's code is triggered manually on local machines and constantly updated without versioning
- Data and code sharing via a common platform is completely missing
- The forecasting process is not represented as a business process; the steps are distributed and unsynchronized, and most require manual effort
- Experiments with the data and models are not reproducible and not auditable
- Scalability is not supported in case of increased memory consumption or CPU-heavy operations
- Monitoring and auditing of the whole process are currently not supported
The following diagram demonstrates the four main components of the new architecture: monitoring and auditing platform, model deployment platform, model development platform, and data management platform.
Figure 1. MLOps architecture diagram
Platform Design Decisions
The two main strategies to consider when designing an MLOps platform are:
- Developing from scratch vs. selecting a platform
- Choosing between a cloud-based, on-premises, or hybrid model
Developing From Scratch vs. Choosing a Fully Packaged MLOps Platform
Building an MLOps platform from scratch is the most flexible solution. It would provide the possibility to solve any future needs of the company without depending on other companies and service providers. It would be a good choice if the company already has the required specialists and trained teams to design and build an ML platform.
A prepackaged solution would be a good option to model a standard ML process that does not need many customizations. One option would even be to buy a pretrained model (e.g., model as a service), if available on the market, and build only the data loading, monitoring, and tracking modules around it. The disadvantage of this type of solution is that if new features need to be added, it might be hard to achieve those additions on time.
Buying a platform as a black box often requires building additional components around it. An important criterion to consider when choosing a platform is the possibility to extend or customize it.
Cloud-Based, On-Premises, or Hybrid Deployment Model
Cloud-based solutions are already on the market, with popular options provided by AWS, Google, and Azure. When there are no strict data privacy requirements or regulations, cloud-based solutions are a good choice due to their practically unlimited infrastructure resources for model training and model serving. An on-premises solution would be acceptable for very strict security requirements or if the infrastructure is already available within the company. The hybrid solution is an option for companies that already have part of the systems built but want to extend them with additional services — e.g., to buy a pretrained model and integrate it with locally stored data or incorporate it into an existing business process model.
MLOps Architecture in Practice
The financial institution from our use case does not have enough specialists to build a professional MLOps platform from scratch, but it also does not want to invest in an end-to-end managed MLOps platform due to regulations and additional financial restrictions. The institution's architectural board has decided to adopt an open-source approach and buy tools only when needed. The architectural concept centers on minimalistic components composed into a larger system, with microservices covering nonfunctional requirements like scalability and availability. Striving for maximal simplicity of the system, the following decisions for the system components were made.
Data Management Platform
The data collection process will be fully automated. There will be a separate data loading component for each data source due to the heterogeneity of external data providers. The database choice is crucial when it comes to writing real-time data and reading large amounts of data. Due to the time-based nature of the macroeconomic data and the institution's already available relational database specialists, the institution chose the open-source time-series database TimescaleDB.
The possibility to provide a standard SQL-based API, perform data analytics, and conduct data transformations using standard relational database GUI clients will decrease the time to deliver a first prototype of the platform. Data versions and transformations can be tracked and saved into separate data versions or tables.
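Because TimescaleDB speaks standard SQL, the data versioning approach described above can be sketched with plain SQL statements. The following is a minimal, illustrative sketch only: SQLite stands in for TimescaleDB so the snippet is self-contained, and the table and column names are hypothetical, not part of the institution's actual schema.

```python
import sqlite3

# Illustrative sketch: SQLite stands in for TimescaleDB (both accept
# standard SQL); table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE macro_observations (
        observed_at  TEXT NOT NULL,    -- timestamp of the data point
        indicator    TEXT NOT NULL,    -- e.g., 'cpi', 'gdp_growth'
        value        REAL NOT NULL,
        data_version INTEGER NOT NULL  -- supports reproducible reads over time
    )
""")

# Load two versions of the same series; old versions are never overwritten.
rows = [
    ("2024-01-01", "cpi", 3.1, 1),
    ("2024-01-01", "cpi", 3.2, 2),  # revised figure arrives later
]
conn.executemany("INSERT INTO macro_observations VALUES (?, ?, ?, ?)", rows)

def read_series(conn, indicator, version):
    """Read a series exactly as it looked at a given data version."""
    cur = conn.execute(
        "SELECT observed_at, value FROM macro_observations "
        "WHERE indicator = ? AND data_version <= ? "
        "ORDER BY observed_at, data_version DESC",
        (indicator, version),
    )
    # Keep only the newest row per timestamp within the version bound.
    latest = {}
    for ts, value in cur:
        latest.setdefault(ts, value)
    return latest

print(read_series(conn, "cpi", 1))  # original figure
print(read_series(conn, "cpi", 2))  # revised figure
```

Keeping every version as an append-only row is what makes experiments reproducible and auditable over time, which directly addresses two of the gathered requirements.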
Model Development Platform
The model development process consists of four steps:
- Data reading and transformation
- Model training
- Model serialization
- Model packaging
Once the model is trained, the parametrized and trained instance is usually stored as a packaged artifact. The most common solution for code storage and versioning is Git. Furthermore, the financial institution is already equipped with a solution like GitHub, which provides functionality to define pipelines for building, packaging, and publishing the code. The architecture of Git-based systems usually relies on a set of distributed worker machines executing the pipelines. That option will be used as part of the minimalistic MLOps architectural prototype to also train the model.
After training a model, the next step is to store it in a model repository as a released and versioned artifact. Storing the model in a database as a binary file, a shared file system, or even an artifacts repository are all acceptable options at that stage. Later, a model registry or a blob storage service could be incorporated into the pipeline. A model's API microservice will expose the model's functionality for macroeconomic projections.
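A file-system-based model repository of the kind described above can be sketched in a few lines. The directory layout and the helper names (`publish`, `load`) are assumptions for illustration, not an existing library's API; a blob storage service or a dedicated model registry could replace the storage layer later without changing callers.

```python
import json
import pathlib
import pickle
import tempfile

# Hypothetical repository root; in production this would be a shared
# file system, a database, or a blob store.
REPO = pathlib.Path(tempfile.mkdtemp()) / "model-repo"

def publish(name: str, version: str, model: object) -> pathlib.Path:
    """Store a trained model as a released, versioned artifact."""
    artifact_dir = REPO / name / version
    artifact_dir.mkdir(parents=True)
    (artifact_dir / "model.pkl").write_bytes(pickle.dumps(model))
    (artifact_dir / "manifest.json").write_text(
        json.dumps({"name": name, "version": version})
    )
    return artifact_dir

def load(name: str, version: str) -> object:
    """Fetch a specific released version, e.g., from the model's API service."""
    return pickle.loads((REPO / name / version / "model.pkl").read_bytes())

publish("macro-forecast", "1.0.0", {"slope": 2.0, "intercept": 0.03})
print(load("macro-forecast", "1.0.0"))
```

Because each version lives at an immutable path, the serving microservice can pin an exact model version, which keeps deployments reproducible and auditable.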
Model Deployment Platform
The decision to keep the MLOps prototype as simple as possible applies to the deployment phase as well. The deployment model is based on a microservices architecture. Each model can be deployed using a Docker container as a stateless service and be scaled on demand. That principle applies for the data loading components, too. Once that first deployment step is achieved and dependencies of all the microservices are clarified, a workflow engine might be needed for orchestrating the established business processes.
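A container image for such a stateless model service might look like the following sketch. The file names, the `serve.py` entry point, and the port are assumptions for illustration, not the institution's actual setup.

```dockerfile
# Hedged sketch of a stateless model-serving container; all file names
# and the port are hypothetical.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY serve.py model.pkl ./
# Stateless by design: no volumes, no local state, safe to scale
# horizontally behind a load balancer.
EXPOSE 8080
CMD ["python", "serve.py"]
```

Because the container holds no state, any orchestrator can add or remove replicas on demand, which is exactly the scalability property the requirements called for.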
Model Monitoring and Auditing Platform
Traditional microservices architectures are already equipped with tools for gathering, storing, and monitoring log data. Tools like Prometheus, Kibana, and Elasticsearch are flexible enough for producing specific auditing and performance reports.
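For log data to be useful for auditing, the microservices should emit it in a structured form that shippers can forward to Elasticsearch. The following is a minimal sketch using only the Python standard library; the logger name and the event fields are illustrative assumptions.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line,
    which log shippers can forward to Elasticsearch as-is."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "event": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
audit = logging.getLogger("model-api")   # hypothetical service name
audit.addHandler(handler)
audit.setLevel(logging.INFO)

# Every prediction request leaves an auditable trace.
audit.info("forecast served model=macro-forecast version=1.0.0 latency_ms=12")
```

Emitting one JSON object per line keeps the services decoupled from the monitoring stack: Prometheus scrapes metrics separately, while the audit trail flows through the log pipeline unchanged.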
Open-Source MLOps Platforms
A minimalistic MLOps architecture is a good start for the initial digital transformation of a company. However, keeping track of available MLOps tools in parallel is crucial for the next design phase. The following table provides a summary of some of the most popular open-source tools.
Table 1. Open-source MLOps tools for initial digital transformations
Tool | Description | Functional Areas |
Kubeflow | Makes deployments of ML workflows on Kubernetes simple, portable, and scalable | Tracking and versioning, pipeline orchestration, and model deployment |
MLflow | Is an open-source platform for managing the end-to-end ML lifecycle | Tracking and versioning |
BentoML | Is an open standard and SDK for AI apps and inference pipelines; provides features like auto-generation of API servers, REST APIs, gRPC, and long-running inference jobs; and offers auto-generation of Docker container images | Tracking and versioning, pipeline orchestration, model development, and model deployment |
TensorFlow Extended (TFX) | Is a production-ready platform; is designed for deploying and managing ML pipelines; and includes components for data validation, transformation, model analysis, and serving | Model development, pipeline orchestration, and model deployment |
Apache Airflow, Apache Beam | Are flexible frameworks for defining and scheduling complex workflows, data workflows in particular, including ML | Pipeline orchestration |
Summary
MLOps is often called DevOps for machine learning, and it is essentially a set of architectural patterns for ML applications. However, despite the similarities with many well-known architectures, the MLOps approach brings some new challenges for MLOps architects. On one side, the focus must be on the compatibility and composition of MLOps services. On the other side, AI regulations will force existing systems and services to constantly adapt to new regulatory rules and standards. I suspect that as the MLOps field continues to evolve, a new type of service providing AI ethical and regulatory analytics will soon become the focus of businesses in the ML domain.