Instacart Speeds ML Deployments with Hybrid MLOps Platform – The New Stack

2022-08-13

Grocery delivery service Instacart recently launched a new machine learning platform, called Griffin, that tripled the number of ML applications the service deployed in a year.

Instacart began developing its machine learning infrastructure in 2016 with Lore, an open-source framework. After years of rapid growth led to an increase in the amount, diversity, and complexity of ML applications, Lore's monolithic architecture increasingly became a bottleneck.

This bottleneck challenge led to the development of Griffin, a hybrid, extensible platform that supports diverse data management systems, and integrates with multiple ML tools and workflows. Sahil Khanna’s recent blog post goes into great detail about Griffin, including its benefits, components, and workflows.

Instacart relies heavily on machine learning for product and operations innovations. Such innovations don't come easy, as multiple machine learning models often must work together to provide a service. Griffin, built by the machine learning infrastructure team, now plays a foundational role in supporting the company's machine learning applications and empowering innovation.

In short, Griffin offers several benefits to the service.

To allow Instacart to stay current with innovations in the state of the art of ML operations (MLOps) while also deploying specialized and diverse solutions, Griffin was designed as a hybrid platform. It allows machine learning engineers (MLEs) to utilize third-party solutions such as Snowflake, Amazon Web Services, Databricks, and Ray for diverse use cases, along with in-house abstraction layers that provide unified access to those solutions.
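The hybrid idea can be sketched as an in-house abstraction layer over swappable third-party backends. The sketch below is a hypothetical illustration, not Instacart's actual API; the class and method names are assumptions.

```python
from abc import ABC, abstractmethod

class QueryBackend(ABC):
    """In-house abstraction: one interface, many third-party backends."""
    @abstractmethod
    def run(self, sql: str) -> str: ...

class SnowflakeBackend(QueryBackend):
    def run(self, sql: str) -> str:
        # Real code would call the Snowflake connector here.
        return f"snowflake: {sql}"

class DatabricksBackend(QueryBackend):
    def run(self, sql: str) -> str:
        # Real code would submit a Spark SQL job here.
        return f"databricks: {sql}"

class UnifiedClient:
    """MLEs code against this one interface; the backend is swappable."""
    def __init__(self, backend: QueryBackend):
        self.backend = backend

    def query(self, sql: str) -> str:
        return self.backend.run(sql)

client = UnifiedClient(SnowflakeBackend())
result = client.query("SELECT 1")
```

Because application code only touches `UnifiedClient`, swapping Snowflake for Databricks (or a newer tool) is a one-line change, which is what lets the platform track MLOps innovations without rewrites.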

Griffin was created with the main goals of helping MLEs quickly iterate on machine learning models, effortlessly manage product releases, and closely track production applications. With that in mind, the system was built around several major design considerations.

The diagram below illustrates Griffin's system architecture.

The considerations are clearly illustrated in the diagram above. Griffin integrates multiple storage and SaaS solutions, including Redis, Scylla, and S3, demonstrating the extensibility that supports Instacart's growth and, in turn, the platform's scalability. The integrated interface for MLEs shows Griffin's generality.

Instacart can develop specialized solutions for distinct use cases (such as real-time recommendations) thanks to the four foundational components introduced below, each a distinct element of the platform.

MLCLI allows MLEs to customize and execute tasks such as training, evaluation, and inference in their applications within containers (Docker, for example). Containerization eliminates bugs caused by variations in execution environments and provides a unified interface.
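A minimal sketch of what such a CLI wrapper might do is shown below. MLCLI's real commands and flags aren't documented in the article, so the `mlcli run` invocation, image name, and flags here are assumptions for illustration.

```python
def build_docker_command(image: str, task: str, args: dict) -> list:
    """Build a `docker run` invocation that executes one ML task.

    Every task type (train, evaluate, infer) runs in the same image,
    so all stages see an identical execution environment.
    """
    cmd = ["docker", "run", "--rm", image, "mlcli", "run", task]
    for key, value in args.items():
        cmd += [f"--{key}", str(value)]
    return cmd

# Hypothetical usage: launch a training task with one parameter.
cmd = build_docker_command("my-ml-app:latest", "train", {"epochs": 10})
```

The point of the sketch is the unified interface: whatever the task, the surrounding tooling always produces the same kind of containerized command, which is what removes environment-specific bugs.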

The diagram below illustrates MLCLI features used by MLEs during ML application development.

Workflow Manager handles the scheduling and managing of the machine learning pipelines. It leverages Airflow to schedule containers and utilizes ML Launcher, an in-house abstraction, to containerize task execution.

ML Launcher integrates third-party compute backends such as SageMaker, Databricks, and Snowflake to perform container runs and meet the unique hardware requirements of ML workloads. Instacart chose this design because it allows scaling up to hundreds of Directed Acyclic Graphs (DAGs) with thousands of tasks in a short period without straining the Airflow runtime.
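The division of labor described above can be sketched as follows: Airflow only schedules, while the launcher picks a compute backend that satisfies each task's hardware needs. The backend capabilities and selection rule below are illustrative assumptions, not Instacart's actual logic.

```python
# Hypothetical capability table for the launcher's compute backends.
BACKENDS = {
    "sagemaker": {"gpu": True},
    "databricks": {"gpu": False},
    "snowflake": {"gpu": False},
}

def pick_backend(needs_gpu: bool) -> str:
    """Return the first backend matching the task's hardware requirement."""
    for name, caps in BACKENDS.items():
        if caps["gpu"] == needs_gpu:
            return name
    raise RuntimeError("no backend satisfies the requirement")

def launch(task: str, needs_gpu: bool = False) -> str:
    backend = pick_backend(needs_gpu)
    # Real code would containerize `task` and submit it to `backend`.
    # Because the heavy work runs remotely, Airflow only tracks task
    # state, so DAG count can grow without overloading the scheduler.
    return f"{task} -> {backend}"
```

For example, a GPU training task would route to SageMaker while a CPU-only batch job would land on Databricks, all behind one `launch` call.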

The diagram below illustrates the Architecture Design of Workflow Manager and ML Launcher.

With data at the center of any MLOps platform, Instacart developed its Feature Marketplace (FM) product to support both real-time and batch feature engineering. FM manages feature computation, provides feature storage, supports feature discoverability, eliminates offline/online feature drift, and allows feature sharing. The product uses third-party platforms such as Snowflake, Spark, and Flink, and integrates multiple storage backends: Scylla, Redis, and S3.
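One common way a feature store eliminates offline/online drift is to compute each feature once from a single definition and write the result to both stores; the sketch below illustrates that idea with toy in-memory stores. The feature, store names, and event shape are assumptions, not FM's actual implementation.

```python
offline_store = {}  # stands in for batch storage, e.g. S3-backed tables
online_store = {}   # stands in for serving storage, e.g. Redis or Scylla

def order_count(user_events: list) -> int:
    """A single feature definition shared by training and serving paths."""
    return sum(1 for e in user_events if e["type"] == "order")

def materialize(user_id: str, user_events: list) -> None:
    value = order_count(user_events)  # computed exactly once
    offline_store[user_id] = value    # read at training time
    online_store[user_id] = value     # read at low-latency inference time

materialize("u1", [{"type": "order"}, {"type": "view"}, {"type": "order"}])
```

Because both stores are fed from the same computation, the value a model sees in training is by construction the value it sees in production, which is what "eliminating drift" means in practice.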

The diagram below illustrates the Architecture Design of Feature Marketplace.

The Inference and Training Platform allows MLEs to define the model architecture and inference routine for each application; this flexibility is what allowed Instacart to triple the number of ML applications in one year. Instacart standardized package, metadata, and code management to support diverse frameworks and ensure reliable model deployment. Frameworks already adopted include TensorFlow, XGBoost, and Faiss.
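Standardized metadata management can be sketched as a registry in which every framework registers through the same record shape, so deployment tooling never has to special-case TensorFlow versus XGBoost versus Faiss. The field names and registry shape below are assumptions for illustration.

```python
import json

REGISTRY = {}  # toy stand-in for a persistent model registry

def register_model(name: str, version: str, framework: str,
                   artifact_uri: str) -> dict:
    """Record a model under a uniform schema, regardless of framework."""
    record = {
        "name": name,
        "version": version,
        "framework": framework,
        "artifact_uri": artifact_uri,
    }
    REGISTRY[(name, version)] = record
    return record

# Hypothetical usage: an XGBoost model and its artifact location.
rec = register_model("item_ranker", "1.0.0", "xgboost",
                     "s3://models/item_ranker/1.0.0")
serialized = json.dumps(rec, sort_keys=True)  # uniform metadata for any framework
```

Deployment code then only needs to understand one record format; adding a new framework means writing a new training/serving adapter, not new registry plumbing.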

The diagram below illustrates the Architecture Design of the Inference and Training Platform.

Some valuable lessons were learned during the development of Griffin.

The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.