Building ML Pipelines to Learn from Data in Real-time (Survey)

Jin Cong Ho
7 min read · Feb 13, 2021

In this series, the author shares his hands-on journey building real-time ML pipelines using available open-source tools and documents the difficulties along the way.

How much value can you generate by updating models with real-time streams? Photo by Todd Trapani on Unsplash

In Machine learning is going real-time, author Chip Huyen classifies two levels of real-time machine learning systems:

  • Level 1: ML systems that can make predictions in real-time (online predictions)
  • Level 2: ML systems that can continuously learn from new data and update the model in real-time (online learning)
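To make the distinction concrete, here is a minimal pure-Python sketch (not from any particular library; all names are illustrative): a Level 1 model serves predictions from parameters frozen at training time, while a Level 2 model also updates itself on every new observation.

```python
class FixedMeanModel:
    """Level 1: predicts with parameters frozen at training time."""
    def __init__(self, mean):
        self.mean = mean

    def predict(self, _x):
        return self.mean

class OnlineMeanModel:
    """Level 2: refines its estimate with each incoming sample."""
    def __init__(self):
        self.n, self.mean = 0, 0.0

    def predict(self, _x):
        return self.mean

    def learn_one(self, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n  # incremental mean update

level1 = FixedMeanModel(11.0)       # trained once, then frozen

stream = [10.0, 12.0, 11.0, 13.0]   # new data arriving after deployment
level2 = OnlineMeanModel()
for y in stream:
    level2.learn_one(y)             # Level 2 adapts as data arrives

print(level1.predict(None), level2.predict(None))  # 11.0 11.5
```

The incremental-mean trick is the simplest possible "online learner": it never revisits old data, yet its state always reflects everything seen so far.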

Systems at the first level are capable of making relevant and timely predictions, such as ad ranking, fall detection and fraud detection. Predicting a stock’s value in real-time is more valuable than doing so a month later, right? For Level 1 ML systems, the technical challenges are mostly solved by maturing open-source tools such as Apache Flink (stream processing), Kafka (event-driven architecture), model compression, Kubeflow (workflow management) and Seldon (autoscaling model serving). All these tools have clear development roadmaps to continuously improve model inference speed and ease pipeline development.

The harder part, however, is implementing Level 2 ML systems that can do online learning and update models with new data in real-time. There is little discussion and no consensus in the MLOps community on how to build them yet. In fact, SIG MLOps from the CD Foundation lists online learning as ‘research required’ until 2024. So, how easy is it to build a real-time machine learning pipeline with current open-source tools?

1. The Benefits of Online-Learning

Wait, before we go further down this road, why do we want to do online learning at all? In recent years, the maturity of stream processors has allowed DataOps to (1) process real-time data in a scalable manner and (2) unify batch and stream processing, moving from the Lambda architecture to the simpler Kappa architecture. These processors provide constant streams of real-time data into our pipeline and make us question: can we extract more value from these data in real-time?

The answer (you’ve probably guessed it): yes, we can. Online learning updates our ML models more frequently to handle cases such as trending content on social media, rare events (Black Friday) and even cold starts (new user behaviours). It shines when you need to improve model performance quickly in response to dynamic data. As such, it’s worth asking: how much value can you generate by increasing the frequency of model updates?
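One way to see why per-observation updates help is the classic “test-then-train” (prequential) loop, sketched here with a toy online SGD regressor in pure Python (all names are illustrative): every sample is scored by the current model before the model learns from it, so the error trace shows the model adapting to the stream.

```python
# Illustrative sketch: a single-feature linear model updated one
# observation at a time (online SGD), evaluated "test-then-train" so
# every sample is scored before the model learns from it.

def online_sgd(stream, lr=0.1):
    w, b = 0.0, 0.0
    errors = []
    for x, y in stream:
        y_hat = w * x + b          # 1. predict with the current model
        errors.append(abs(y - y_hat))
        grad = y_hat - y           # 2. then update on the same sample
        w -= lr * grad * x
        b -= lr * grad
    return (w, b), errors

# Data drawn from y = 2x; the prediction error shrinks as the model adapts.
stream = [(x, 2.0 * x) for x in (1, 2, 3, 1, 2, 3, 1, 2, 3)]
(w, b), errors = online_sgd(stream)
print(round(w, 2), errors[0] > errors[-1])
```

The same predict-then-learn discipline is what streaming ML libraries use to report honest metrics: no sample is ever evaluated by a model that has already trained on it.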

2. Architecture for Real-time ML Pipelines

Okay, now that we understand the benefits, how should we design real-time ML pipelines? There are a few industry references available online and, unsurprisingly, these companies possess huge amounts of real-time user interaction data. Before looking at them, it’s worth reviewing offline ML pipelines, because evolving them into online pipelines still involves the same processes, except that the pipeline becomes a long-running continuous loop — ingestion, training and deployment.

(from: https://www.oreilly.com/library/view/building-machine-learning/9781492053187/ch01.html; online ML pipelines update the model continuously, although there may be some delay in deployment to validate models)
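The continuous loop can be sketched in a few lines of illustrative Python (the model, the micro-batch stream and the validation criterion are all invented for this example): each micro-batch flows through the same ingest → train → validate → deploy stages that an offline pipeline would run once.

```python
class MajorityClass:
    """Toy incremental model: predicts the most frequent label so far."""
    def __init__(self):
        self.counts = {}

    def learn(self, y):
        self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self):
        return max(self.counts, key=self.counts.get)

def run_pipeline(stream, model):
    """Ingest -> train -> validate -> deploy, repeated per micro-batch."""
    deployed = []
    for batch in stream:                      # ingestion
        for y in batch:
            model.learn(y)                    # training (incremental)
        if sum(model.counts.values()) >= 3:   # toy validation gate
            deployed.append(model.predict())  # "deployment": snapshot
    return deployed

print(run_pipeline([["cat"], ["dog", "dog"], ["dog"]], MajorityClass()))
# ['dog', 'dog'] — the first batch fails the gate, later snapshots ship
```

The validation gate is the "delay in deployment" from the figure caption: the model keeps learning continuously, but snapshots only go live once they pass a check.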

During Flink Forward Virtual 2020, Weibo (a social media platform) shared the design of WML, their real-time ML architecture and pipeline. Essentially, they integrated previously separate offline and online model training into a unified pipeline with Apache Flink. The talk focused on how Flink is used to generate training samples by joining offline data (social posts, user profiles, etc.) with multiple streams of real-time interaction events (click stream, read stream, etc.) and extracted multimedia content features. No specific model architecture was mentioned, but they are working towards online training of DNNs.

(Left) Unified pipeline using Flink for both online & offline | (Right) Online Model Training (from: https://www.ververica.com/blog/flink-for-online-machine-learning-and-real-time-processing-at-weibo)
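The sample-generation idea (enriching a stream of interaction events with static offline features to emit labelled training samples) can be sketched in pure Python. Field names such as `user` and `clicked` are invented for illustration; Weibo’s actual pipeline performs these joins with Flink at scale.

```python
user_profiles = {  # "offline" side of the join, loaded in advance
    "u1": {"age": 31, "interests": "sports"},
    "u2": {"age": 24, "interests": "music"},
}

events = [  # "online" side: real-time interaction stream
    {"user": "u1", "post": "p9", "clicked": 1},
    {"user": "u2", "post": "p7", "clicked": 0},
]

def generate_samples(events, profiles):
    """Join each event with the user's static features to make a sample."""
    for e in events:
        profile = profiles.get(e["user"])
        if profile is None:
            continue  # skip events we cannot enrich
        features = {**profile, "post": e["post"]}
        yield features, e["clicked"]  # (features, label) training sample

samples = list(generate_samples(events, user_profiles))
print(samples[0])  # ({'age': 31, 'interests': 'sports', 'post': 'p9'}, 1)
```

In a real deployment the hard parts are exactly what this sketch hides: watermarking, late events, and keeping the "offline" side of the join fresh.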

Meanwhile, PAI (Platform of Artificial Intelligence) is a product from Aliyun (a cloud vendor) that allows users to build online machine learning pipelines through a drag-and-drop visual UI. In early 2019, they released an FTRL (Follow The Regularised Leader) online-learning module on the platform to support recommender use cases, such as news recommendation. Later that year, on Alibaba’s (e-commerce) engineering blog, they shared how they used online training to predict Click-Through Rate (CTR) and Conversion Rate (CVR) during Singles’ Day.
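For reference, the FTRL-Proximal update (McMahan et al.) behind such online-learning modules can be sketched in pure Python; hyperparameter names follow the original paper, and the toy data below is mine, not Aliyun’s.

```python
import math

class FTRLLogistic:
    """Per-coordinate FTRL-Proximal for online logistic regression."""
    def __init__(self, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z, self.n = {}, {}  # sparse per-coordinate state

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps the weight exactly at zero
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict_one(self, x):
        s = sum(v * self._weight(i) for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-s))

    def learn_one(self, x, y):
        p = self.predict_one(x)
        for i, v in x.items():
            g = (p - y) * v  # logistic-loss gradient for this coordinate
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g

# Toy stream: the label is 1 whenever feature "a" is present.
model = FTRLLogistic(alpha=0.5)
for x, y in [({"a": 1.0}, 1), ({"b": 1.0}, 0)] * 50:
    model.learn_one(x, y)
print(model.predict_one({"a": 1.0}) > 0.5)  # True
```

FTRL’s appeal for recommender pipelines is that its state is a pair of sparse dicts: cheap to update per event, and with L1 it yields genuinely sparse models.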

Because many factors affect CTR/CVR, such as promotions and seasonality, the model uses a “multi-layer multi-frequency” approach. It evolved from XNN: part of the model uses a checkpoint of frozen embeddings, while other parts use changing embeddings and changing weights. The pipeline performs online training as well as real-time model updates (A/B tested). They run online evaluation using AUC to compare model accuracy against samples from 10 minutes before. (Alibaba also open-sourced the online FTRL algorithm as part of Alink, mentioned in the next section.)

(Left) PAI drag-and-drop visual builder for FTRL module (from: https://web.archive.org/web/20200815083452/https://developer.aliyun.com/article/688613) (Right) Model architecture for predicting Click Through Rate and Conversion Rate, with part of the model updated in real-time (from: https://developer.aliyun.com/article/741004)
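The kind of online evaluation described above, comparing AUC on recent samples, can be approximated with a sliding window over (label, score) pairs. This is an illustrative helper, not Alibaba’s actual implementation.

```python
from collections import deque

class WindowedAUC:
    """AUC over the most recent `window` (label, score) observations."""
    def __init__(self, window=1000):
        self.buf = deque(maxlen=window)  # old pairs fall off automatically

    def update(self, y_true, score):
        self.buf.append((y_true, score))

    def auc(self):
        pos = [s for y, s in self.buf if y == 1]
        neg = [s for y, s in self.buf if y == 0]
        if not pos or not neg:
            return None  # AUC is undefined without both classes
        # Probability a random positive outranks a random negative
        # (ties count half). O(pos * neg): fine for modest windows.
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

m = WindowedAUC(window=4)
for y, s in [(1, 0.9), (0, 0.2), (1, 0.8), (0, 0.4)]:
    m.update(y, s)
print(m.auc())  # 1.0 — every positive scored above every negative
```

Comparing this windowed AUC between the live model and a 10-minutes-older snapshot gives a rolling answer to "is the fresher model actually better on current traffic?".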

Around the same time, iQiyi (video streaming) shared their experience online training a Wide & Deep DNN for their streaming recommender system. They used a “lambda” approach: train the model offline on data from the last 7 days, then use that checkpoint as the starting point for daily online training on real-time ingestion. The online training makes only a single pass over the data, and the model is deployed hourly. They chose this architecture to correct the online model when it overfits local patterns, having observed performance degradation along with a high chance of OOV (out-of-vocabulary) features. As next steps, they are exploring replacing the Adam optimizer and changing the deployment frequency.

(Left) Online & Offline ML Pipeline Design (Middle) Model Architecture for Streaming Recommender System (Right) AUC performance improvement (yellow and orange are online trained-models) (from: https://www.infoq.cn/article/lthcdazelzgc639p1p5q)
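The warm-start idea (begin online training from an offline checkpoint, then make a single pass over fresh data before deploying) can be sketched with a toy linear model; every function here is an illustrative stand-in, not iQiyi’s code.

```python
import copy

def offline_train(history, lr=0.05):
    """Stand-in for the batch job over the last 7 days of data."""
    w = 0.0
    for _ in range(20):            # many passes are fine offline
        for x, y in history:
            w -= lr * (w * x - y) * x
    return {"w": w}

def online_one_pass(checkpoint, stream, lr=0.05):
    """One pass only: each fresh sample is seen exactly once."""
    model = copy.deepcopy(checkpoint)  # keep the checkpoint untouched
    for x, y in stream:
        model["w"] -= lr * (model["w"] * x - y) * x
    return model

history = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]   # "last 7 days": y = 2x
fresh = [(x, 2.5 * x) for x in (1.0, 2.0, 3.0)]     # drifted stream: y = 2.5x
ckpt = offline_train(history)
deployed = online_one_pass(ckpt, fresh)
print(ckpt["w"] < deployed["w"])  # True: the online pass tracked the drift
```

Because the daily run always restarts from the offline checkpoint, a bad online pass can only pollute one day’s model, which is exactly the self-correction property iQiyi was after.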

Looking at the examples above, several trends are clear:

  1. companies are unifying MLOps pipelines for both offline and online training
  2. they are moving towards online training of deeper models (DNNs)
  3. and experimenting with more frequent model deployment (<10 minutes, or even in real-time!)

3. Tools and Libraries

There are several groups of open-source libraries for implementing online training in ML pipelines. Some build on top of Flink, while others use Spark Streaming. Most of them require Java to run on stream processors, but some provide Python APIs or even run natively in Python (friendlier for developers from the Python deep learning ecosystem!).

  • River — designed from the ground up for online training: it can update models with a single observation and contains utilities to build preprocessing pipelines and evaluation metrics for streaming data. It is the library with the most online-learning algorithms, including Tree/FM/Linear/Naive Bayes/KNN classifiers & regressors, K-means clustering and Half-Space Trees anomaly detection. Its scikit-learn-style API will be familiar to many — the best choice if you prefer Python!
  • FlinkML — provides a few ML algorithms, but the older library was dropped in Flink 1.9. It is currently under redevelopment, utilising the Flink Table API to provide a pipeline interface and model serving. The progress is available here (only a few interfaces at the time of writing).
  • Alink — a collection of ML-algorithm operators released by Alibaba, built on top of FlinkML’s new pipeline interfaces. It has both Python and Java APIs. However, FTRL is the only online training algorithm; the others are batch algorithms. The best option if you prefer the Flink ecosystem.
  • Spark MLlib’s Streaming Algorithms — similar to FlinkML in providing several streaming ML algorithms, but in the Spark Streaming ecosystem. It has both Python and Scala APIs to build pipelines and export PMML models. It has also been under redevelopment since Spark 2.0 to utilise the DataFrame API. The older library contains online K-means clustering and linear regression.
  • Oryx — a lambda-architecture development platform using Kafka and Spark. It provides Batch, Speed and Serving layers to manage model training and updates (PMML). It currently contains ALS recommendation, K-means clustering and Random Forest classification/regression streaming algorithms. Unfortunately, there has been no major development since 2019.

Overall, all of these libraries are still under development, and most prioritise batch/offline training algorithms. River is the only library that prioritises online training and it has the most such algorithms, while the others, which started from offline training, may be more production-ready.

4. Challenges Ahead

This article explored the current state of common architectures and open-source tools for building real-time machine learning pipelines, focusing on the implementation and engineering perspective. In the rest of this series, I’ll build real-time ML pipelines and continuously explore ideas to improve them, such as comparing checkpoint + online training against 100% online training, real-time model deployment, etc.

Up Next: Part 2 (Coming Soon)

If you enjoyed this content, click the 👏 so more people will see this.


Jin Cong Ho

Building High Performance Systems for ML & Data Ecosystem. Stay Curious.