State-of-the-art open-source and homegrown feature stores that generate, manage and serve features at scale

Photo Courtesy: dlanor_s on Unsplash

Feature data (or simply, features) are critical to the accuracy of predictions made by Machine Learning (ML) models. Feature stores have recently emerged as an important component of the ML stack, and they generally enable the following tasks as part of the ML workflow:

  1. Automate feature computation, e.g. backfill, UDF

A survey of “state-of-the-art” feature stores consolidates and compares the major “feature store”-like systems. As noted, many tech companies have…

Open-source tool that migrates from other replication tools to MirrorMaker

Photo Courtesy: Jeremy Bishop on Unsplash

This article covers tools and tips for migrating from other cross-cluster Kafka replication tools to the new MirrorMaker (or “MirrorMaker 2”).

The new MirrorMaker has several advantages. To name a few:

  • always open-source in Apache Kafka ecosystem


Recently, users from the community have been migrating from Confluent Replicator (an enterprise commercial “cross-cluster” replication tool) to MirrorMaker, and they have been facing the following problem:

messages which are already replicated by Confluent Replicator are…

Tackle cross-cluster transaction problems in Kafka with code

Photo Courtesy: Nathan Dumlao on Unsplash

“Exactly-once” semantics is a challenging problem in distributed systems. Notable protocols and algorithms that address it include two-phase commit, Paxos, and Raft. The problem becomes even harder when it spans two instances of a distributed system.

Apache Kafka introduced “exactly-once” semantics (a.k.a. transactions) within a single instance, or cluster, three years ago and has kept iterating on it since: KIP-447, KIP-360, KIP-588.
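One building block behind single-cluster exactly-once delivery is idempotent writing: the broker remembers the last sequence number it accepted from each producer and drops retried duplicates. The following is a minimal in-memory sketch of that idea only. `ToyPartitionLog` and its methods are hypothetical names for illustration, not Kafka APIs, and the sketch leaves out transactions, epochs and persistence.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPartitionLog:
    """In-memory stand-in for one partition; hypothetical, not a Kafka API."""
    records: list = field(default_factory=list)
    # Highest sequence number accepted so far, keyed by producer id.
    last_seq: dict = field(default_factory=dict)

    def append(self, producer_id: int, seq: int, value: str) -> bool:
        """Append a record once per (producer_id, seq); return False for a duplicate."""
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return False  # a retry of a record that was already written
        if seq != last + 1:
            raise ValueError(f"out-of-order sequence {seq}, expected {last + 1}")
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return True


log = ToyPartitionLog()
log.append(producer_id=7, seq=0, value="order-1")
log.append(producer_id=7, seq=0, value="order-1")  # simulated retry, ignored
log.append(producer_id=7, seq=1, value="order-2")
print(log.records)  # ['order-1', 'order-2']
```

The bookkeeping here lives inside one log, which hints at why the problem gets harder across two clusters: the second cluster has no view of the first cluster's sequence state.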

Large-scale enterprise use cases typically run more than one Kafka instance. Common scenarios for hosting multiple Kafka instances include (but are not limited to):

(1) disaster recovery

(2) multiple “local” instances for locality, then “aggregate”…

A step-by-step walkthrough with Kubernetes deployment script

Image Courtesy of athree23 on

Recently included in Apache Kafka and introduced in my previous blog, the new MirrorMaker has become the officially certified open-source tool for replicating data between two Kafka instances across datacenters.

To get first-hand experience with the new MirrorMaker, this article walks through an end-to-end deployment on local Kubernetes.

As a prerequisite, Minikube and a Virtual Machine Monitor (e.g. VirtualBox, VMware Fusion…) need to be installed locally before the following steps.

Note: the scripts used below may also work in a Kubernetes cluster, but they do not guarantee a production-quality deployment.

Step 1: start local Kubernetes

minikube start --driver=<driver_name>…

Introduction of a new cross-datacenter replication tool for Apache Kafka

Image Courtesy of sumanley on


Apache Kafka is the de facto data streaming platform for high-performance data pipelines, streaming analytics and mission-critical applications. For enterprises, as the business continues to grow, many scenarios require evolving from one Kafka instance to multiple instances. For example, critical services can be migrated to dedicated instances to achieve better performance and isolation and to satisfy a Service Level Agreement or Objective.

Another example is Disaster Recovery (DR) — the instance in a primary datacenter is continuously mirrored to the backup datacenter. …

As introduced in our previous posts (link 1, link 2), many applications are powered by the highly scalable and distributed streaming platform, Apache Kafka. Evolving at high speed, Kafka has reached a new milestone, release 0.10. With this release, Kafka and its ecosystem have reached a new level of maturity. In this post, I would like to share our recent interesting results with the Kafka 0.10 release. The next post will focus more on the streaming, big data and Hadoop ecosystem around Kafka.

Kafka 0.10 and its Downstream Consumers

Consumer offset storage: Kafka topic

In earlier Kafka releases (before 0.8.2), consumers committed their offsets to ZooKeeper. During the last holiday season…

In our previous blog, we introduced “why” we migrated the Kafka service at Walmart from shared bare-metal machines to the new “self-serving” Kafka deployment powered by OpenStack and OneOps. Today, I would like to introduce what the Kafka ecosystem looks like “under the hood”.

Kafka Ecosystem at Walmart

The above picture does not capture all of the real-time pipelines, but aims to highlight the key components and the relationships among them.

Core Services

  • Kafka Brokers: we are currently rolling out a Kafka version with the suggested JVM parameters, to take advantage of better stability and reliability compared to the 0.8 family.

Photo Credit: Apache Kafka

Many top organizations have reportedly benefited from Service Oriented Architecture (SOA), and Walmart also rebuilt its eCommerce website based on SOA and elastic cloud. An important subset of SOA is message-driven architecture, which serves as a channel for asynchronous communication to decouple bundled components. The result is a more scalable and efficient architecture in which each component or service can be independently crafted and scaled out, communicating with others through the messaging platform.

Traditional message queues used to be the solution for message-driven architecture, but they have become inherently flawed when it…

OneOps is a multi-cloud and open-source orchestration platform for DevOps that has the following major advantages:

  • DevOps orchestration: integrate popular open-source or free DevOps tools and orchestrate them on a nice web UI.


Use fewer words to make a bigger impact
