Feature data (or simply, features) are critical to the accuracy of predictions made by Machine Learning (ML) models. Feature stores have recently emerged as an important component of the ML stack, and they generally enable the following tasks as part of an ML workflow:
For a survey of “state-of-the-art” feature stores, https://www.featurestore.org/ consolidates and compares the major “feature store”-like systems. As noted, many tech companies have…
This article covers tools and tips for migrating from other cross-cluster Kafka replication tools to the new MirrorMaker (or “MirrorMaker 2”).
The new MirrorMaker has several advantages. To name a few:
messages which are already replicated by Confluent Replicator is…
“Exactly-once” semantics is a challenging problem in distributed systems. Notable protocols and algorithms that address it include two-phase commit, Paxos, and Raft. The problem becomes even harder when it spans two instances of a distributed system.
Large-scale enterprise use cases typically run more than one Kafka instance. Common scenarios for hosting multiple Kafka instances include (but are not limited to):
Recently included in Apache Kafka and introduced in my previous blog, the new MirrorMaker has become the officially certified open-source tool for replicating data between two Kafka instances across datacenters.
To get first-hand experience with the new MirrorMaker, this article walks through an end-to-end deployment on local Kubernetes.
Note: the scripts used below may be run in a Kubernetes cluster, but they do not guarantee a production-quality deployment.
minikube start --driver=<driver_name>…
Apache Kafka is the de facto data streaming platform for high-performance data pipelines, streaming analytics, and mission-critical applications. As an enterprise's business continues to grow, many scenarios require evolving from one Kafka instance to multiple instances. For example, critical services can be migrated to dedicated instances to achieve better performance and isolation, and to satisfy a Service Level Agreement or Objective.
Another example is Disaster Recovery (DR) — the instance in a primary datacenter is continuously mirrored to the backup datacenter. …
As introduced in our previous posts (link 1, link 2), many applications behind Walmart.com are powered by the highly scalable, distributed streaming platform Apache Kafka. Evolving at high speed, Kafka has reached a new milestone: release 0.10. With this release, Kafka and its ecosystem have reached a new level of maturity. In this post, I would like to share our recent interesting results with the Kafka 0.10 release. The next post will focus more on the streaming, big data, and Hadoop ecosystem around Kafka.
In earlier Kafka releases (before 0.8.2), consumers committed their offsets to ZooKeeper. During the last holiday season…
In our previous blog, we introduced “why” we migrated the Kafka service at Walmart from shared bare-metal machines to the new “self-serving” Kafka deployment powered by OpenStack and OneOps. Today, I would like to introduce what the Kafka ecosystem looks like “under the hood”.
The above picture does not capture all of the real-time pipelines, but aims to highlight the key components and the relationships among them.
Many top organizations have reportedly benefited from Service-Oriented Architecture (SOA), and Walmart also rebuilt its eCommerce website (walmart.com) based on SOA and elastic cloud. An important subset of SOA is message-driven architecture, which serves as a channel for asynchronous communication to decouple bundled components. The result is a more scalable and efficient architecture in which each component or service can be independently crafted and scaled out by communicating with others through the messaging platform.
The traditional message queue used to be the solution for message-driven architecture, but it has become inherently flawed when it…
Use fewer words to make a bigger impact