For Data Engineers

Kafka + FeatureBase: 3 Ways to Maximize Event Streaming to Power Real-Time Analytics

Why is Event Streaming Important?

As our world moves towards being “always-on” and real time, streaming data has become mission-critical. Streaming data consists of events that are constantly accumulating from real-time sources (like databases, sensors, devices, and applications) at a rate that’s faster than many organizations can capture, store, process, and analyze. When it comes to event streaming, an event, or an interaction that has occurred, is made up of three components: an event key, an event value, and a timestamp.

Event Example:

  • ID_1 (event key)
  • Purchased a large coffee for $2.15 from Store #123 with cash while wearing a green shirt (event value)
  • on January 31, 2021 at 7:03 a.m. (event timestamp)
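To make this concrete, here is a hedged sketch (not from the original post) of how such an event might be published to Kafka as a keyed record using the confluent-kafka Python client. The broker address, topic name, and value fields are illustrative assumptions.

import json
from confluent_kafka import Producer

# Illustrative only: the broker address, topic name, and value fields are assumptions.
producer = Producer({"bootstrap.servers": "localhost:9092"})

event_key = "ID_1"                      # event key
event_value = {                         # event value
    "item": "large coffee",
    "price": 2.15,
    "store": "Store #123",
    "payment": "cash",
    "shirt_color": "green",
}
event_timestamp_ms = 1612076580000      # 2021-01-31 07:03 UTC, in epoch milliseconds

# Kafka stores the record as (key, value, timestamp); records with the same key
# always land in the same partition of the topic.
producer.produce(
    "purchases",
    key=event_key,
    value=json.dumps(event_value),
    timestamp=event_timestamp_ms,
)
producer.flush()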

Technically, event streaming is the practice of capturing event data in real time, storing the event streams durably for later retrieval, and routing the event streams to different destination technologies as needed. Event streams become particularly important for machine-to-machine interactions where there is little or no human involvement: data is extracted from one application and served directly to another. Streaming data can also be combined with historical data or enriched with metadata to create a complete picture of a user, device, or other entity.

Real-time use cases for streaming data span industries, including tracking and monitoring inventory, collecting customer experience interactions, and immediately responding with relevant content. Other examples include continuously storing and analyzing data from IoT streams and unifying disparate data across organizational departments. The ability to capture and make use of streaming data is accelerating the applicability and relevance of machine learning initiatives.

One of the most popular tools to ingest streaming data is Apache Kafka, an open-source event streaming platform. An enterprise version, Confluent Cloud, is available as a fully managed Kafka service.

How Does Kafka Work?

For more detailed information, visit the Apache Kafka documentation. Briefly, Kafka is a distributed system consisting of servers and clients that communicate over a high-performance TCP network protocol; it can be deployed on-premises, in the cloud, or both. Kafka runs as a cluster of servers. Some servers in the cluster act as brokers that form the storage layer, while others run Kafka Connect, a tool for continuously importing and exporting data as event streams.

Kafka decouples the production and consumption of data, which eliminates bottlenecks that limit scalability and improves fault tolerance. In addition, servers can be located across multiple data centers or cloud regions and deployed dynamically to ensure continuous operations. As a result, when one server fails, other servers seamlessly step in and no data is lost.
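As a minimal sketch of how that redundancy is configured (using the confluent-kafka Python admin client; the broker address, topic name, and counts are assumptions), a topic created with a replication factor greater than one keeps copies of every partition on multiple brokers:

from confluent_kafka.admin import AdminClient, NewTopic

# Illustrative only: broker address, topic name, and counts are assumptions.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# replication_factor=3 keeps a copy of every partition on three brokers, so the
# loss of one broker loses no data; another replica takes over as leader.
futures = admin.create_topics(
    [NewTopic("purchases", num_partitions=6, replication_factor=3)]
)
for topic, future in futures.items():
    future.result()  # raises if topic creation failed
    print(f"created topic {topic}")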

Kafka combines three key capabilities that allow users to implement event streaming use cases:

  • To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.
  • To store streams of events durably and reliably for as long as you want.
  • To process streams of events as they occur or retroactively.

Streams of events are ingested and stored in topics, which you can think of as folders. Unlike traditional message queueing systems, events in a Kafka topic can be read as often as needed, and topics can be configured to discard old events whenever the user prefers. To enable scalability, Kafka topics are partitioned, or spread across several 'buckets' on different Kafka brokers, so that client applications can read and write data to and from many brokers in parallel. When a new event is published, it is appended to a partition. Events with the same event key (e.g., 'ID_1') are written to the same partition and read in the same order they were written.
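The sketch below (an illustration, not from the original post) shows a consumer replaying such a topic from the beginning; because the topic retains events, they can be read as often as needed, and records sharing a key arrive from the same partition in write order. The broker address, group id, and topic name are assumptions.

import json
from confluent_kafka import Consumer

# Illustrative only: broker address, group id, and topic name are assumptions.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "featurebase-demo",
    "auto.offset.reset": "earliest",   # replay the topic from the beginning
})
consumer.subscribe(["purchases"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        # Records sharing a key (e.g. b"ID_1") come from the same partition,
        # in the order they were written.
        print(msg.partition(), msg.key(), json.loads(msg.value()))
finally:
    consumer.close()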

Kafka Example Demo: How to Achieve 1M+ Record/Second Ingest without Sacrificing Query Latency

3 Ways to Maximize Kafka with FeatureBase

FeatureBase is a distributed, highly scalable database built on bitmaps, designed to execute analytical queries with extremely low latency regardless of throughput or query volumes. It is well suited to workloads that require real-time analytics on continuously updated data. FeatureBase aligns exceptionally well with Kafka's technology and can take full advantage of its configuration flexibility. FeatureBase transforms data into a novel format that works with Kafka's topic partitioning, pushing event stream ingestion rates to new speeds, allowing for on-the-fly schema changes, and enabling transformations at ingest. Keep reading to learn more about each scenario and how to achieve it with Kafka + FeatureBase.

1. Maximize data ingest throughput

FeatureBase provides a high degree of flexibility for typical user interactions with Kafka. FeatureBase's proprietary data format is optimized for computation and easily handles very high Kafka throughput (i.e., the amount of data ingested in a given timeframe, usually measured in records per second). At present, the Molecula team considers Kafka throughput 'high' once it exceeds 250,000 records per second, roughly the rate a security company monitoring clickstream data would reach, assuming each record is approximately 1 KB (note: record size can vary significantly across implementations). With FeatureBase, Kafka ingest rates can easily exceed 1 million records per second.

We use custom keys in Kafka to partition data the same way it’s partitioned in FeatureBase, which yields significantly better ingest efficiency without modifying Kafka’s native partitioning strategy. For example, in the following diagram, our approach to partitioning is leveraged within the yellow highlighted section.

FeatureBase determines a record's partition with a hash computed from various bits of information derived from each record. Because of this, it's possible to know where a record will land in FeatureBase before it's ever sent to a Kafka producer. Further, we improve the record's journey to FeatureBase by provisioning the correct number of Kafka partitions based on the number of FeatureBase consumers (see Fig. 2: turquoise box). Ultimately, as records flow in, a quick hash computation indicates where each record should go within Kafka's partition structure, and the record is routed to one of n partitions, as sketched below. FeatureBase, in turn, spawns the requisite number of consumers to maximize the ingestion rate and load balance across the available hardware.
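The sketch below illustrates the idea with a stand-in hash; FeatureBase's actual partitioning hash is internal, so the hash function, partition count, topic name, and record fields here are assumptions, not the real implementation.

import json
import zlib
from confluent_kafka import Producer

NUM_PARTITIONS = 6  # assumption: one Kafka partition per FeatureBase consumer

def target_partition(record_key: str) -> int:
    # Stand-in for FeatureBase's internal hash: any stable hash of the record
    # key works for the illustration, as long as producer and consumers agree.
    return zlib.crc32(record_key.encode("utf-8")) % NUM_PARTITIONS

producer = Producer({"bootstrap.servers": "localhost:9092"})

record = {"id": "ID_1", "state": "TX"}
producer.produce(
    "purchases",
    key=record["id"],
    value=json.dumps(record),
    partition=target_partition(record["id"]),  # route explicitly rather than using the default partitioner
)
producer.flush()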

2. Add Data in New Fields Rapidly

Ingestion rates are just the beginning of FeatureBase's ability to maximize Kafka event streams. Beyond how fast data comes into the system, customers are often interested in how quickly they can update custom attributes. FeatureBase can instantly add new attributes (including updates and inserts) to a data schema even when ingest rates hit 1 million records per second. With most technologies ingesting from Kafka, adding new fields or attributes to a data schema requires engineering support: lengthy timelines to modify existing preaggregation scripts, engineer new data pipelines, and apply the update to large volumes of historical data (a step that's often skipped because of the time, effort, and cost involved).

With FeatureBase, a customer can add an attribute to their data schema with a simple configuration change, provided that the new field exists in a Kafka Topic. Once configured, the attribute instantly updates across billions of records. The new attribute is immediately available for new data streaming in and across the history of your data.

Building on our Event Example above, suppose we would like to apply a small JSON message flowing through Kafka to our FeatureBase table. The message indicates that the three IDs contained in its array need the color green added under a new field named Shirt Color.
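The original post shows that message as an image; a plausible shape (field names are assumptions) might look like this:

# Assumed shape of the message described above; field names are illustrative.
shirt_color_update = {
    "ids": ["ID_1", "ID_2", "ID_3"],   # the three IDs in the array
    "field": "Shirt Color",
    "value": "green",
}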

To add this record to the FeatureBase table shown below, which currently contains IDs and the geographic states to which they belong, we simply add a new FeatureBase row and set the requisite bits for the given IDs. This operation is much less complex than updating stores built on columnar or relational models, because the same element does not need to be represented repeatedly. Although this is a simple example, as the table expands to encapsulate billions of unique IDs, the operation scales alongside it, allowing for rapid updates to dynamic tables.
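The following is a conceptual sketch only, not FeatureBase's storage format or API; it uses Python sets to stand in for bitmaps and show why adding a row is cheap:

# Conceptual sketch: each (field, value) pair is a row, stored as the set of
# record IDs that have that value. Sets stand in for bitmaps here.
table = {
    ("state", "TX"): {1, 2},
    ("state", "GA"): {3},
}

# Adding the "Shirt Color = green" attribute for the three IDs is just one new
# row; nothing stored in the existing rows has to be rewritten.
table[("shirt_color", "green")] = {1, 2, 3}

# Queries become bitwise intersections, e.g. "IDs in TX wearing green shirts":
print(table[("state", "TX")] & table[("shirt_color", "green")])   # {1, 2}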

3. Schema Registry Automatic Updates

One more way FeatureBase can use Kafka to adapt to evolving data needs is by identifying and acting on changes in underlying data schemas. A common demand among enterprise customers is the ability to alter their data stores as source data changes, keeping up with new information about their customers, processes, devices, or the target source in general. Typically, this requires manual changes and cut-over periods. With FeatureBase + Kafka, it's possible to reference a schema registry and use it to add FeatureBase fields dynamically.

FeatureBase consumers recognize changes in the upstream data schema and respond to those changes by dynamically creating and populating new FeatureBase fields.

  1. An initial Avro schema is registered with a schema registry.
  2. The schema can be altered dynamically through the registry's APIs, either by an explicit update or by pulling a version from the registry's schema history.
  3. The new field is put into JSON format, added to the schema registry, and republished (see the sketch after this list).
  4. At the same time, producers are constantly producing messages, and simultaneously checking for new schemas.
  5. The consumer's cache is allowed to build up for a few seconds, providing the time needed to act on a newly added schema once a record containing the new schema is identified.
  6. The consumers resolve the ID: schema relationship, commit the current batch, and begin writing based on the new schema.
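As a hedged sketch of steps 1 through 3, using the Schema Registry client that ships with confluent-kafka (the registry URL, subject name, and schema contents are assumptions):

from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

# Illustrative only: registry URL, subject name, and schema contents are assumptions.
registry = SchemaRegistryClient({"url": "http://localhost:8081"})

# Step 1: the initial Avro schema.
v1 = """
{
  "type": "record", "name": "Purchase",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "state", "type": "string"}
  ]
}
"""

# Steps 2-3: an evolved schema that adds the new field (with a default so old
# records still deserialize), registered under the same subject and republished.
v2 = """
{
  "type": "record", "name": "Purchase",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "state", "type": "string"},
    {"name": "shirt_color", "type": ["null", "string"], "default": null}
  ]
}
"""

registry.register_schema("purchases-value", Schema(v1, "AVRO"))
new_id = registry.register_schema("purchases-value", Schema(v2, "AVRO"))
print("new schema id:", new_id)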

Consumers can flip back and forth when two (or more) schemas are mixed within the same record batch, and ingestion rates still scale to hundreds of thousands of records per second. It's possible to refine this strategy further with intuitive partitioning, including predefined consumers that search only for new schemas. Even at high throughput, however, records distribute well among the consumers, which keep up with batch commits and schema additions in production.
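A simplified sketch of that schema-switching loop follows. It relies on Confluent's wire format, in which each Avro message starts with a magic byte and a 4-byte schema ID; the flush_batch helper is hypothetical, and this is not FeatureBase's actual consumer.

import struct
from confluent_kafka import Consumer

def schema_id(raw: bytes) -> int:
    # Confluent wire format: 1 magic byte (0) followed by a 4-byte schema ID.
    magic, sid = struct.unpack(">bI", raw[:5])
    assert magic == 0
    return sid

def flush_batch(batch):
    # Hypothetical downstream write; in practice the batch would be handed to
    # the FeatureBase ingester.
    print(f"committing {len(batch)} records")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumption
    "group.id": "featurebase-demo",          # assumption
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["purchases"])            # assumed topic name

current_id, batch = None, []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    sid = schema_id(msg.value())
    if current_id is not None and sid != current_id:
        # The schema changed mid-stream: commit the current batch, then keep
        # writing under the new schema.
        flush_batch(batch)
        consumer.commit()
        batch = []
    current_id = sid
    batch.append(msg)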

WANT TO LEARN MORE?

As you can see, while FeatureBase was not explicitly built to work with Kafka, it complements Kafka’s technology incredibly well, augmenting Kafka’s strengths and allowing users to unlock the actual value of real-time event streams. FeatureBase can provide optimized configuration settings that increase throughput, offer an easy way to update data schemas, and perform transformations at ingest so that your Kafka instance becomes even more powerful.

Get Started for Free

Open Source install commands are included below.

git clone https://github.com/FeatureBaseDB/featurebase-examples.git
cd featurebase-examples/docker-example

docker-compose -f docker-compose.yml up -d

# TIP: If needed, disable Docker Compose V2 in Docker Desktop under Settings > General.
