#1 Best Database to Combine Streaming and Historical Data

Your Stream is Only Half the Story: Data Requires Context (in Real Time)

Struggling to combine streaming and historical data to enable real-time analytics? You’ve come to the right place.
Streaming data is here to stay. We all know that by now. But we might disagree when it comes to its power. Although it has enabled fantastic innovation, streaming data alone does not provide a complete picture of a user, a device, or any other entity. Over the last couple of years (mirroring the rise of Apache Kafka/Confluent), there's been an increase in companies that specifically enable analytics on your data stream. Cool. Problem solved. Streaming data equals real-time data, right?

Not so much.

Streaming data is a massive step forward in real-time processing and analysis, but it's only half of the story. Today, streaming data is essentially worthless if it's not combined with historical (batch-oriented) data. The reality is that companies are still building robust streams, only to batch or cache them in order to combine streaming and historical data. If this seems silly, you've likely never dealt with data pipelines or data schemas. Historical, batch-oriented data is the context that underlies streaming data, but in practice, combining the two can be extremely challenging.

During last month's Data Council event here in Austin, one of the major themes threaded through many of the talks was data context. In essence, any amount of data is worthless without context. Of course, we all agree on that when it comes to metadata (i.e., noting where data came from, how it's been used, how it's been transformed), but there's another level: streaming data is nothing without the contextual augmentation of historical data.

The Stream-Oriented Landscape

In the last 3 to 4 years, there’s been an explosion in companies that help you get value out of your streaming data. Case in point:

Materialize:

Materialize is a streaming SQL engine that lets you define and cache materialized views directly on top of your data streams, so repeated queries read precomputed results instead of recomputing over the stream, reducing compute costs.

Rockset:

Rockset is a powerful real-time analytics database built around its Converged Index™, but it lacks a critical capability: the ability to combine streaming and historical data live within a single collection.

The Problem with Most Databases:

Today, most of our databases are built for niche workloads: time series, graph, triple-indexing, or column-oriented storage with materialized views. All of these approaches offer little access to fresh data. Furthermore, they sacrifice data accuracy and quality to achieve super-low query latency. Reducing latency is excellent if you aren't concerned about data accuracy or flexibility (i.e., the ability to update, insert, or change your views of data as needed). Essentially, it works if you're still using data to feed batch-oriented dashboards.

This strategy is not ideal for real-time prediction. Predictive analysis is about discovering trends and patterns in data and then serving accurate, up-to-date decisions in real time (in an ad, a content module, a price, etc.).

To build predictive models, you must feed data into them at the same granularity at which predictions will be made. For example, suppose your business wants to forecast daily sales. In that case, the daily sales prediction model must be built with daily-level data. Unfortunately, storing and governing data at a daily (or finer) grain has historically been challenging, depending on the industry.

Additionally, if your business wants to predict monthly sales, a new model built with monthly-level data is necessary. You COULD sum the predicted daily sales to estimate a month; however, that estimate will not be as accurate as a dedicated monthly prediction.
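
To make the granularity point concrete, here is a minimal sketch using pandas (the column names ts and amount, and the sample values, are hypothetical) that prepares the same event-level data once for a daily model and once for a monthly model:

import pandas as pd

# Hypothetical event-level sales records; in practice these would come
# from your stream or warehouse, not be hard-coded.
events = pd.DataFrame({
    "ts": pd.to_datetime([
        "2022-01-03 09:15", "2022-01-03 17:40",
        "2022-01-04 11:02", "2022-02-01 08:30",
    ]),
    "amount": [120.00, 75.50, 210.00, 99.90],
})

# A daily model trains on daily aggregates...
daily = events.resample("D", on="ts")["amount"].sum()

# ...while a monthly model needs its own monthly aggregates ("MS" = month start).
# Summing daily *predictions* to approximate a month compounds per-day error,
# which is why a dedicated monthly model is usually more accurate.
monthly = events.resample("MS", on="ts")["amount"].sum()

Both training sets come from the same events; only the grain changes, which is exactly why the underlying store needs to keep fine-grained history accessible.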

Streaming Dataset Limitations:

Streaming big data by itself means nothing. What good is a transaction if you can’t also see the demographic, geographic, or (any)graphic data of the person at the center of that transaction? Of course, history is critical – but we’ve known how to deal with “historical” data through batch processes for decades, so that’s no longer a concern…right?

And if that’s the case, there’s no world in which we would need to (dare we say) *batch* or *cache* streaming data for the sole purpose of combining it with additional datasets, right?

Molecula FeatureBase: The Real-Time Database for Streaming & Historical Data

Batch is dead. Or so they say. Either way, we've all moved into a world where everything is expected on demand. At Molecula, we're full believers in real-time data access. That might make you think we're 100% devoted to streaming, but in reality, we're advocates of an accurate, interpretable, real-time data access layer. What does this mean? It incorporates streaming data, of course, but it takes data streams one step further: combining them in real time with any historical data, and applying updates or inserts to a historical dataset in real time so it stays aligned with the streams.

Figure 1. FeatureBase architecture
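
Conceptually, such an access layer behaves like an upsertable table: each streamed record either updates the matching historical row in place or inserts a new one, so queries always see stream and history together. Here is a toy sketch in Python (the table, key, and field names are all hypothetical):

# Toy model of a real-time access layer: historical rows keyed by user id,
# with stream events upserted as they arrive. All names are hypothetical.
historical = {
    "u42": {"country": "US", "lifetime_spend": 1200.00},
}

def apply_event(event):
    # Upsert: update the existing row in place, or insert a fresh one,
    # instead of batching events for a later reload.
    row = historical.setdefault(
        event["user_id"], {"country": None, "lifetime_spend": 0.0}
    )
    row["lifetime_spend"] += event["amount"]
    if event.get("country"):
        row["country"] = event["country"]

apply_event({"user_id": "u42", "amount": 19.99, "country": "US"})
print(historical["u42"])  # queries see history already updated by the stream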

Real World Use Case: The Benefits of Combining Streaming and Historical Data

A financial services company building fraud models was thrilled by the idea of live-streaming transactions via Kafka and analyzing those streams for fraud. Unfortunately, those transaction streams required additional context: they needed to be joined (or merged) with historical data, such as a user table, in real time to determine whether a transaction was fraudulent.
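
Here is a minimal sketch of the kind of stream-side enrichment they needed, using the kafka-python client; the topic name, field names, and in-memory user lookup are all hypothetical, and in production the user table would be a large, constantly changing dataset rather than a dict, which is precisely the hard part:

import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical historical context: a user table keyed by user id.
users = {
    "u42": {"home_country": "US", "avg_txn_amount": 38.50},
}

consumer = KafkaConsumer(
    "transactions",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    txn = msg.value
    profile = users.get(txn["user_id"])
    if profile is None:
        continue  # no historical context, no decision
    # The join that matters: the event alone says little; combined with
    # the user's history, a simple rule (or model feature) emerges.
    suspicious = (
        txn["country"] != profile["home_country"]
        or txn["amount"] > 10 * profile["avg_txn_amount"]
    )
    if suspicious:
        print(f"flag transaction {txn['id']} for review")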

This organization did not have access to a technology that could merge their transaction streams with their user data in real time. As a result, they cached their event streams and batch-loaded them into a database to analyze alongside the user table. Regrettably, this step introduced a 24-hour delay in their ability to detect fraud.

In 2021, consumers lost a whopping $5.8 billion to fraud. Instead of waiting 24+ hours, imagine the difference if the financial services industry could combine their transactional streaming and historical data to detect fraud within seconds! That difference is made possible with FeatureBase.

Get Started for Free

Open Source install commands are included below.


git clone https://github.com/FeatureBaseDB/featurebase-examples.git
cd featurebase-examples/docker-example

docker-compose -f docker-compose.yml up -d

# TIP: If needed, disable Docker Compose V2 under Settings > General in Docker Desktop.
