Bitmaps: Making Real-Time Analytics... Real

Posit: Current analytics solutions suck—the typical analytics stack is a dizzying assortment of tools and databases trying to meet all the various needs a typical organization has. We need to rethink our databases from the storage engine up, which is exactly what we have done with FeatureBase. FeatureBase is a real-time database built entirely on bitmaps. It is designed primarily for speed and horizontal scalability, and is particularly well-suited for workloads that require many real-time updates, inserts, and deletes on massive datasets.

To be clear, analytics, as far as we're concerned, is a broad categorization of database workloads which involve answering questions that deal with large portions of a data set.

Analytics: What's the average age of all my customers?

Not Analytics: Jane just bought some shoes, insert this transaction into the DB.

Analytics: Give me a breakdown of sales by state and season.

Not Analytics: Fetch me Jane's address.

When you have a small static data set, analytics is fun! You can ask all sorts of intricate questions and it doesn't really matter what database you use, most of them are happy to answer your questions faster than you can blink.

Throw a few zeros on the data size, though, and things start to get less... fun. Ever been unsure if the quarterly report will finish running before your boss wants it in the morning? And if it does so in Q1, will it still in Q2?

When this happens, people start to employ a variety of tricks. These tricks have actually evolved into a whole discipline called "Data Engineering". Data engineers are modern day cowboys, skilled at manipulating large herds of machines to process massive amounts of data in parallel. Not necessarily to actually *do* the analytics mind you, but to get the data cleaned, reformatted, pre-aggregated, and ultimately into the right kind of data store for analysis to be done. This all gets especially tricky when the data comes to life with a large volume of updates, inserts, and deletes.

There are a bevy of tricks wrapped up in a modern data pipeline depending on the needs of the end user. For some applications, queries have to return fast, some prize flexibility at the expense of speed, some are highly cost sensitive, and some need data to be very fresh: ready to be analyzed almost as soon as it's generated.

At FeatureBase, we believe that the current generation of tools and tricks for doing analytics aren't good enough. No technology currently exists whose trade-offs aren't too punishing in some dimension. Either queries are too slow, or data is too stale, or it's too expensive, or it simply can't keep up with the torrents of ever-shifting data that pour into our systems. Then to make things worse, we have to copy and move data into staging tables, other databases, SaaS platforms, and through a myriad of other systems creating a brittle and complex gap between our raw data and the application specific data. The existing infrastructure for analytics is broken.

A new approach is necessary to overcome the fundamental limitations of analytics today and make things more efficient at a deep level. We've been investing heavily in bitmap-based data storage for nearly a decade now, and we are confident that it will completely reshape the modern analytics stack.

Basics of Bitmaps

Once only a niche indexing technology suitable for specific workloads on low-cardinality fields, bitmaps have come a long way in the last decade. Let's dive into what a bitmap-based database storage engine looks like, and why we believe it has fundamental advantages that are applicable to most analytics workloads given a sufficiently powerful bitmap-aware query planner and optimizer.

To understand why bitmaps are such a powerful building block for an analytical system, we'll have to talk a little bit about how a typical data set is actually represented.

Let's use a simple example table:

In a typical transactional database, the data would be laid out on disk much like it is presented above. Each row is written in its entirety in a particular place and one row is appended to another.

In a traditional columnar database or data format (the usual choice for analytics), the data would be laid out more like this:

There is a problem here, however, in that each user can have multiple hobbies. Most databases would not represent this within the column, but with a separate table which contained the many-to-many relation like so:

This means that to query information about hobbies, you have to do a JOIN between the original table and the hobbies table which can often have a significant performance impact. 

However, when we represent the data using bitmaps, things look very different. Similar to columnar, we store each column separately–here's what hobbies would look like:

It's very natural and efficient to represent a whole set of values being associated with one record using bitmaps. Here's how favorite ice cream flavor looks in the bitmap representation:

The two *distinct values* ('chocolate, 'vanilla') are represented separately, and each has a bitmap. The length of the bitmap is the total number of records in the table. Each user_id corresponds to a position in the bitmap. Looking at the bitmap for chocolate, we see that user_id 0 has a favorite flavor of chocolate because the bit is '1', while user_ids 1 and 2 do not have a favorite flavor of chocolate because those bits are 0.

Try to mentally match up the bitmap representations with the data in the original table... it can take some practice, but once you start to think about data this way, you can see why analytical queries can be so much faster.

The important thing to realize is that laying out data in this way is different from both the familiar row-oriented layout, *and* the columnar layout. We are able to scan only the data about which users like chocolate without looking at all the favorite ice cream flavor data. Furthermore, that data is stored in an extremely compact representation, which is cache-friendly in terms of its access patterns.

So far, we've only scratched the surface: there are different techniques for storing large integer or fixed-point decimal values in bitmaps. There are a lot of tricks employed in FeatureBase to make ingest and updates efficient, sharding for datasets with many records, and use of B-Trees to make access extremely fast for higher cardinality columns. We'll get into these and more in future posts, but right now I want to focus on some higher level thoughts.

Benefits of Bitmaps

Why do we think that this niche indexing technology is the right foundation for general purpose analytics? Certainly there are a lot of hurdles to overcome, but there are good reasons to believe that this is a very promising path.

1. Smaller data footprint.

Ingesting data from a naive format like CSV, we often see a 10x reduction in data footprint. This is due to a couple factors, one is that we only store each unique value that exists within a given column once, and then we map that value to a bitmap. The bitmap itself uses a modified version of a compression scheme called roaring, which can take advantage of both very sparse and very dense data sets and represent them very efficiently.

Saving money on storage is nice, but the bigger impact on performance is less I/O load during queries. This is also helped by the next point...

2. More granular access.

Queries have to scan, not whole tables, nor whole columns, but only the particular values they are interested in. If you want to know how many records in a table have favorite ice cream flavor = chocolate, you scan the chocolate value, not the whole ice cream flavor column. Furthermore, the whole domain of a column (the set of unique values present in that column) is stored separately from the data that represents which records have which values. This means the query planner can be very selective about what storage it reads based on the needs of a particular query.

3. Easily vectorized, CPU-friendly query processing.

Using bitmaps means that many queries can effectively process dozens of records in a *single CPU instruction*. A logical AND of two 64 bit numbers tells you which of 64 records have both of 2 particular values. A single POPCNT instruction tells you how many of a set of 64 records have a particular value. Processors with wider instructions can do even better.

4. Perfect for streaming updates.

Data is always changing and doesn't always arrive clean, complete, and in the right order.. FeatureBase can achieve updates and deletes at high throughput without having to separate reads and writes. Bitmaps, historically, are not known for being amenable to updates, partially because they are used as an auxiliary structure that must be updated in addition to the main data representation, and partially because the nature of compressed bitmaps means that a small change can necessitate rewriting a large amount of data. FeatureBase solves this by (1) *only* storing the indexed version of the data... there is no other data to update, and (2) throwing an incredible amount of engineering effort into making a not-so-update-friendly data structure amenable to updates. We'll dive deep on #2 in future posts, but suffice it to say that FeatureBase has been designed to be updatable on the fly from its very early days. Being able to do fast updates has huge implications for large, continuously changing datasets in efficiency, storage footprint, and simplicity. The ability to consider fresh streaming data in the context of all your historical data at interactive query speeds will open up use cases that many have never even considered.

5. Fast.

Trust me, analytics on bitmaps is as fast as it gets. Download, try, or join the FeatureBase Community to see bitmaps in action.

So in conclusion, to get to true real-time analytics, we can't just get better at scaling up or scaling out. Current technologies now excel at allowing users to spend more in order to get better performance, but the next generation of technologies needs to break efficiency barriers and open up new use cases. Scaling out works to a point, but the more machines you have, the higher the floor is on the latencies you can practically achieve. A more efficient technology is needed, and we believe that is an OLAP database built on bitmaps! Read our Bitmap Manifesto to learn more about us.

Interested in seeing firsthand what bitmaps can do for analytical workloads? Download FeatureBase today!