For Data Engineers

Star Schema Benchmark: How to Model Multidimensional Data in FeatureBase

Column-oriented databases are no longer enough to power today’s massive analytical workloads at the speed required. In this blog, learn how to model multidimensional data in FeatureBase, a feature-oriented database platform that powers real-time analytics and machine learning applications by simultaneously executing low-latency, high-throughput, and highly concurrent workloads.

FeatureBase revolutionizes data analytics with a feature-oriented format that makes each distinct data value individually addressable. In FeatureBase, data is denormalized into a feature-oriented format that compresses data up to 100x smaller than the original source. This unique approach minimizes I/O on queries by allowing the database engine to read and write exactly the data it needs and intelligently compress that data in memory without copying data. FeatureBase is ACID compliant, assuring no loss of data due to compaction, storage, and retrieval techniques.

How to model multidimensional data in FeatureBase | Star Schema Benchmark

This blog walks you through how to model multidimensional data in FeatureBase using data from the well-known Star Schema Benchmark (SSB). As you might guess, this dataset is modeled in a “star schema,” which has traditionally been a popular way to model event data in analytical databases. At a high level, a “star schema” is comprised of a fact table that references multiple dimension tables. The fact table is often a denormalized SQL table of static events that an organization wants to analyze. The dimension tables provide information related to events such as people, products, and places in order to filter, label, or group facts.

In the SSB dataset, there is one fact table called LINEORDER. Each record is uniquely identified by an orderkey and linenumber. In this fact table, there are five foreign key columns that reference four dimension tables: customer, part, supplier, dwdate. Using the foreign keys, the LINEORDER table can be analyzed or used from multiple angles to achieve a variety of business needs. The data model is depicted in Figure 1.

how to model data using FeatureBase with star schema benchmark dataset

Figure 1. Star Schema Benchmark fact and dimension data depicted in a Star Schema model (Source: AWS)

Relational tables like those in Figure 1 are mapped to entity-based tables in FeatureBase and only features are extracted and stored. Data may be ingested into FeatureBase in multiple ways to achieve a specific analytical use case. Most customers see significant data footprint reduction with FeatureBase even when data are ingested in numerous ways due to the feature-oriented format. In FeatureBase, fields from the relational model may be mapped to a single table.  

Organizing the data into multiple FeatureTables will result in a wider and more flexible set of use cases (Figure 2). For example, a distribution organization is typically concerned with various entities – customers, suppliers, orders, etc. – and may want to analyze each in different ways. A FeatureBase table may need to be created for each entity of interest. On the other hand, an organization that is purely customer-focused may not need data broken out into multiple entities.

star schema benchkmark example using FeatureBase feature-oriented format

Figure 2. Multiple entity-based tables to solve a wide variety of use cases

Its feature-oriented format allows FeatureBase to intelligently collapse data schemas to eliminate computation-heavy JOINs. Even when multiple entity-based tables and JOINs are needed, query performance is not lowered. FeatureBase supports multiple data types, most of which are familiar to data practitioners like string, integer, timestamp, and boolean. Set fields, or multi-valued fields, are one way that FeatureBase can collapse traditional data models.

How to model multidimensional data in FeatureBase | Set Fields

A special power of FeatureBase is that it allows fields to be multi-valued, or set fields. Set Fields allow FeatureBase to collapse traditional data models, like the star schema, by efficiently storing multiple values for a single field. A many-to-one mapping in a traditional data model compared to a tabular view of a multi-valued set field in FeatureBase is shown in Figure 3. Using the SSB data as an example, all parts associated with a single supplier could be stored in one cell.

The set field capability allows FeatureBase to collapse multi-dimensional data models for more efficient storage and querying. Unlike other databases, FeatureBase can interact with each element in the multi-valued field without parsing from an underlying array. This unique ability allows users to add course-grain timestamps to each element in the multi-valued field independently of each other or to add or remove associations from each element without needing to interact with object arrays.

How to Model Multidimensional Data in FeatureBase 

Figure 3. Many-to-One mapping in a traditional data model and compressed in FeatureBase using a Set Field.

How to model multidimensional data in FeatureBase | Streaming data

FeatureBase supports streaming data that can be seamlessly combined with historical data. FeatureBase employs a flexible data model. As long as new attributes are mapped to an entity, adding or removing fields from a schema does not disturb existing pipelines and downstream applications.

Furthermore, FeatureBase can quickly perform ‘upserts on the fly’ to insert or update records as they stream in at speeds over 1 million records per second without the need for preprocessing. This benefit ensures data is fresh and available to query within seconds of ingestion. Additionally, ingestion and query resources are decoupled, eliminating potential tradeoffs between throughput and latency that may occur when the two types of jobs compete for the same pool of resources.

What’s Next?

Molecula FeatureBase is a real-time database powerful enough for even the most extensive production workloads. FeatureBase uses a feature-oriented format to intelligently compress data into its smallest addressable units to minimize I/O and yield 10-100x price-performance improvements over columnar databases.

Get Started for Free

Open Source install commands are included below.

open source CODESTART CLOUD TRIAL

git clone https://github.com/FeatureBaseDB/featurebase-examples.git
cd featurebase-examples/docker-example

docker-compose -f docker-compose.yml up -d

# TIP: Disable Docker compose v2 if needed by going to settings..general in Docker Desktop.

git clone https://github.com/FeatureBaseDB/featurebase-examples.git
cd featurebase-examples/docker-example

docker-compose -f docker-compose.yml up -d

# TIP: Disable Docker compose v2 if needed by going to settings..general in Docker Desktop.