Data aggregation is the process of gathering raw data and presenting it in a summarized form for statistical analysis. Typically the summary is generated on the fly by a (often slow) user query, but it can also be computed ahead of time by denormalizing the data: running the query preemptively and building a dedicated dataset for a specific query or use case. This may sound simple, but in practice it is often extremely complex (and highly inefficient).
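To make the distinction concrete, here is a minimal sketch of the two approaches, using Python's built-in sqlite3 module and hypothetical table and column names (events, region, amount) that are not part of any real system:

```python
# Minimal sketch: the same summary computed dynamically at query time versus
# persisted ahead of time as a denormalized (pre-aggregated) table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO events (region, amount) VALUES
        ('us-east', 10.0), ('us-east', 25.0), ('eu-west', 7.5);
""")

# 1) Dynamic aggregation: the summary is computed at query time, every time.
dynamic = conn.execute(
    "SELECT region, SUM(amount) AS total FROM events GROUP BY region"
).fetchall()

# 2) Pre-aggregation (denormalization): the same query is run ahead of time
#    and its result is persisted as a separate, duplicated dataset.
conn.executescript("""
    CREATE TABLE sales_by_region AS
        SELECT region, SUM(amount) AS total FROM events GROUP BY region;
""")
preaggregated = conn.execute("SELECT * FROM sales_by_region").fetchall()

print(dynamic)         # computed on demand
print(preaggregated)   # stale as soon as new events arrive
```

The second table answers its one question quickly, but it is a copy of the data that must now be refreshed by its own pipeline.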
For decades, enterprise organizations have built their data infrastructures with the help of aggregates. Aggregates improve query response times, but they rely on duplicating datasets, which greatly increases the overhead and complexity of data pipelines and data governance.
Aggregates create a number of problems. The direct cost of increasing query responsiveness with pre-aggregation is complexity. Every pre-aggregation means a duplication of data, which consumes more resources, and another process in the data pipeline that can halt the business if it fails. Worst of all is the hidden truth: data, once in use, is never deleted, so the complexity and risk only compound.
Aggregations also force users to lock in, up front, exactly which data they will be able to view in the materialized table(s). They assume that the proper aggregations for numerous use cases can be predicted in advance, and that doing so will prevent data bottlenecks, but this is rarely the case in practice, as the sketch below illustrates.
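The following sketch shows the lock-in problem with the same kind of hypothetical tables as above (events, sales_by_region): a summary table built for one question cannot answer a question it did not anticipate, so a new question means a new table and a new pipeline.

```python
# Sketch of the "lock-in" problem: a pre-aggregated table can only answer the
# questions it was built for. Table and column names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (region TEXT, day TEXT, amount REAL);
    INSERT INTO events VALUES
        ('us-east', '2023-01-01', 10.0),
        ('us-east', '2023-01-02', 25.0),
        ('eu-west', '2023-01-01', 7.5);

    -- Pre-aggregated for the question that was anticipated: totals by region.
    CREATE TABLE sales_by_region AS
        SELECT region, SUM(amount) AS total FROM events GROUP BY region;
""")

# A new question arrives: totals by region AND day. The 'day' dimension was
# aggregated away, so the summary table cannot answer it...
try:
    conn.execute(
        "SELECT region, day, SUM(total) FROM sales_by_region GROUP BY region, day"
    )
except sqlite3.OperationalError as err:
    print("summary table cannot answer the new question:", err)

# ...which means another pipeline, another table, and more duplicated data.
conn.executescript("""
    CREATE TABLE sales_by_region_day AS
        SELECT region, day, SUM(amount) AS total FROM events GROUP BY region, day;
""")
```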
So what does the life cycle of a typical data aggregation look like? It begins with a business unit or data science department requesting a new dataset with certain fields from IT. IT builds a new processing pipeline to copy existing data into the new dataset. In savvy organizations, this whole process might take just 12 hours; in others, it might take months, incurring development and opportunity costs with no business value to show for them. In both cases, the results are the same: high personnel and infrastructure costs, all for the prize of stale data.
In Figure 1, we show an example data warehouse and the aggregations between tables that occur in order to produce materialized views of the relevant data. With each step of aggregation, costs grow exponentially, while the value derived from the data grows only linearly.
As long as data aggregation is used as the means of preparing data, we will never achieve real-time access to it, and machine learning initiatives (which often require predictions to be served in real time) will be stunted.
One of the superpowers of FeatureBase is its ability to map data in ways that are hyper-optimized for both analytical workloads and statistical computations. Major performance improvements come from how data is mapped and how the feature table is structured in FeatureBase.
FeatureBase’s multi-valued fields offer unique capabilities that can deliver exceptional performance for queries that are otherwise difficult to serve.
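As a rough illustration of the idea (in plain Python rather than FeatureBase's own query syntax, and with made-up users/traits data), a multi-valued field turns a "which users have both trait a and trait b" question from a three-table join into a containment check on a single record:

```python
# Illustration (not FeatureBase syntax) of why a multi-valued field can replace
# a join: the question becomes a set-containment check on one record instead of
# a multi-table join.

# Normalized layout: three "tables" that must be joined at query time.
users = [{"id": 1}, {"id": 2}]
traits = [{"id": 10, "name": "a"}, {"id": 11, "name": "b"}]
user_traits = [(1, 10), (1, 11), (2, 10)]  # join table

def users_with_traits_join(wanted):
    """Join-style answer: collect each user's traits, then filter."""
    trait_names = {t["id"]: t["name"] for t in traits}
    per_user = {}
    for user_id, trait_id in user_traits:
        per_user.setdefault(user_id, set()).add(trait_names[trait_id])
    return [u for u, names in per_user.items() if wanted <= names]

# Multi-valued layout: one record per user, traits stored as a set-valued field.
users_mv = [
    {"id": 1, "traits": {"a", "b"}},
    {"id": 2, "traits": {"a"}},
]

def users_with_traits_mv(wanted):
    """Single-table answer: a containment test per record, no join."""
    return [u["id"] for u in users_mv if wanted <= u["traits"]]

print(users_with_traits_join({"a", "b"}))  # [1]
print(users_with_traits_mv({"a", "b"}))    # [1]
```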
Through its unique data mapping capabilities, FeatureBase simplifies the tables being queried. Its format allows it to combine many traditional SQL tables into a single table and handle complex queries at low latency, reducing the need for any form of pre-aggregation and creating major efficiencies in the process.
Pre-aggregation is a solution rooted in an inherently human-centric view of data; it is the way of the past. It relies on adding new moving parts to an already complex process and places the weight of business value on hard-to-hire data engineering personnel. As we move toward the implementation of artificial intelligence, it is time to start designing datasets with models (not people) in mind. This can be a difficult mindset to adopt, but FeatureBase is the user- and machine-friendly database for your data science endeavors.