AI: Discover the fascinating story of FeatureBase's inception and technological advancements in this AI-transcribed interview by Kord Campbell with Patrick O'Keeffe, Vice President of Engineering at FeatureBase. Patrick shares his insights with Kord into FeatureBase's origins as a database, the inner workings of their groundbreaking technology, and the game-changing impact of FeatureBase's new dense vector support, propelling the company into the forefront of the generative AI markets.
Kord: Clearly, the AI loves hyperbole. That said, we are excited to talk about our new vector support in FeatureBase. I figured it would be fun to sit down and "interview" Pat, then run it through ChatGPT after transcribing the audio with Whisper. We hope you enjoy the results. By the way, the full audio is available below.
Kord: Oh boy, it's already running ads. Anyway, howdy Patrick!
Patrick: How ya goin'?
Kord: I'm doing great. Can you kick us off by telling the story about how FeatureBase was founded and share your personal journey in joining the company?
Patrick: Certainly! FeatureBase originated before my time with the company. Initially, there was a firm called Umbel that focused on data aggregation in the sports and entertainment domain. They had developed a solution based on traditional search technology, but it wasn't delivering the real-time website experience they desired. Recognizing the limitations, they decided to build a new platform that stored data as bitmaps to achieve significantly faster query response times. This endeavor proved highly successful, prompting the spin-off of FeatureBase as a separate entity.
As for myself, I joined FeatureBase nearly two years ago with the initial responsibility of establishing and expanding the engineering team. Additionally, I was tasked with developing the cloud offering, which was still in its early stages at the time.
Kord: So, your expertise lies in software engineering. Could you elaborate on your specific skill set and the types of software you specialize in?
Patrick: Yes, my background primarily revolves around engineering, and I have held software leadership positions for a considerable period. In my previous role, I worked at an enterprise software company called Quest Software, where I led the engineering team. Our product portfolio focused on creating tools for data professionals. One prominent tool we developed was Toad, which gained significant popularity among Oracle database users over the past 25 years.
Kord: Turning our attention to FeatureBase, what sets us apart from other storage solutions, especially from the perspective of our customers?
Patrick: Absolutely. As I mentioned earlier, bitmaps play a crucial role in our approach. While bitmap indexes are not a novel concept in computer science, we have taken a unique approach by storing the data itself as bitmaps. This technique is particularly beneficial for datasets with high or medium cardinality. By leveraging bitmaps for storage, we can perform analytical and aggregate queries with remarkable speed, even when dealing with datasets comprising millions or billions of rows. It's important to note that our capabilities go beyond just utilizing bitmaps; we offer a comprehensive solution that encompasses various innovative features.
Our approach to storing data uses a technique called Roaring, which lets us decide, for each set of data, how best to store it. We can compress certain parts for better performance or use run-length encoding when appropriate. Roaring incorporates various techniques that enable fast query times and simultaneous data updates. To facilitate efficient updates, we store these Roaring containers in a B-tree. This is particularly advantageous because bitmap databases and indexes have traditionally struggled with updates.
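To make the "selectively decide how to store" idea concrete, here is a minimal Python sketch of how a Roaring-style chunk of up to 65,536 values could pick between a sorted array, a fixed-size bitmap, and run-length encoding. The byte costs follow the published Roaring layout; the function names and the "pick the smallest" rule are assumptions for illustration, not FeatureBase's actual code.

```python
def count_runs(sorted_values):
    """Count maximal runs of consecutive integers in a sorted list."""
    runs = 0
    previous = None
    for value in sorted_values:
        if previous is None or value != previous + 1:
            runs += 1
        previous = value
    return runs


def choose_container(low_bits):
    """Pick the cheapest encoding (in bytes) for one chunk of 16-bit values.

    Hypothetical helper: real implementations apply similar cost rules,
    but the exact thresholds here are an assumption.
    """
    values = sorted(set(low_bits))
    sizes = {
        "array": 2 * len(values),            # one 16-bit entry per value
        "bitmap": 8192,                      # fixed 65,536 bits
        "run": 2 + 4 * count_runs(values),   # run count + (start, length) pairs
    }
    return min(sizes, key=sizes.get)


sparse = choose_container([1, 5, 900])           # a few scattered values
dense_run = choose_container(range(0, 60000))    # one long consecutive run
scattered = choose_container(range(0, 65536, 2)) # every other value
```

Under these assumptions, a handful of scattered values is cheapest as an array, one long consecutive range is cheapest as a run, and a dense but non-consecutive set falls back to the plain bitmap.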
Kord: How long did it take to develop the underlying storage, including cloud scalability?
Patrick: It took several years. The initial development of the technology itself didn't take too long to reach a viable stage for customers. However, building the necessary tooling, ensuring data security, and addressing operational requirements to run a database took a significant amount of time. Overall, we're talking about years.
Kord: I imagine it was quite a team effort. How large was the team?
Patrick: Our engineering team has consisted of around 20 to 30 people for the past two years. Prior to that, we had a smaller team for a few years.
Kord: So, around 70+ person-years in total?
Patrick: Yeah, a lot!
Kord: We recently implemented SQL, and I'm curious about its significance. Why should I care about SQL?
Patrick: Well, one of our goals was to ensure easy adoption. If your primary focus involves storing and retrieving data, then supporting SQL is crucial. SQL is the default language for every data professional worldwide. If you want to be easily adopted, you can't expect users to learn something new. Time is valuable, and nobody has extra time for that, right? So, one major driver for SQL implementation is to enable users to leverage their existing knowledge. By providing a SQL language layer, we facilitate reusability.
Additionally, I'd like to emphasize the importance of SQL in terms of integration. There are numerous tools available that are well-versed in SQL. If you have integration tasks where you need to transfer data from one place to another, chances are you'll be using SQL in some way, even if you have to write code. Similarly, when querying data or connecting a dashboard, SQL becomes a fundamental component. So, another significant advantage of SQL is its role in ensuring ease of integration.
Lastly, we looked inwardly and noticed several disjointed APIs and methods within our system. By unifying everything under a SQL language layer, we addressed this issue. Our key objective is to enable users to perform any action in FeatureBase using SQL. Whether it's retrieving or storing data, querying internal state, or managing internal state, SQL should be capable of handling it all.
Kord: Yeah, that makes sense. We briefly touched on B-trees and Roaring, but I'm curious, why is FeatureBase so fast?
Patrick: It's a bit challenging to explain in a single sentence, but let's try visualizing it this way. Imagine you have a table or dataset with three columns: a key column (a numeric identifier), an integer column, and a string column. Now, let's say you populate this table with a million rows. In a traditional store, let's assume a row store for simplicity, each row's bytes would consist of the key value, the integer value, and the string value. Even if various database vendors employ technologies to compress the data, ultimately you would end up with a million rows, each occupying a hundred bytes in memory or on disk.
However, FeatureBase takes a different approach. Consider the key value as an offset into a bitmap. Here's where interesting things start to happen. Instead of storing the entire string value repeatedly, we utilize a technique called dictionary compression. This means storing the string value once and getting back a smaller corresponding number. Moreover, this number can be efficiently stored in the bitmap. Consequently, even a million rows of string data get significantly compressed. When you need to query or reason over this data, instead of reading each value and performing comparisons, you can leverage bitmap operations, leading to considerably faster processing. Additionally, since the data is stored in a minimal form, you don't have to access numerous memory locations or disk pages to retrieve the required data. This is why we achieve speed in handling such types of queries.
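The description above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not FeatureBase's implementation: Python integers stand in for bitmaps, the row key is the bit offset, and the sample table, column names, and helper functions are all hypothetical.

```python
from collections import defaultdict


def build_bitmaps(rows):
    """Dictionary-compress a column and store it as one bitmap per value.

    rows is a list of (key, value) pairs. Each distinct string is stored
    once and mapped to a small integer id; bit `key` of that id's bitmap
    is set when the row with that key holds the value.
    """
    dictionary = {}              # string -> small integer id
    bitmaps = defaultdict(int)   # id -> bitmap (a Python int)
    for key, value in rows:
        value_id = dictionary.setdefault(value, len(dictionary))
        bitmaps[value_id] |= 1 << key
    return dictionary, bitmaps


def keys_of(bitmap):
    """List the row keys whose bits are set in a bitmap."""
    return [k for k in range(bitmap.bit_length()) if (bitmap >> k) & 1]


# Two hypothetical string columns over the same five row keys.
city_dict, city_bits = build_bitmaps(
    [(0, "austin"), (1, "boston"), (2, "austin"), (3, "denver"), (4, "austin")]
)
plan_dict, plan_bits = build_bitmaps(
    [(0, "pro"), (1, "pro"), (2, "free"), (3, "free"), (4, "pro")]
)

# WHERE city = 'austin' AND plan = 'pro' becomes a single bitwise AND.
austin = city_bits[city_dict["austin"]]
pro = plan_bits[plan_dict["pro"]]
matches = austin & pro

matching_keys = keys_of(matches)          # [0, 4]
match_count = bin(matches).count("1")     # 2
```

No string is compared at query time: the filter touches two integers, and a count is just the number of set bits in the result.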
Kord: So, if I understand correctly, if I store an integer value and need to perform operations on specific strings associated with it, I'll only be operating on a small amount of data. By utilizing the bitmap and its speed, I can efficiently filter the data in advance and retrieve the desired string lookups.
Kord: So, can you explain what vectors are and how can I obtain one?
Patrick: Ah, vectors, they're like magic. <laughs> In simple terms, a vector is an array of floating-point numbers, just like an array of floats in whatever programming language you prefer. The key difference is that a vector typically has a declared dimension, or length. For instance, a vector with a dimensionality of 3 that represents a point in Cartesian space is an array of three floating-point numbers. That's what a vector is at its core.
Now, in the realm of AI, vectors play a fascinating role. They provide a convenient way of representing various things, such as images or pieces of text, as a series of numbers. Once you have these numbers, you can perform a multitude of operations. For example, you can compare how similar two things are by comparing their respective vectors in a specific manner.
Kord: So, if I understand correctly, vectors are generated through different inference steps of algorithms or models. Let's say I send an image to a visual model, and it outputs a vector, right?
Patrick: Yes, indeed. Vectors can be obtained from various sources. Going back to the AI and machine learning domain, they are widely used. However, you can get vectors from different observations as well. For instance, you can measure the position of an object in 3D space, resulting in a vector with a dimensionality of 3. Furthermore, you can obtain vectors through machine learning model inference. For example, you can send an image to an API and ask it to provide you with a series of numbers. In the case of OpenAI's publicly available embeddings API, if you give it a piece of text, it will return a dense vector of 1536 floating-point numbers that represents that text.
Kord: Ah, a dense vector.
Patrick: Yes, to represent that piece of text, for example.
Kord: So, does FeatureBase support vectors?
Patrick: Yes, it does. We've discussed the storage aspect a lot, including the use of bitmaps. However, there are cases where storing certain types of string data in bitmaps might not be the most efficient, especially if you need to perform fast point lookups or comparison operations. In such scenarios, we can store string values or vectors in a slightly different manner, while still using a B-tree underneath the data storage layer. As a result, our storage engine can now store vectors, which again are just arrays of floating-point numbers of varying lengths.
Kord: That's interesting. So, can we combine regular field operations with vector queries? Can we compare one vector to another?
Patrick: Absolutely. Since vectors are just another column in a table for us, you have the same capabilities as with regular FeatureBase fields. You can filter and run queries against the table to retrieve vectors or columns containing vectors of interest. You can perform standard similarity calculations like cosine or Euclidean distance on two vectors. Mathematical operations can be applied to them, and you can iterate through their values as if they were arrays. Currently, we don't have an approximate nearest neighbors search feature available to customers, but it will be added soon.
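As a rough illustration of combining a regular field filter with a vector comparison, here is a minimal Python sketch. The table, the field names (`kind`, `embedding`), and the helper functions are assumptions made for the example; they are not FeatureBase's API, and the sketch uses exact comparison rather than approximate nearest neighbors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# A hypothetical table: one ordinary column plus one vector column.
table = [
    {"id": 1, "kind": "doc", "embedding": [1.0, 0.0, 0.0]},
    {"id": 2, "kind": "doc", "embedding": [0.6, 0.8, 0.0]},
    {"id": 3, "kind": "image", "embedding": [1.0, 0.0, 0.0]},
]

query = [1.0, 0.0, 0.0]

# Filter on the regular field first, then rank survivors by similarity.
docs = [row for row in table if row["kind"] == "doc"]
ranked = sorted(
    docs,
    key=lambda row: cosine_similarity(row["embedding"], query),
    reverse=True,
)
ranked_ids = [row["id"] for row in ranked]
```

Filtering before scoring keeps the similarity math limited to the rows that already match the ordinary predicates, which is the pattern the answer above describes.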
Kord: That's great news. So, it means that, like other vector engines in the market that store dense vectors for large language models, like OpenAI's GPT-4 or models like BERT, Claude from Anthropic, or Google's Bard, we can encode those vectors and store them and perform operations on them, right?
Patrick: Absolutely, you're correct.
Kord: Fantastic. So, will we be adding this capability to our cloud offering in the near future?
Patrick: Yes, indeed. Within the next few weeks, customers will have access to this capability and can start utilizing it.
Kord: That's great to hear.
Stay tuned for more announcements around our support for vector search and your favorite language models!