For Data Engineers

Billion Taxi Ride Dataset

FeatureBase was originally built for a very specific use case: arbitrary audience segmentation across hundreds of millions of people with tens of millions of attributes. As we built out the product, a natural first step was to test its efficacy on other types of problems.

While the user segmentation use case has a very high number of attributes, the data is extremely sparse—most attributes are only associated with a handful of people, and most individuals only have a few hundred or a few thousand attributes. FeatureBase handles this type of data gracefully; in just milliseconds, one can choose any boolean combination of the millions of attributes and find the segment of users which satisfies that combination.
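To make this concrete, here is a minimal sketch in Python that uses plain sets of user IDs as stand-ins for Pilosa's per-attribute bitmaps; the attribute names and IDs are invented for illustration, not taken from any real dataset:

attributes = {
    "likes_jazz":     {1, 4, 7, 42},
    "owns_dog":       {4, 7, 9},
    "visited_austin": {2, 4, 42},
}

# "likes_jazz AND owns_dog AND NOT visited_austin" becomes plain set algebra;
# the real index evaluates the same logic as bitwise operations over compressed bitmaps.
segment = (attributes["likes_jazz"] & attributes["owns_dog"]) - attributes["visited_austin"]
print(sorted(segment))  # [7]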

Image courtesy of Umbel

For audience segmentation, we’d happily pit Pilosa against anything out there on similar hardware. However, if we want Pilosa to be an index that serves as a general-purpose query acceleration layer over other data stores, it will have to be effective at more than just segmentation queries, and on more than just sparse, high-cardinality data. To test Pilosa, we needed a dataset with dense, lower-cardinality data, ideally one that had already been explored with other solutions so we could get a feel for how we might fare against them. The billion taxi ride dataset fit the bill perfectly. There are myriad blog posts analyzing the dataset with various technologies; here are just a few:

Of particular use to us is the series of posts by Mark Litwintschik and the performance comparison table he compiled. Here is the table for reference:

We implemented the same four queries against FeatureBase so that we could compare its performance with the other solutions.

Here is the English description of each query; a rough code sketch of the same aggregations follows the list:

  1. How many rides are there with each cab type?
  2. For each number of passengers, what is the average cost of all taxi rides?
  3. In each year, and for each passenger count, how many taxi rides were there?
  4. For each combination of passenger count, year, and trip distance, how many rides were there, with results ordered by year and then by number of rides, descending.
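For readers who think in code, here is a rough pandas sketch of the same four aggregations. The file name trips.csv and the column names (cab_type, passenger_count, total_amount, pickup_datetime, trip_distance) are assumptions for illustration, not necessarily the exact schema used in the benchmark:

import pandas as pd

# Hypothetical extract of the taxi data; see the caveat about column names above.
trips = pd.read_csv("trips.csv", parse_dates=["pickup_datetime"])
year = trips["pickup_datetime"].dt.year.rename("year")

# Query 1: rides per cab type
q1 = trips.groupby("cab_type").size()

# Query 2: average cost per passenger count
q2 = trips.groupby("passenger_count")["total_amount"].mean()

# Query 3: rides per (year, passenger count)
q3 = trips.groupby([year, "passenger_count"]).size()

# Query 4: rides per (passenger count, year, rounded trip distance),
# ordered by year, then by ride count descending
dist = trips["trip_distance"].round().astype(int).rename("distance")
q4 = (trips.groupby(["passenger_count", year, dist]).size()
           .reset_index(name="rides")
           .sort_values(["year", "rides"], ascending=[True, False]))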

Now we have a dataset and a set of queries that are far outside FeatureBase's comfort zone, along with a long list of performance comparisons on different combinations of hardware and software.

So instead of tens of millions of boolean attributes, we have just a few attributes with more complex types: integers, floating-point numbers, timestamps, and so on. In short, this data is far better suited to a relational database than the data for which FeatureBase was designed.

When we originally ran this benchmark, we did it on Pilosa. See below for how we approached it, but note that we have since made many improvements in FeatureBase.

We stood up a 3-node Pilosa cluster on AWS c4.8xlarge instances, with an additional c4.8xlarge to load the data. We used our open source pdk tool to load the data into Pilosa with the following arguments:

pdk taxi -b 2000000 -c 50 -p <pilosa-host-ip>:10101 -f <pdk_repo_location>/usecase/taxi/greenAndYellowUrls.txt

This took about 2 hours and 50 minutes, which included downloading all of the CSV files from S3, parsing them, and loading the data into Pilosa.

If we were to add our results to Mark’s table, it would look like the following:

Note that the hardware and software are different for each setup, so direct comparisons are difficult.

We should note that Pilosa “cheats” a bit on query 1; due to the way it stores data, Pilosa already has this result precomputed, so the query time is mostly network latency.
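The idea, in a toy Python sketch (illustrative only, not Pilosa internals): if a count of set bits is maintained per row as data is ingested, the answer to "rides per cab type" is a dictionary lookup rather than a scan over the data:

ride_bitmaps = {"yellow": {0, 1, 2, 5}, "green": {3, 4}}

# Maintained incrementally at write time in a real system; computed once here.
cached_counts = {cab: len(bits) for cab, bits in ride_bitmaps.items()}

def rides_per_cab_type():
    return cached_counts  # query 1 reduces to returning precomputed counts

print(rides_per_cab_type())  # {'yellow': 4, 'green': 2}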

For the remainder of the queries, however, Pilosa does remarkably well—in some cases beating out exotic hardware such as multi-GPU setups. The 0.177s time on query 3 was particularly startling—performance was along the lines of 8 Nvidia Pascal Titan Xs. It looks like kdb+/q is beating us pretty soundly, but keep in mind that those Xeon Phi 7210s have 256 hardware threads per chip, as well as 16GB of memory on the package. This gives them performance and memory bandwidth closer to GPUs than CPUs. They are also about $2,400 apiece.

For us, these results are enough to validate spending more time optimizing Pilosa for uses outside of its original intent. We know that Pilosa’s internal bitmap compression format is not optimized for dense data, and more research has been done in this area with exciting results (e.g. roaring-run), so we have reason to believe that there is significant room for improvement in these numbers.
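As a toy illustration of why run-length style encodings help with dense data (this is a simplified sketch, not Pilosa's actual roaring format), consider collapsing sorted bit positions into (start, end) runs:

def run_length_encode(sorted_bits):
    # Encode a sorted iterable of set bit positions as (start, end) runs.
    runs = []
    for b in sorted_bits:
        if runs and b == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], b)   # extend the current run
        else:
            runs.append((b, b))           # start a new run
    return runs

# A dense column sets nearly every bit in a range, so runs collapse dramatically:
print(len(run_length_encode(range(1_000_000))))  # 1 run instead of 1,000,000 entries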

UPDATE: After implementing run-length encoding and some other optimizations, Pilosa is significantly faster on many of these queries and beats every other system on query 3!
