Trick-or-treating with Vortex

Oct 31, 2024 · by Andrew Duffy · 9 min read

Here at Spiral, we’re building the data warehouse for the age of AI. We’ve been hard at work on Vortex, our joint file-memory format for wrangling large arrays. Building a new analytic file format is no small task, and our team has been hard at work across the board, from compression codecs and performance tuning all the way to improvements in the Python SDK and documentation.

With the spirit of Halloween in the air, we wanted to take a moment to reflect on all the progress we've made this month.

Treat 🍫: German Strings

Section illustration introducing the German Strings update.

In #1082, we introduced a new canonical string format for Vortex: the variable-length binary view, which we affectionately call the VarBinView.

If you’ve watched an Andy Pavlo lecture or keep up with database lore, you might have also heard these referred to as “German Strings”, so named because they were introduced by the folks at TUM in their paper describing the Umbra database. Much has been written about the benefits of German Strings for query processing, and we encourage you to checkout recent articles by the folks over at InfluxData as well as Polars.

It’s been just over a year since Apache Arrow added an array type to represent German Strings (which arrow-rs calls StringView ). We added support to Vortex shortly afterward, however due to missing support in Apache DataFusion and PyArrow, we were unable to make it our default Arrow-compatible encoding for strings. Over the course of this year however, the DataFusion contributors have done tremendous work to improve support for StringViews. We were also glad to do our part, contributing bug-fixes and performance improvements like #6077, #6168, #6171 and #6368. It was a glorious day when we managed to get all TPC-H and PyVortex tests running with VarBinView 🎉

It’s still early days, but now Vortex users that need to share data via Arrow can do so using the very best string format around.

Trick 👻: Filtering compressed strings in-place

In our benchmarks, we've found that our statistics-based compressor will pick FSST to compress most string datasets. It’s fast and effective, and this month we landed a change to make it even better.

FSST uses a symbol table for compression and decompression. Given the same symbol table, the same string will compress into the same binary code sequence deterministically. As part of #1082, we changed the evaluation logic of = or != predicates over FSST-encoded arrays. Previously, we would decompress into an Arrow string array and evaluating using Arrow compute kernels. We now compress the comparison value using the symbol table, and execute the operation over the compressed array.

This cut the latency of certain TPC-H queries in half, purely by avoiding the need to decompress.

Treat 🍫: ALP-RD for high-precision float data

Floating-point numbers can encode up to 64 bits of precision. However, many real-world datasets use only a fraction of that range. Stock prices, temperature measurements and wind speeds are all relatively low-precision, and with some cleverness you can reclaim some of those bits.

The Adaptive Lossless floating-Point algorithm (“ALP” for short) allows Vortex to store low-precision floating point numbers as small integers, which we can then encode using one of several lightweight integer encodings such as frame-of-reference and bit-packing. We implemented ALP support way back in #72, and it’s served us these past several months as our only floating point compression codec.

However, sometimes you do encounter a dataset that calls for high precision, whether it’s geospatial coordinates or high-energy physics.

This month, landing in #947 we implemented the “ALP for real doubles” extension, also known as ALP-RD. This variant exploits the fact that vectors of high-precision numbers often share the same exponent and initial bits of precision, allowing us to identify a good cut point to divide all of the vectors into left and right halves. We can pack the right-bits down from 64 to whatever the cut-point dictates, while the left bits can first be dictionary-encoded, and then bit-packed.

As an example, say we take an array of longitudes for 10 American cities:

Table of 10 American city longitudes shown in 64-bit binary; the leftmost bits visibly cluster across rows while the rightmost bits look random.

Look closely at the binary: can you see it? It’s clear that the left-most bits for the values tend to look the same, while the right-most bits seem fairly random. If we choose a good split point, we can deduplicate some of this recurring information.

Let’s take bits 0-15 to be the left, and bits 16-63 as the right:

Same longitude table split into a left half (bits 0–15) and a right half (bits 16–63), highlighting the 6 unique left-bit patterns.

There are 6 unique values on the left, which we can assign as dictionary codes 0 through 5.

Same table after dictionary-encoding the 6 unique left-half patterns into codes 0–5, paired with the unmodified right halves.

Bit-packing the left and right halves individually, we can represent this in 48 + 3 = 51 bits per value versus the original 64, a savings of over 20%, without any loss of precision! There are some other tricks that we won’t cover in this post, but we encourage as always to check out the code, open issues, and ask questions!

Similar to FastLanes and FSST, we’re working to move the core logic behind both ALP and ALP-RD out into their own Rust crate to be easily accessible to the broader community, so stay tuned for updates on that front.

Treat 🍫: File format improvements

The file format has grown up a lot this month, and a big focus has been on adding the necessary infrastructure to allow backwards-compatible changes moving forward.

In #1077 we added version information to our file footers, setting the stage for version-aware readers in the future. And with #1098, we’ve added a new self-describing layout definition with the schema provided inline with the array data. This is important for layouts with optional fields that have to be discovered at runtime. Internally, we’re keen to use this to represent column chunk statistics, where all statistics fields are optional and the set of statistics changes over time.

Separately from versioning support, the other big change for the format is an initial implementation of filter pushdown. This is a critical component of any analytic file format, allowing us to skip both scanning and decoding of file sections that will not be used.

As a side effect of this change we no longer force arbitrary partitioning on read but instead the natural partitioning as defined by the file. We avoid needlessly slicing and recombining data which leads to full scan improvements.

We’re excited to continue extending this work, with further improvements like statistics pruning, index selection and parallel reads coming soon.

While we anticipate a few more breaking changes to the format (we need to finish auditing the encoding-specific metadata) we are laying the groundwork to get to a stable future.

Trick 👻: Compressor performance tuning

When writing a Vortex file, the most expensive part (by far) is compression. We made major improvements in the Compressor this month, which led to much faster writes, smaller files, and faster reads!

We improved the objective function in the sampling compressor, resulting in up to 30% faster compression and up to 70% faster decompression. The key insight here was to add a fixed 64-byte penalty to each array child in estimating size, which helps us avoid choosing a deeper compression tree (more expensive to decompress) to save e.g., 1 byte overall.
We changed the compression strategy for chunked arrays, improving compression throughput by up to 50%. For chunked arrays, rather than “starting over” on each chunk when searching for a good combination of encodings, we now check if the previous chunk’s chosen encodings perform well on the new chunk. If they do, then we simply reuse the same set of encodings.

Treat 🍫: An improved Python integration

The Vortex Python API has also made significant strides this month. PR #1042 enabled Vortex to create Pandas data frames without copying data, leveraging Pandas' support for Arrow extension arrays and data types. Shortly after, PR #1089 brought the PyArrow Dataset API to Vortex files.

The Dataset API serves as a bridge between query engines (like DuckDB and Polars) and file formats (such as Vortex or Parquet). It allows query engines to "push down" column projections and row filters directly into the file format. For columnar formats like Vortex, this means reading fewer columns translates to reading proportionally fewer bytes. In workloads constrained by bandwidth or decoding time, this efficiency boost frees up more processing power for meaningful analytical tasks.

We're also excited to publish our new Documentation!

Treat 🍫: Community contributions

Finally, we also wanted to acknowledge external contributions that we accepted this month. We’re excited to see people in the broader data community take interest and we look forward to more PRs!

35% faster decompression with less boundary check (FSST) by @XiangpengHao (first contribution! 🎉)
feat: add dbtext decompression benchmark (FSST) by @NickCondron (first contribution! 🎉)
Implement a filter compute kernel for dictionaries that avoids eager decompression, also by @XiangpengHao

For what good is a bag of treats if you can't share it with friends?

Happy Halloween!