What if we just didn’t decompress it?
Mar 5, 2025 · by Nicholas Gates · 7 min read
All columnar file formats support the idea of projection push-down. That is,
reading only the columns that are requested by the reader or are required to
evaluate any filter predicates.
Most columnar file formats support the idea of predicate push-down. That is,
using Zone Maps to
prune large chunks of rows whose statistics provably fail the predicate.
Vortex is unique in
the way it evaluates filter and projection expressions by supporting full
compute push-down, in many cases avoiding decompression entirely.
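To make zone-map pruning concrete, here is a minimal sketch. The Zone type and scan_less_than function are hypothetical stand-ins for illustration, not the real Vortex API: each zone carries min/max statistics, and a chunk whose statistics provably fail the predicate is skipped without ever being touched.

```python
# Hypothetical zone-map sketch: per-chunk min/max statistics let a scan skip
# chunks that provably cannot satisfy a predicate like `x < rhs`.
from dataclasses import dataclass

@dataclass
class Zone:
    min: int
    max: int
    rows: list

def scan_less_than(zones, rhs):
    out = []
    for z in zones:
        if z.min >= rhs:
            continue  # every row provably fails `x < rhs`: never read this chunk
        out.extend(x for x in z.rows if x < rhs)
    return out

zones = [Zone(0, 9, [1, 5, 9]), Zone(50, 99, [50, 99]), Zone(8, 20, [8, 20, 12])]
result = scan_less_than(zones, rhs=10)  # the middle zone is pruned entirely
```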
Scalar Compute
Most compute functions in Vortex are “scalar”, meaning they operate
element-wise, and are very often invoked with a literal value on the right-hand
side. For example,
a < 10 is much more common than a < b where both sides are arrays.

🤓 Examples of non-scalar compute functions are sum, min, max, binary_search, etc. These functions can also benefit from push-down, but in a tradition harking back to university, I will leave this as an exercise for the reader.
It is not too hard to see how scalar compute functions can be pushed down
through simple lightweight compression schemes.
Constant Encoding
Let’s start with the simplest compression codec of them all, constant. If all
values of an array are the same, Vortex collapses the data into a
ConstantArray holding only a Scalar and a len.

In this case, we can perform all scalar compute in constant time, with the
result also being a ConstantArray.

Python
def add(array: ConstantArray, rhs: Scalar) -> ConstantArray:
    return ConstantArray(array.scalar().add(rhs), len=len(array))
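The pseudocode above isn’t runnable on its own, so here is a self-contained sketch using a toy ConstantArray (hypothetical, not the real Vortex type) showing that both arithmetic and comparisons stay O(1) regardless of the array’s length:

```python
# Toy constant-encoding push-down: compute once on the scalar, keep the length.
from dataclasses import dataclass

@dataclass
class ConstantArray:
    scalar: object
    length: int

    def __len__(self):
        return self.length

def add(array, rhs):
    # O(1) regardless of len(array): one scalar addition.
    return ConstantArray(array.scalar + rhs, len(array))

def less_than(array, rhs):
    # Comparisons are O(1) too: the result is a constant boolean array.
    return ConstantArray(array.scalar < rhs, len(array))

c = ConstantArray(scalar=3.0, length=8)
summed = add(c, 2.0)       # ConstantArray(scalar=5.0, length=8)
mask = less_than(c, 10.0)  # ConstantArray(scalar=True, length=8)
```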
Dictionary Encoding
Another very common type of compression is dictionary encoding. This array holds
one child containing (typically) unique values, and a second child containing
pointers into those values.
Hopefully by now you can guess how one might perform scalar compute over
dictionary arrays!
Python
def add(array: DictArray, rhs: Scalar) -> DictArray:
    return DictArray(values=array.values().add(rhs), codes=array.codes())
Scalar compute runs in time proportional to the number of unique values, rather
than the length of the array. This can vastly reduce runtime, particularly for
complex functions and large data types like strings.
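As a runnable sketch of that asymptotic win (using a toy DictArray, not the real Vortex type), pushing a comparison through dictionary encoding touches each unique value exactly once, while the codes are passed through untouched:

```python
# Toy dictionary-encoding push-down: compute over unique values, reuse codes.
from dataclasses import dataclass

@dataclass
class DictArray:
    values: list  # (typically) unique values
    codes: list   # per-row indices pointing into `values`

    def decode(self):
        return [self.values[c] for c in self.codes]

def less_than(array, rhs):
    # O(len(values)), not O(len(codes)): each unique value is compared once.
    return DictArray(values=[v < rhs for v in array.values], codes=array.codes)

arr = DictArray(values=[5, 12, 9], codes=[0, 1, 1, 2, 0])
mask = less_than(arr, 10)  # codes unchanged, values now booleans
```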
ALP Encoding
Perhaps a less well-known compression codec is
Adaptive Lossless Floating-Point.
In this codec, two exponents
e and f are chosen such that floating point
numbers can be stored losslessly as integers. For example, 1.234 might be
stored as 1234. The details are very interesting, and our MIT-licensed Rust
implementation is
available on GitHub,
but all you really need to know is that decoding a single float from ALP takes
two floating-point multiplications.

Python
def alp_decode(a):
    return a * 10.0**f * 10.0**-e
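To make this concrete, here is a self-contained sketch with hypothetical exponents e and f chosen for illustration, including the round-trip check that decides whether a value (or a predicate’s right-hand side) can safely live in integer space:

```python
# Sketch of ALP encode/decode with illustrative (not Vortex-chosen) exponents.
e, f = 3, 0  # hypothetical: scales values like 1.234 to the integer 1234

def alp_encode(v):
    # Two multiplications plus a round, mirroring the decode direction.
    return round(v * 10.0**e * 10.0**-f)

def alp_decode(a):
    return a * 10.0**f * 10.0**-e

def roundtrips(v):
    # A value is only representable in ALP integer space if encoding then
    # decoding reproduces it exactly.
    return alp_decode(alp_encode(v)) == v

# `a < 1.567` may be rewritten as `encoded(a) < 1567`, but only when the
# literal itself round-trips under these exponents.
rhs = 1.567
if roundtrips(rhs):
    pushed_rhs = alp_encode(rhs)  # compare in integer space instead
```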
While not all operations can be pushed down over ALP, many can. For example, a
comparison such as a < 1.567 can be turned into a < 1567, provided that
1.567 successfully round-trips the ALP compression with the given e and f
exponents.

Running some quick benchmarks, a < 1567_i64 is ~40% faster (a relative runtime
of 0.6) than alp_decode(a) < 1.567_f64, and so it makes sense for us to push
down comparisons over the encoded ALP integers rather than decompress the
floats.

Hierarchical Compression
Things get even more exciting when we consider that Vortex uses cascading
hierarchical compression. What starts life as a float array, may end up as
Dict + ALP + FastLanes BitPacked array and benefit from several layers of
compute push-down.
💡 Recall that FastLanes BitPacking enables us to pack integers into their smallest bit-width with incredible SIMD acceleration.
Let’s set aside dictionary push-down for now (given it is asymptotically
beneficial in almost all cases) and focus on ALP + BitPacked.
The output of the Vortex command line helpfully describes the structure of the
l_discount column of the
TPC-H dataset.

Text
root: vortex.alp(f64, len=8388608) nbytes=4.19 MB (100.00%)
  metadata: ALPMetadata { exponents: Exponents { e: 14, f: 12 } }
  encoded: fastlanes.bitpacked(i64, len=8388608) nbytes=4.19 MB (100.00%)
    metadata: BitPackedMetadata { bit_width: 4, offset: 0, patches: None }
    buffer (align=8): 4.19 MB
In Parquet, a simple
a < 10.0 operation first requires page decompression
(typically a general-purpose compressor like LZ4 or Snappy), followed by
dictionary decoding, and finally running the floating-point comparison.

Vortex avoids opaque general-purpose compressors, meaning we are able to
directly memory-map the hierarchical array before performing our push-down
compute. Once the compressed array is in memory, we push the operator through
ALP encoding into integer space arriving at the BitPacked array.
SIMD, SIMD, SIMD
And this is where things get even more interesting. Much of the performance of
lightweight compression comes from heavy use of SIMD instructions. My MacBook
has 128-bit SIMD registers, meaning each instruction can operate over 2 x 64-bit,
4 x 32-bit, 8 x 16-bit, or 16 x 8-bit lanes at the same time.
As you might expect, code written to benefit from SIMD acceleration sees
practically linear scaling with these factors. That is, code operating over i8s
runs ~8x faster than code running over i64s.
Take our bit-packed integer array from l_discount:

Text
fastlanes.bitpacked(i64, len=8388608) nbytes=4.19 MB (100.00%)
  metadata: BitPackedMetadata { bit_width: 4, offset: 0, patches: None }
  buffer (align=8): 4.19 MB
We see that 8.3 million i64 values are bit-packed into 4-bit unsigned integers,
with no patched exceptions. That means all of the values are
between 0 <= x < 16.

Since we’re performing a comparison of a < 10, rather than any sort of
integer arithmetic, we don’t care about overflow semantics. Vortex therefore
performs the comparison over the narrowest plausible integer type, in this case
a u8.

From our baseline of decompressing 8.3 million ALP-encoded floats (1.0), we push
the comparison into 2 x i64 integer space (0.6), recalling my MacBook’s 128-bit
SIMD registers, and finally into 16 x u8 space (0.075), for a total 92.5%
improvement!
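As a pure-Python sketch of the idea (the real Vortex kernels are SIMD-accelerated; the bitpack4 and unpack_and_compare functions here are hypothetical stand-ins), we can compare 4-bit-packed values in narrow unsigned space without ever widening them back to i64:

```python
# Toy 4-bit bit-packing plus a fused unpack + compare in the narrowest type.

def bitpack4(values):
    # Pack pairs of 4-bit values (0 <= x < 16) into single bytes.
    assert all(0 <= v < 16 for v in values) and len(values) % 2 == 0
    return bytes((values[i] << 4) | values[i + 1] for i in range(0, len(values), 2))

def unpack_and_compare(packed, rhs):
    # Each nibble is compared as a small unsigned int, never widened to i64.
    out = []
    for b in packed:
        out.append((b >> 4) < rhs)
        out.append((b & 0x0F) < rhs)
    return out

values = [0, 3, 9, 10, 15, 7]
mask = unpack_and_compare(bitpack4(values), rhs=10)
```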
Can we go further?
I have to admit that Vortex doesn’t currently perform the last step of running
the comparison in u8-space, but the architecture of Vortex means it’s very easy
to add and is coming soon!
Besides this, we can ask ourselves whether there’s even more room for
improvement. Much of what we’ve discussed here can be further optimized to make
better use of CPU caching. By fusing the logic of bit-unpacking + comparison
operators, we are seeing early indications of another ~40-50% performance
improvement and will be talking more about this soon.
In the meantime, we are busy working away to stabilize the initial Vortex file
format, ready for release into the world. Reach out
if you’re facing problems in this space, or if you’re interested in joining our
in-person teams in London 🇬🇧 + NYC 🇺🇸.