Latest posts
Logical vs Physical data types
Feb 8, 2025 · by Nicholas Gates · 9 min read
Here at Spiral we are working to build data warehousing for the AI era.
Today, we’re kicking off our series of deep-dives into one important piece: our
open-source file format
called Vortex.
The TLDR for Vortex is:
-
An extensible, state-of-the-art columnar file format
-
A shared data layout on-disk, in-memory, and over-the-wire, which enables zero-copy and (almost) zero-allocation reads
-
Mechanisms to perform compute over compressed data
-
Cascading lightweight compression scheme based on BtrBlocks
-
Configurable file layout (to row-group or not to row-group…?), allowing writers to tune for fast writes, fast reads, small files, few columns, many columns, over-sized columns, etc.
-
Carefully designed for forward compatibility. Vortex may well be the last file format you ever need 🙈
-
Early benchmarks suggest a roughly comparable compression ratio to Parquet (with zstd), with 1-2x write throughput, 2-3x faster scans, and 200x faster random access
The rest of this post will introduce the idea of logical and physical types, and
explain why Vortex leans into a logical type system.
Prior Art
The most common setup for analytical data is to use Apache Parquet as a
compressed atomic file format, to push-down some row filtering and column
pruning to skip irrelevant data, and then to decompress the remainder into
memory for further processing.
Historically, this in-memory format has been specific to the compute engine.
Spark has its own representation, as does Presto, as does Numpy/Pandas. This
meant that sharing data between these systems, for example invoking a Python UDF
from Spark, incurred a huge conversion cost.
Apache Arrow was created in 2016 to solve this problem of interoperability by
providing a common in-memory layout that can be shared, zero-copy, between
languages. It succeeded wildly and in turn spawned
a revitalized ecosystem of
data tooling with relative high performance.
🐍 The Scientific Python stack demonstrated the enormous value of a shared in-memory format with Numpy arrays underpinning almost all Python data projects including Pandas, SciPy, Xarray, and more. It is perhaps no surprise that Wes McKinney, the creator of Pandas, helped create Apache Arrow to bring these benefits to the wider data ecosystem.
Arrow is a very good solution for communicating and sharing data after it has
been scanned, but it is not particularly meant as a storage format.
Uncompressed Arrow data can easily occupy 10x the number of bytes compared to
the compressed representation of on-disk formats like Parquet.
Vortex is a file format and associated in-memory representation that natively
supports compressed data. Compressed Vortex arrays can be loaded from disk,
network, object stores, and even memory-mapped with very little overhead,
allowing us to choose precisely when and how much data to decompress.
We believe this is the right interface between storage and compute. But the
obvious question is... couldn't we just add this to Arrow?
Arrow Has Physical Types
Arrow is built around a physical type system. That means each type of array
has a strict specification for how the data is laid out in memory.
For example, Arrow represents a
StringArray as a single buffer of contiguous
string data, and an additional buffer of offsets signed 32-bit integers pointing
to the start of each string. The array pa.array(['hello', 'world']) would be
stored like this:StringArray
offsets: [0, 5, 10]
data: ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
Due to the limit on the maximum value of a 32-bit offset, a
single
StringArray can only store at most ~2.147GB of string data. To work
around this limitation, Arrow also has a LargeStringArray that uses 64-bit
offsets (9.2 exabytes if you were wondering), but these are two different and
incompatible types.When returning Arrow data to a compute engine, such as DataFusion, it is
typically passed via an
ArrowArrayStream —
“a streaming source of data chunks, each with the same schema.”
And here lies the fundamental problem with using physical type systems at the
boundary of storage and compute: it forces all data in a column to be
represented in the same way.
Parquet might compress strings using dictionary encoding for one row group, and
run-length encoding for another row group. Arrow supports
both
DictionaryArray<String> and RunEndArray<String> , but we are forced to
decompress into a common representation.What’s so bad about that?
Let's Get Logical...
Almost all modern compute engines, for
example DuckDB and Velox,
have custom internal representations for
data, allowing them to perform compute over dictionary encoded data and
other partially decompressed arrays.
Notably,
Arrow’s DataFusion compute
engine does not do
this, but soon will.
This unlocks some very powerful optimizations. Suppose we have
a
DictionaryArray with 1 million strings, but only 2 unique
values foo and bar. A filter expression of value == "baz" can very quickly
rule out any matches by performing two comparisons, rather than 1 million.In order to read dictionary encoded data directly from a Parquet file, many of
these compute engines have implemented their own Parquet readers; examples
include Arrow, DuckDB, Velox, Impala,
and cuDF (in
total, I have
found 10 distinct implementations of Parquet!)
Having such a diverse ecosystem is a sign of how successful Parquet has been,
but unfortunately this imposes drag on how fast the format can evolve. If
readers don’t support a new encoding or feature, then writers are more hesitant
to enable it.
By separating logical and physical types at the storage layer (the file format
itself), rather than the compute layer, Vortex is able to return data to
compute engines in whatever format works best for them, all with minimal
conversion overhead.
This helps to future-proof Vortex, allowing us to pick up and support new
compression codecs or new compute engines, all while improving today’s
performance.
Vortex Has Logical Types
Now we know why Vortex opts for a logical type system, let’s see what’s
supported. The
vortex-dtype crate contains the following logical types:-
Integers - signed and unsigned, 8, 16, 32 and 64 bits
-
Floats - 16, 32 and 64 bits
-
Bools
-
UTF-8 and Binary
-
Structs
-
Lists
-
Extension Types - an underlying logical type with optional metadata
There are a few types we’re still missing, for example decimals and unions, but
it’s mostly complete.
...and Physical Encodings
In Vortex, the physical representation of an array is called an encoding.
Encodings are fully extensible and encapsulate logic for interpreting the
array’s memory into the described logical data type. This is a many-to-many
relationship; for example, dictionary encoding can represent any of the logical
data types by de-duplicating values.
Vortex Core includes a base set of encodings designed to mirror Arrow’s physical
types, including all three forms of string encoding: String, LargeString,
StringView
(German Strings).
Other useful encodings in Vortex include:
-
Constant - Stores a single scalar value and a length, trivially allowing us to optimize away a lot of compute.
-
Chunked - Native support for assembling arrays from multiple chunks of the same logical type, without constraints on the physical encoding.
-
Sparse - Stores a scalar fill value, as well as an array of indices and associated ‘patch’ values. This is very useful when data is almost uniform. For example, if 99% of values pack into 3-bits, but a handful are much larger.
-
ByteBool - unlike Arrow (and notably the C++
std::vector<bool>), most languages store arrays of booleans using one byte each. This encoding allows us to support zero-copy to byte-per-bool encodings, such as with Numpy arrays.
And of course, a full set of state-of-the-art compressed encodings:
-
FastLanes BitPacking - packs integers into the minimal number of bits required to represent the actual values. Read more in our post: Life in the FastLanes.
-
FSST - a technique for string compression that supports fast random access and compressed comparisons. Read more in our post: Compressing strings with FSST
-
... and many others, including FastLanes Delta, Frame of Reference, Dictionary, RunEnd, Adaptive Lossless Floating Point (ALP), and ZigZag.
All of these encodings can also be composed hierarchically, allowing writers to
tune for the best compression ratio, the highest performance, or anything
in-between.
Early Days
The Vortex project is still in its infancy, but we're excited to engage with the
open source community to improve it. If you have feedback or questions, please
feel free to
file GitHub issues!
P.S. If you stuck around this long, why
not send us your CV! We’re hiring for in-office
roles in NYC 🇺🇸 and London 🇬🇧.