Towards Vortex 1.0
Preparing to launch Vortex 1.0 and the 2025.05 Edition

Vortex is a new columnar file format designed to take advantage of state-of-the-art research. An important aspect of this is the ability to stay at the cutting edge as that research evolves.
Vortex achieves this by keeping the footer specification absolutely minimal, and publishing date-stamped “editions” describing a blessed set of encodings and layouts that a file may contain.
To understand what any of this means, let’s touch on a little background about Arrays and Layouts.
Vortex Arrays
The Vortex ecosystem covers both the file format as well as in-memory compressed arrays. These arrays can be read zero-copy from the file (enabling memory mapping), giving clients more control over where and how to run I/O versus CPU work.
Each in-memory array has a logical type (e.g. UTF-8), some number of aligned byte buffers, some number of child arrays, and a physical encoding (e.g. FSST). The logical type determines what data is represented by the array, and the physical encoding determines how that data is laid out as bytes.
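As a rough sketch of this separation, the hypothetical types below (not the real Vortex API) show an array carrying both a logical type and an independent physical encoding:

```rust
/// The logical type: *what* the array represents.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum DType {
    Utf8,
    Int32,
}

/// The physical encoding: *how* the bytes are laid out.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum Encoding {
    Flat, // plain, uncompressed buffers
    Fsst, // FSST string compression
    Dict, // dictionary encoding
}

/// An in-memory array: a logical type, aligned byte buffers,
/// child arrays, and a physical encoding.
#[allow(dead_code)]
struct Array {
    dtype: DType,
    encoding: Encoding,
    buffers: Vec<Vec<u8>>,
    children: Vec<Array>,
}

impl Array {
    /// The logical type is stable regardless of how the array is encoded...
    fn dtype(&self) -> &DType {
        &self.dtype
    }
    /// ...while the encoding may change, e.g. after compression.
    fn encoding(&self) -> &Encoding {
        &self.encoding
    }
}
```

The key design point is that re-encoding an array (say, compressing UTF-8 data with FSST) changes only its physical encoding; consumers reasoning about the logical type are unaffected.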
As a brief digression, and to complete the picture, Vortex encodings can provide implementations of compute kernels in order to support push-down compute over compressed data. Using this technique, Vortex achieves random access performance from local disk up to 200x faster than Parquet.
Array encodings are entirely pluggable, meaning it’s possible to create custom encodings for your use-case and perform push-down compute, send them over-the-wire, or even embed them in a Vortex file.
You can sort of think of these arrays as compressed Arrow (although for a variety of reasons I wouldn’t recommend blindly swapping out Arrow for Vortex in your projects).
Vortex Layouts
If Vortex Arrays represent complete data in-memory, Vortex Layouts can be seen as the “lazy” equivalent. We can understand the shape of a layout, and plan how we might scan it, without fetching any of its data.
Similar to an array, a layout has a logical data type, some number of child layouts, and some number of lazy data buffers called segments.
As with arrays, layouts are completely extensible and can be written for your specific use-case. The current set of built-in layouts includes:
FlatLayout - storing a single serialized Vortex array in a single data segment
StructLayout - partitioning struct arrays into one child layout per field
ChunkedLayout - partitioning large arrays into one child layout per row partition
ZoneMapLayout - storing statistics for each logical zone of an array, independent of how the data is physically partitioned.
DictLayout - storing one layout for unique dictionary values, and one layout, typically chunked, containing indices into the dictionary.
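To illustrate how a layout tree enables planning without I/O, the hypothetical types below model Flat, Struct, and Chunked layouts whose leaves reference segments only by id (this is a sketch, not the real Vortex API):

```rust
/// A layout describes structure and references lazy segments by id,
/// without holding any data itself.
#[allow(dead_code)]
enum Layout {
    /// A single serialized array in a single data segment.
    Flat { segment_id: u64 },
    /// One child layout per struct field.
    Struct { fields: Vec<(String, Layout)> },
    /// One child layout per row partition; `row_offsets` has one more
    /// entry than `chunks`, delimiting each chunk's row range.
    Chunked { row_offsets: Vec<u64>, chunks: Vec<Layout> },
}

impl Layout {
    /// Plan which segments a scan of rows [start, end) would need,
    /// without fetching anything.
    fn segments_for_rows(&self, start: u64, end: u64, out: &mut Vec<u64>) {
        match self {
            Layout::Flat { segment_id } => out.push(*segment_id),
            Layout::Struct { fields } => {
                for (_, child) in fields {
                    child.segments_for_rows(start, end, out);
                }
            }
            Layout::Chunked { row_offsets, chunks } => {
                for (i, chunk) in chunks.iter().enumerate() {
                    let (lo, hi) = (row_offsets[i], row_offsets[i + 1]);
                    // Only recurse into chunks overlapping the scan range.
                    if lo < end && hi > start {
                        chunk.segments_for_rows(start, end, out);
                    }
                }
            }
        }
    }
}
```

A scan of the first few rows of a chunked layout touches only the first chunk's segments; everything else stays unfetched.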
Note that layouts don’t specify where their segments live, only that they exist and can be lazily fetched. This means we can do all sorts of interesting things, such as accelerating Vortex with segment caches, or even storing segments in Postgres block storage. For the purpose of this post however, we will discuss the typical case of storing segments in a Vortex file.
Structure of a Vortex File
The structure of a Vortex file is relatively simple, consisting of a version field and the u16 length of the Postscript FlatBuffer:
0..4 Magic Bytes 'VTXF'
.... <segments>
.... Postscript FlatBuffer
-8..-6 Version: u16
-6..-4 Postscript Length: u16
-4.. Magic Bytes: 'VTXF'
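A reader locates the Postscript by parsing this fixed-size trailer. Here is a minimal sketch of that step, assuming little-endian integers (consult the format specification for the authoritative details):

```rust
/// Parse the trailer of a Vortex file: the last 8 bytes are a u16
/// version, a u16 postscript length, and the 4 magic bytes 'VTXF'.
/// Returns (version, postscript_length) on success.
fn parse_trailer(file: &[u8]) -> Option<(u16, u16)> {
    let n = file.len();
    // Need at least leading magic + trailer.
    if n < 12 || &file[0..4] != b"VTXF" || &file[n - 4..] != b"VTXF" {
        return None;
    }
    let version = u16::from_le_bytes([file[n - 8], file[n - 7]]);
    let ps_len = u16::from_le_bytes([file[n - 6], file[n - 5]]);
    // The Postscript FlatBuffer sits immediately before the trailer,
    // at file[n - 8 - ps_len .. n - 8].
    Some((version, ps_len))
}
```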
The Postscript contains pointers to segments containing the schema, file-level statistics (e.g., min/max for each entire column), and the root layout. Each of these can be compressed or encrypted, and each contains its own FlatBuffer definition. You can find the full definitions in the documentation, but we will focus on the Footer.
table Postscript {
/// Segment containing the root `DType` FlatBuffer (optional).
dtype: PostscriptSegmentLocator;
/// Segment containing the file-level `Statistics` FlatBuffer (optional).
statistics: PostscriptSegmentLocator;
/// Segment containing the `FileLayout` FlatBuffer (required).
footer: PostscriptSegmentLocator;
}
table PostscriptSegmentLocator {
offset: uint64;
length: uint32;
compression: CompressionSpec;
encryption: EncryptionSpec;
}
table Footer {
/// The root [`Layout`] of the file.
layout: Layout;
/// Dictionary-encoded segment locators, up to u32::MAX.
segment_locators: [SegmentLocator];
/// Dictionary-encoded array specs, up to u16::MAX.
array_specs: [ArraySpec];
/// Dictionary-encoded layout specs, up to u16::MAX.
layout_specs: [LayoutSpec];
/// Dictionary-encoded compress specs, up to u3::MAX (8).
compression_specs: [CompressionSpec];
/// Dictionary-encoded encryption specs, up to u16::MAX.
encryption_specs: [EncryptionSpec];
}
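The reason the Footer dictionary-encodes its specs is that thousands of segments typically share a handful of compression and encryption settings, so each segment locator can store a small index into a spec table rather than repeating the spec itself. A sketch of the idea with hypothetical types (not the real schema):

```rust
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum CompressionSpec {
    None,
    Zstd { level: i32 },
}

#[allow(dead_code)]
struct SegmentLocator {
    offset: u64,
    length: u32,
    /// Index into the footer's spec table; 3 bits suffice for 8 specs.
    compression: u8,
}

struct Footer {
    segment_locators: Vec<SegmentLocator>,
    compression_specs: Vec<CompressionSpec>,
}

impl Footer {
    /// Resolve a segment's compression by following its dictionary index.
    fn compression_of(&self, segment: usize) -> &CompressionSpec {
        let idx = self.segment_locators[segment].compression as usize;
        &self.compression_specs[idx]
    }
}
```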
The spec fields in the Footer essentially describe the set of features used in the Vortex file. We have already discussed how Arrays and Layouts are extensible, but we can also define segment-level black-box compression and even segment-level encryption.
By keeping the Postscript and Footer definitions small, we minimize the potential need for a major revision of the format, while still allowing the encodings and layout of the file to evolve.
Vortex Editions
It is of course possible for a Vortex writer to use wholly custom layouts and encodings, with a custom write strategy deciding how to lay out and compress the data. It’s even possible to create a row-based file format if you decide columnar is not for you!
To tame this level of extensibility, we will shortly begin publishing “Editions” of the Vortex file format.
Each edition represents a set of features that a Vortex file may contain, including array encodings, layouts, compression schemes, and encryption algorithms. Readers will always be able to consume files written with an older edition of the writer, and writers can be configured to target older editions of a reader.
The name of each edition follows a YYYY.MM.DD format, allowing the writer to be configured on a moving scale. For example, I can configure a writer to support readers that are up to three months old by generating the maximum edition each time I write a file. This gives readers a three-month window in which to upgrade in order to guarantee they will be able to read any new files.
Finally, the minimum reader edition depends on the actual features used, rather than the potential features available to the writer. For example, suppose we release a new edition of Vortex with a tensor encoding. If the file contains no tensors, then existing readers will be perfectly happy to read the file.
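Because edition names are zero-padded YYYY.MM.DD dates, they compare chronologically under plain lexicographic ordering, which makes "newest edition at or before a cutoff" trivial to compute. A hypothetical helper (the real configuration API may differ):

```rust
/// Pick the newest published edition at or before `cutoff`, i.e. the
/// edition a writer should target to support readers of that vintage.
/// Editions and the cutoff are zero-padded "YYYY.MM.DD" strings, so
/// lexicographic order equals chronological order.
fn target_edition<'a>(published: &[&'a str], cutoff: &str) -> Option<&'a str> {
    published.iter().copied().filter(|e| *e <= cutoff).max()
}
```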
Write Strategies
Since Vortex is a largely self-describing format, we have a lot of scope to make improvements in the writer even within an edition.
The default Vortex write strategy is relatively complex and based loosely on ClickHouse’s files. But there is certainly room for improvement. Roughly speaking, it follows this sequence of steps:
Struct arrays (or tables) are split into fields.
Arrays for each field are repartitioned into 8k-row chunks; statistics are computed and stored in a zone map.
Arrays are repartitioned again, this time until they measure 1MB uncompressed.
Each 1MB chunk is passed into our BtrBlocks-inspired sampling compressor.
Compressed arrays are then buffered in memory until they measure 2MB compressed, after which they are flushed into the file. This creates some amount of locality within columns.
Finally, zone maps are flushed at the end of the file.
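The chunk-size arithmetic in the steps above can be sketched as follows, using the constants from the post (this is an illustration, not the real WriteStrategy implementation):

```rust
const ZONE_ROWS: u64 = 8 * 1024; // 8k-row zones for statistics
const FLUSH_BYTES: usize = 2 * 1024 * 1024; // flush buffered data at 2MB

/// Split `total_rows` into 8k-row zone-map partitions.
fn zone_partitions(total_rows: u64) -> Vec<(u64, u64)> {
    let mut parts = Vec::new();
    let mut start = 0;
    while start < total_rows {
        let end = (start + ZONE_ROWS).min(total_rows);
        parts.push((start, end));
        start = end;
    }
    parts
}

/// Return the indices of compressed chunks after which the in-memory
/// buffer reaches 2MB and is flushed to the file. Flushing in 2MB runs
/// is what creates locality within columns.
fn flush_points(compressed_sizes: &[usize]) -> Vec<usize> {
    let mut buffered = 0;
    let mut flushes = Vec::new();
    for (i, size) in compressed_sizes.iter().enumerate() {
        buffered += size;
        if buffered >= FLUSH_BYTES {
            flushes.push(i);
            buffered = 0;
        }
    }
    flushes
}
```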
Like most things in Vortex, the WriteStrategy is entirely pluggable. Using a combination of built-in layouts (specifically Chunked, Struct, and Flat), it’s possible to model the layout behavior of most existing columnar file formats, including Parquet. So if you really really want row groups, you can have them!
Anatomy of a Vortex File
To make everything a little clearer, here’s a rough diagram detailing the above:
[Diagram: Anatomy of a Vortex File]
Aside: Forwards Compatibility
Everything discussed thus far focuses on the idea of backwards compatibility. That is, a new reader can consume files written by an old writer. But in order to stay at the cutting edge, while still targeting the use-case of long-term storage, we believe it is important to bake the idea of forwards compatibility into Vortex. In other words, can we upgrade the writer to use new encodings and layouts, without breaking old readers?
This seems counter-intuitive: how can an old reader know about compression research papers before they’ve been published?
The answer is perhaps pushing the bounds of cheekiness, but we can configure writers to embed a WebAssembly decompression kernel into the file itself (or hosted at some trusted location). While this won’t perform at native speeds, SIMD support in WASM provides surprisingly good performance, and it’s probably reasonable to assume that slow is better than broken.
Note that forwards compatibility isn’t yet implemented, and we do not intend to support it in the initial Vortex release. While we intend to implement it in a way that protects as strongly as possible against executing malicious code, it may not be appropriate for use in security-sensitive environments or when reading files from untrusted sources.
Release Cadence
The Rust, Python, Java, and C bindings, the Terminal UI, and the DataFusion, DuckDB, and Polars client packages will continue to be released at a relatively high frequency. We have no plans at this time to release semantic 1.0 versions of these libraries.
We hope to release V1 of the Vortex File Format along with the first 2025.05.XX edition sometime in the coming month, as well as announce some other exciting changes to the Vortex project!