Vortex: a Linux Foundation Project

Building the Future of Open-Source Columnar Storage

Aug 6, 2025
by Nicholas Gates

When we first started building Spiral, our object-store native data system, we were faced with an interesting fundamental question: where does storage end and compute begin?

Today, we're excited to announce that our answer to this question, Vortex, has become a Linux Foundation Incubation-stage project, backed by Microsoft, Palantir, Snowflake, NVIDIA, InfluxData, ParadeDB, Polar Signals, Wes McKinney, CMU’s Database Group, and other visionary data builders. You can read more about it in the press release.

In the modern AI landscape, compute is far more heterogeneous than in pure big-data analytics. In addition to regular SQL analytics workloads, there is demand for high-throughput access to training data and for low-latency access for search & retrieval. All of this has to support many modalities of data: from basic strings and numbers, to images, videos, vectors, and even entirely unstructured data like PDFs.

To tackle this, Spiral was designed to be object-store native. By this I mean that data lives primarily in object storage, a low-cost system of record, with the means to accelerate access when required.

Spiral encourages users to bring their own compute, whether that is traditional SQL analytics with DataFusion, DuckDB, or Trino; tensor-based compute with PyTorch, TensorFlow, or JAX; or general-purpose distributed frameworks like Spark, Ray, and Dask.

With compute in Spiral being so democratized, we can see how the boundary between storage and compute becomes quite blurred. Can every system perform an effective push-down? How do we leverage ephemeral caches to accelerate workloads? How can we maximize throughput from object storage and saturate GPU bandwidth?

The answer to all of these, as you may have guessed by now, is Vortex.

Besides being a highly performant columnar file format, Vortex also provides composable tools for persisting, transferring, and querying compressed columnar data. The technical validation has been overwhelmingly positive: Microsoft, Snowflake, Palantir, and NVIDIA are backing it. TUM's fabled database group just released their latest paper, "AnyBlox", which independently calls Vortex the "cutting edge" in file formats.

Vortex achieves Parquet's compression ratios while delivering 10-20x faster scans, up to 5x faster writes, and 100x faster random access reads. Depending on the query and the engine, it is at worst no slower, and often dramatically faster. But the real, long-term breakthrough? Vortex is designed to support decoding data directly from S3 to the GPU, skipping the CPU bottleneck entirely.

The rest of this post goes into detail about what Vortex is, some of the awesome things it enables, and how we see the project evolving, having recently donated it to the Linux Foundation.

Vortex Architecture

It’s time for a whistle-stop tour of Vortex.

DTypes

As I've spoken about before, one of the first things we developed for Vortex was the concept of logical data types, or "DTypes". In contrast to Apache Arrow, we needed a way to separate the memory layout of an array from the logical data type it represents.

There is nothing particularly exciting about this type system (and that’s a good thing!). We are still missing a few core types such as fixed-length lists, fixed-length binary data, as well as a variant type to represent arbitrary nested JSON-like data.

One thing to note is that the Vortex StructDType supports lazy deserialization from FlatBuffer format. This allows Vortex to support ultra-wide schemas (>100k columns) with overhead proportional to how many columns are actually used.
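
To make this concrete, here is a toy sketch of lazy field resolution (illustrative Python only, not the actual Vortex implementation, which reads directly from the FlatBuffer bytes):

Python
# Toy model: field metadata is decoded on first access and memoized,
# so the cost is proportional to the columns actually touched.
class LazyStructDType:
    def __init__(self, raw_fields):
        self._raw = raw_fields  # stand-in for the serialized field table
        self._cache = {}

    def field(self, name):
        if name not in self._cache:
            self._cache[name] = self._raw[name].decode("utf-8")
        return self._cache[name]

# A 100k-column schema costs nothing up front...
schema = LazyStructDType({f"col_{i}": b"i64" for i in range(100_000)})
print(schema.field("col_42"))  # ...and decoding happens per accessed field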

Arrays

A Vortex array is an in-memory object implementing the array trait, which gives each encoding full freedom over how its data is held: flat Arrow arrays, compressed FSST string buffers, or even heap-allocated Roaring Bitmaps.

To tame the chaos, Vortex defines a canonical array type for each logical dtype. Almost all dtypes have a canonical form compatible with Apache Arrow. For example, UTF8 data has a canonical form equivalent to Arrow's Variable-size Binary View Layout. This allows for easy interoperability with the Arrow ecosystem, and all arrays must support conversion into this canonical form.
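
From Python, the round-trip through canonical form looks roughly like this (a minimal sketch assuming the vortex-data bindings; names such as vx.array and to_arrow_array may differ between versions):

Python
import vortex as vx

# Build a logical UTF8 array; Vortex is free to encode it however it likes.
arr = vx.array(["hello", "world", None, "vortex"])

# Canonicalize into the Arrow-compatible form for interop with Arrow tooling.
print(arr.to_arrow_array())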

Compute Functions

The core physical operator you need to implement when plugging a file format into a query engine is the Columnar Scan. You provide the engine some columns to read, along with a filter predicate, and you receive back a stream of records.

The only way to make a scan faster is to push down more, so that you can read less. Formats like Parquet contain statistics and bloom filters to achieve this, reducing the number of row groups read. The row groups that couldn't be pruned would then be expanded into some memory representation, often Apache Arrow record batches.

Vortex is no different, providing fine-grained statistics to reduce IOPS for filtered scans. But we took it a step further: filter-based pruning is great, but what if we could execute arbitrary expressions directly on compressed data? The network I/O and memory bandwidth savings would be significant.

Implementing it requires the kind of synergy between the in-memory and on-disk format that only exists in Vortex.

Vortex holds a registry of compute functions that operate over some number of array inputs and return an array output. There are plenty of built-in scalar functions, such as boolean logic, numeric comparisons, and even string pattern matching (the LIKE operator in SQL). There are also several aggregation functions, such as min, max, and sum.

All compute functions define an implementation over canonical arrays, as well as dispatching their internal logic to a registry of kernels. These kernels allow each array to override the implementation of a compute function in case they are able to perform it more efficiently than first converting into canonical form.

This push-down compute allows Vortex to defer decompression in a surprisingly large number of cases. For example, all scalar functions operating over dictionary arrays are forwarded to the unique dictionary values, and reconstructed as a dictionary array using the same untouched (and possibly compressed) codes.
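
The dictionary short-cut is easy to demonstrate with plain PyArrow (a concept demo of the idea, not the Vortex kernel itself):

Python
import pyarrow as pa
import pyarrow.compute as pc

# The dictionary-encoded input a scan might hold: 5 elements, 3 unique values.
codes = pa.array([0, 1, 0, 2, 1], type=pa.int32())
values = pa.array(["apple", "banana", "cherry"])
dict_arr = pa.DictionaryArray.from_arrays(codes, values)

# Run the scalar function over the 3 unique values, not all 5 elements...
upper = pc.utf8_upper(values)

# ...then rebuild the result around the untouched codes.
result = pa.DictionaryArray.from_arrays(codes, upper)
print(result)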

Compute push-down gets far more interesting in cases where we can push compute into the compressed domain. For example, I wrote about how we can see an 80% speed-up when performing comparisons against floating-point data by pushing the operation through ALP encoding into the integer domain.
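
A simplified sketch of the integer-domain trick (real ALP also handles exceptions and chooses exponents per vector):

Python
import numpy as np

floats = np.array([1.25, 3.50, 2.75, 4.00])
exponent = 2  # these values are exactly representable as int * 10**-2
encoded = np.round(floats * 10**exponent).astype(np.int64)  # [125, 350, 275, 400]

# Evaluate `x < 3.5` without decoding: rescale the literal instead of the data.
threshold = int(np.round(3.5 * 10**exponent))  # 350
print(encoded < threshold)  # [ True False  True False]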

Layouts

Where arrays represent in-memory data in Vortex, layouts represent out-of-memory data. The name comes from how layouts determine the partitioning of large arrays into smaller chunks before they are laid out in a file. But the concept generalizes regardless of where the underlying byte buffers (called segments) are stored.

Vortex includes three basic layouts: struct, chunked, and flat. These represent columnar partitioning, row-wise partitioning, and array-based leaf nodes respectively. By combining these into a layout tree, Vortex is able to model many different partitioning strategies of columnar data.
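
A toy model of a layout tree might look like this (hypothetical names for illustration; the real layouts also carry statistics and segment metadata):

Python
from dataclasses import dataclass

@dataclass
class Flat:          # leaf node: an encoded array's byte buffers
    segment_id: int

@dataclass
class Chunked:       # row-wise partitioning
    chunks: list

@dataclass
class Struct:        # column-wise partitioning
    fields: dict

# One column chunked into two row ranges, another stored whole.
layout = Struct(fields={
    "price": Chunked(chunks=[Flat(0), Flat(1)]),
    "name": Flat(2),
})
print(layout)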

By abstracting layouts from the Vortex file format itself, it's actually possible to leverage dtypes, compute push-down, late materialization, and many other Vortex features without persisting data to a file at all. For example, one of our early adopters, ParadeDB, is investigating storing Vortex layouts in Postgres block storage. Similarly, this allows us to easily plug in different strategies for segment caching, including in-memory, on-disk, and even remote caches.

Scanning

The most common operation we perform in Vortex is scanning. This involves reading a layout tree into a stream of array chunks using a filter expression to select rows.

Almost all implementations of scanning that you find in the wild will accept some form of filter expression. This is known as predicate push-down and often enables you to prune (avoid reading) large portions of data based on statistics alone.

Python
import pyarrow.parquet as pq
pq.read_table("test.parquet", filters=[("a", "<", 10)])

Most implementations of scanning allow you to select which columns to return in the result stream. This is known as projection push-down and allows file formats to prune yet more data, fetching only the requested columns.

Python
pq.read_table("test.pq", columns=["a", "b"], filters=[("a", "<", 10)])

Vortex takes this further and supports what we are calling expression push-down. Instead of a projection mask, the Vortex scan takes an arbitrary projection expression.

This allows Vortex to leverage push-down compute kernels to perform as much scalar compute as possible in compressed space, before returning data to the SQL engine in its native form.
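
For a feel of the shape of such an API, PyArrow datasets already accept expression-based projections; the Vortex scan takes an analogous projection expression, but can evaluate much of it before decompressing. (This is the PyArrow analogy, not the Vortex API; the file name and column names are placeholders.)

Python
import pyarrow.dataset as ds
import pyarrow.compute as pc

dataset = ds.dataset("trips.parquet", format="parquet")

# Project a derived column and filter, both expressed as compute expressions.
table = dataset.to_table(
    columns={"total": pc.field("fare") + pc.field("tip")},
    filter=pc.field("passenger_count") > 2,
)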

Evaluation Nodes

Vortex is able to achieve very fast scan performance in part because it models a scan as a tree of evaluation nodes. This is conceptually similar to a physical plan in a regular SQL engine, except it is designed to better support late materialization.

At the beginning of a scan, the layout tree is mapped into an equivalent tree of evaluation nodes. Each node is given the filter or projection expression as well as a row range and is able to transform the expression before passing it to its children.

This concept allows us to implement lots of interesting things!

The first of these is our "filter evaluation" node. This node splits the expression into conjunctive normal form and evaluates each conjunct one by one. After each evaluation, the selectivity of the conjunct is measured and the order of evaluation is updated. This allows Vortex to dynamically optimize the evaluation order of predicates, often allowing us to entirely short-circuit some evaluations (and therefore some data fetching).
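
In pseudocode terms, the adaptive reordering looks something like this (a simplified sketch, not the Vortex implementation):

Python
import numpy as np

def scan_with_adaptive_filter(batches, conjuncts):
    selectivity = [1.0] * len(conjuncts)  # no information yet
    order = list(range(len(conjuncts)))
    for batch in batches:
        mask = np.ones(len(batch), dtype=bool)
        for i in order:
            hits = conjuncts[i](batch)
            selectivity[i] = hits.mean()  # measure observed selectivity
            mask &= hits
            if not mask.any():
                break  # short-circuit remaining conjuncts (and their fetches)
        order.sort(key=lambda i: selectivity[i])  # most selective first next time
        yield batch[mask]

batches = [np.arange(1000), np.arange(1000, 2000)]
conjuncts = [lambda b: b % 2 == 0, lambda b: b < 10]
print([len(out) for out in scan_with_adaptive_filter(batches, conjuncts)])  # [5, 0]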

The second example is our zone map evaluation node. This node prunes large portions of the row mask by leveraging zone-based statistics. Unlike Parquet and other columnar formats, the Vortex zone map is stored based on logical 8k chunks, rather than being aligned to the physical page size. This allows for efficient intra-chunk pruning even in the absence of a row-group based layout strategy.
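
The pruning itself reduces to a min/max check per 8k-row zone (a simplified concept demo; the real zone maps live in the file and are consulted during evaluation):

Python
import numpy as np

ZONE = 8192  # rows per logical zone
values = np.arange(100_000)  # e.g. a monotonically increasing timestamp column
zones = [values[i:i + ZONE] for i in range(0, len(values), ZONE)]
zone_map = [(int(z.min()), int(z.max())) for z in zones]

# For `x > 90_000`, a zone can only match if its max exceeds the threshold;
# every other zone is pruned without fetching its segments.
candidates = [i for i, (lo, hi) in enumerate(zone_map) if hi > 90_000]
print(f"reading {len(candidates)} of {len(zones)} zones")  # 3 of 13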

Vortex Files

Finally, we reach the bit where we actually write some bytes into a file. As we have now seen, much of Vortex's functionality is implemented over abstract concepts. This means the Vortex file format can focus on the few things that are specific to files:

  • Reduce round-trips to object storage: most Parquet readers perform 3 round trips to object storage before they can begin reading data. Vortex is designed to almost always perform 1 (and worst-case 2) round trips.

  • Efficient reads from object storage: by performing smart pre-fetching of segments that we believe might be needed during the scan, along with coalescing segment reads to avoid small range reads from high-latency storage (see the sketch after this list).

  • Efficient reads from SSD: by ensuring all segments are correctly aligned within the file, Vortex can make use of memory-mapping, direct I/O, and late materialization to minimize reads from disk. This is made possible by using modern lightweight compression instead of general-purpose block compressors like Snappy or zstd.
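
Read coalescing itself is a simple idea (a minimal sketch; the gap threshold here is illustrative, and the real scheduling in Vortex is more involved):

Python
def coalesce(ranges, max_gap=1 << 20):
    """Merge (start, end) byte ranges separated by less than max_gap bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged

# Three segment reads become two GETs: nearby ranges are fetched together.
segments = [(0, 4096), (5000, 9000), (10_000_000, 10_004_096)]
print(coalesce(segments))  # [(0, 9000), (10000000, 10004096)]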

Like most of Vortex, writing files is highly configurable by swapping out the write strategy. In fact, we recently accepted a contribution from Martin Loncaric (author of PCodec) to add a "compact" strategy that optimizes for compression ratio over read performance. In a few lines of code, Vortex produced files 32% smaller than Parquet for NY Taxi Data (although the regular BtrBlocks strategy was already 8% smaller!):

469M fhvhv_tripdata_2023-04.parquet (zstd compressed)
433M fhvhv_tripdata_2023-04_btrblocks.vortex
321M fhvhv_tripdata_2023-04_compact_inf.vortex

Plugins

All of the currently compile-time-extensible parts of Vortex (dtypes, arrays, layouts, compute functions, and expressions) will eventually be registrable at runtime from dynamically loaded libraries.

This will allow us to ship first-party extensions such as GeoVortex to provide specific encodings for geospatial data, along with the related push-down compute functions.

It also allows us to load plugins from other sources… such as WebAssembly! These plugins can even be embedded within Vortex files themselves. This could be used to provide forward compatibility with old readers, or to embed specialized compute kernels optimized up-front for known read-time access patterns.

Integrations

Vortex has thus far been integrated with the Apache Arrow, Apache DataFusion, Apache Spark, and DuckDB compute engines, with future work lined up to support cuDF and Polars.

We also have ongoing work alongside Microsoft's Gray Systems Lab and other members of the community to integrate Vortex into Apache Iceberg.

I don't want to get too bogged down in benchmarks, given how hard it is to draw accurate comparisons between such different systems. But to give a rough idea of performance, Microsoft was able to use Vortex as a drop-in replacement for Parquet in Iceberg and run TPC-DS SF=1000 on Spark, with a 30% reduction in runtime and a 20% reduction in storage.

What This All Enables

We are heavily invested in using Vortex within Spiral to build out our multi-modal data platform. But there are plenty of neat ideas from both ourselves and others for how Vortex can be used in the wild!

  • Columnar scans over abstract storage: ParadeDB have been looking into building a new analytics extension for Postgres that persists Vortex layouts into Postgres block storage.

  • Zero-copy LSM compaction: Microsoft have been exploring how Vortex performs against their LSM tree benchmark. With the ability to defer decompression, Vortex can sometimes optimize LSM compaction by simply moving compressed bytes untouched into the compacted file.

  • Read-optimized writer strategies: We saw how the compact strategy optimizes Vortex for compression ratio. But serious gains can be realized by optimizing at write-time for other known access patterns.

  • Late, late, materialization: Custom arrays could be created to allow for logical deferral of filter operations even across network boundaries. In the extreme, decompression can be deferred from S3, through the query engine, and all the way to the GPU.

  • Meta file format: Meta, Google, Snowflake, Firebolt, and likely many others are all known to have created custom columnar file formats. Many of these could have been modelled with Vortex abstractions, while leveraging shared improvements to I/O or other parts of the storage stack.

Why the Linux Foundation?

The composability of Vortex provides an excellent platform for file format research. But the only way I know to give a project of this scale a chance of succeeding is to build out a strong open-source community.

If we can solve the file format stagnation problem, we can consolidate work on future file formats, share advances in research, and avoid the duplicative work of creating file formats each time from the ground up.

We thus chose to donate Vortex to the Linux Foundation to continue its development under an open and collaborative governance model.

The Linux Foundation provides:

  • Neutral governance that ensures no single vendor controls the format

  • Long-term stability for enterprises building critical infrastructure

  • Clear IP frameworks that enable confident contributions

  • Ecosystem alignment with other critical data infrastructure projects

Getting Started & Contributing

Vortex is available now:

# Rust
cargo add vortex

# Python
pip install vortex-data

# CLI tool for exploring Vortex files
cargo install vortex-tui --locked
vx convert <file.parquet>
vx browse <file.vortex>

  • Current Status: Early but functional implementation

  • Contributing: We welcome contributions of code, research, and documentation. See CONTRIBUTING.md

  • Community: Join the conversation on GitHub, or reach out to hello@vortex.dev if you have an advanced use-case you would like assistance with.

The Road Ahead

As Vortex joins the Linux Foundation, we're excited about what comes next:

  • Supporting more languages, compute engines, and frameworks

  • Implementing GPU-direct decompression paths

  • Support for geospatial data with a GeoVortex plugin

  • Building domain-specific encodings through the plugin system

  • Exploring novel compression research through our extensible platform

But more than any specific feature, we're excited to see what the community builds. The extensible architecture means the innovations we haven't imagined yet are not just possible—they're inevitable.

Vortex is now a Linux Foundation project. For more information about project governance, see our Technical Charter.