Vortex: a Linux Foundation Project
Aug 6, 2025 · by Nicholas Gates and Will Manning · 16 min read
When we first started building Spiral, our object-store native data system, we
were faced with an interesting fundamental question: where does storage end and
compute begin?
Today, we're excited to announce that our answer to this question—Vortex—has
become a Linux Foundation Incubation stage project, backed by
Microsoft, Palantir, Snowflake, NVIDIA, Influx Data, ParadeDB, Polar Signals,
Wes McKinney, CMU’s Database Group, and other visionary data builders. You can
read more about it in the press release.
In the modern AI landscape, compute is far more heterogeneous than pure big data
analytics. In addition to regular SQL analytics workloads, there is demand for
high-throughput access to training data, and low-latency access for search &
retrieval. All of which has to support many modalities of data: from basic
strings and numbers, to images, videos, vectors, and even entirely unstructured
data like PDFs.
To tackle this, Spiral was designed to be object-store native. By this I mean
data primarily living in object storage, a low cost system of record, with means
to accelerate access when required.
Spiral encourages users to bring their own compute, whether that is traditional
SQL analytics with DataFusion, DuckDB, or Trino; tensor-based compute with
PyTorch, TensorFlow, or JAX; or general-purpose distributed frameworks like
Spark, Ray, and Dask.
With compute in Spiral being so democratized, we can see how the boundary
between storage and compute becomes quite blurred. Can every system perform an
effective push-down? How do we leverage ephemeral caches to accelerate
workloads? How can we maximize throughput from object storage and saturate GPU
bandwidth?
The answer to all of these, as you may have guessed by now, is Vortex.
Besides being a highly performant columnar file format, Vortex also provides
composable tools for persisting, transferring, and querying compressed columnar
data. The technical validation has been overwhelmingly positive: Microsoft,
Snowflake, Palantir, and NVIDIA are backing it. TUM's fabled database group just
released their latest paper, "AnyBlox", independently calling Vortex the
"cutting edge" in file formats.
Vortex matches Parquet's compression ratios with 10-20x faster scans, up to
5x faster writes, and 100x faster random access reads. Depending on the query
and the engine, it is no slower than Parquet, and often dramatically faster. But the real,
long-term breakthrough? Vortex is designed to support decoding data directly
from S3 to GPU, skipping the CPU bottleneck entirely.
The rest of this post goes into detail about what Vortex is, some of the awesome
things it enables, and how we see the project evolving, having recently donated
it to the Linux Foundation.
Vortex Architecture
It’s time for a whistle-stop tour of Vortex.
DTypes
As I've spoken about before, one of the first things we developed for Vortex was
the concept of logical data types, or "DTypes". In contrast to Apache Arrow, we
needed a way to separate the memory layout of an array from the logical data
type it represents.
There is nothing particularly exciting about this type system (and that’s a good
thing!). We are still missing a few core types such as fixed-length lists,
fixed-length binary data, as well as a variant type to represent arbitrary
nested JSON-like data.
One thing to note is that the Vortex StructDType supports lazy deserialization
from FlatBuffer format.
This allows Vortex to support ultra-wide schemas (>100k columns) with overhead
proportional to how many columns are actually used.
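To make the idea concrete, here is a minimal sketch of lazy field decoding. It is not Vortex's implementation: the `LazyStructDType` class is hypothetical, and JSON stands in for FlatBuffers so the example stays self-contained. Each field's dtype is kept as a serialized blob and only parsed the first time that field is requested. Note that this sketch still builds the name index eagerly.

```python
import json


class LazyStructDType:
    """Hypothetical struct dtype that decodes field dtypes only on first access."""

    def __init__(self, names, encoded_fields):
        self.names = list(names)
        self._encoded = list(encoded_fields)  # one serialized blob per field
        self._index = {name: i for i, name in enumerate(self.names)}
        self._decoded = {}  # cache of the fields decoded so far

    def field(self, name):
        i = self._index[name]
        if i not in self._decoded:
            # JSON is a stand-in for the real FlatBuffer decoding step.
            self._decoded[i] = json.loads(self._encoded[i])
        return self._decoded[i]

    @property
    def decoded_count(self):
        return len(self._decoded)


# A 100k-column schema where a query only touches two columns.
names = [f"col_{i}" for i in range(100_000)]
blobs = [json.dumps({"type": "i64", "nullable": False}).encode() for _ in names]
schema = LazyStructDType(names, blobs)

schema.field("col_42")
schema.field("col_99999")
print(schema.decoded_count)  # only the two accessed fields were decoded
```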
Arrays
A Vortex array is an in-memory object implementing the array trait. This
provides full scope to hold data in whatever format is desirable. This could be
flat Arrow arrays, compressed FSST string buffers, or even heap-allocated
Roaring Bitmaps.
To tame the chaos, Vortex defines a canonical array type for each logical dtype.
Almost all dtypes have a canonical form compatible with Apache Arrow. For
example, UTF8 data has a canonical form equivalent to Arrow's Variable-size
Binary View Layout.
This allows for easy interoperability with the Arrow ecosystem, and all arrays
must support conversion into this canonical form.
Compute Functions
The core physical operator you need to implement when plugging a file format
into a query engine is the Columnar Scan. You provide the engine some columns to
read, along with a filter predicate, and you receive back a stream of records.
The only way to make a scan faster is to push-down more, so that you can read
less. Formats like Parquet contain statistics and bloom filters to achieve this,
reducing the number of row groups read. The row groups that couldn't be pruned
would then be expanded into some memory representation, often Apache Arrow
record batches.
Vortex is no different, providing fine-grained statistics to reduce I/O
operations for filtered scans. But we took it a step further: filter-based pruning is great,
but what if we could execute arbitrary expressions directly on compressed
data? The network IO and memory bandwidth efficiencies would be significant.
Implementing it requires the kind of synergy between the in-memory and on-disk
format that only exists in Vortex.
Vortex holds a registry of compute functions that operate over some number of
array inputs and return an array output. There are plenty of built-in scalar
functions, such as boolean logic, numeric comparisons, even string pattern
matching (the LIKE operator in SQL). There are also several aggregation
functions such as min, max, and sum.
All compute functions define an implementation over canonical arrays, as well as
dispatching their internal logic to a registry of kernels. These kernels allow
each array to override the implementation of a compute function in case they are
able to perform it more efficiently than first converting into canonical form.
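The dispatch pattern described above can be sketched in a few lines. This is an illustration only, not Vortex's API: the class names, the `KERNELS` dictionary, and the `kernel` decorator are all hypothetical. The compute function first looks for a kernel registered for the array's encoding and otherwise falls back to the canonical implementation.

```python
class PrimitiveArray:
    """Canonical (uncompressed) form in this sketch."""

    def __init__(self, values):
        self.values = list(values)

    def to_canonical(self):
        return self


class RunEndArray:
    """Run-end encoding: values[i] repeats up to (but excluding) row ends[i]."""

    def __init__(self, ends, values):
        self.ends = list(ends)
        self.values = list(values)

    def to_canonical(self):
        out, start = [], 0
        for end, value in zip(self.ends, self.values):
            out.extend([value] * (end - start))
            start = end
        return PrimitiveArray(out)


KERNELS = {}  # (function name, array type) -> specialised implementation


def kernel(fn_name, array_type):
    def register(impl):
        KERNELS[(fn_name, array_type)] = impl
        return impl
    return register


def compute_sum(array):
    impl = KERNELS.get(("sum", type(array)))
    if impl is not None:
        return impl(array)  # encoding-specific kernel, no decompression
    return sum(array.to_canonical().values)  # canonical fallback


@kernel("sum", RunEndArray)
def sum_run_end(array):
    # Multiply each run's value by its length instead of expanding the runs.
    total, start = 0, 0
    for end, value in zip(array.ends, array.values):
        total += value * (end - start)
        start = end
    return total


runs = RunEndArray(ends=[3, 5, 9], values=[10, 20, 1])
print(compute_sum(runs), compute_sum(PrimitiveArray([1, 2, 3])))
```

The kernel touches three runs rather than nine rows, and an encoding without a registered kernel still works through the canonical path.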
This push-down compute allows Vortex to defer decompression in a surprisingly
large number of cases. For example, all scalar functions operating over
dictionary arrays are forwarded to the unique dictionary values, and
reconstructed as a dictionary array using the same untouched (and possibly
compressed) codes.
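A minimal sketch of the dictionary case, using a hypothetical `DictArray` class rather than Vortex's real types: the scalar function runs once per unique value, and the codes are reused as-is.

```python
from dataclasses import dataclass


@dataclass
class DictArray:
    codes: list   # one small integer per row, indexing into `values`
    values: list  # the unique dictionary values

    def decode(self):
        return [self.values[code] for code in self.codes]


def apply_scalar(fn, array):
    # Apply fn to each unique value only; the codes are passed through untouched.
    return DictArray(codes=array.codes, values=[fn(v) for v in array.values])


cities = DictArray(codes=[0, 1, 0, 2, 1, 0], values=["london", "paris", "tokyo"])
upper = apply_scalar(str.upper, cities)

print(upper.values)                 # three calls to str.upper, not six
print(upper.codes is cities.codes)  # the same codes object is reused
```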
Compute push-down gets far more interesting in cases where we can push compute
into the compressed domain. For example, I wrote about how we can see an 80%
speedup when performing comparisons against floating point data by pushing the
operation through ALP encoding into the integer domain.
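The sketch below shows the general shape of such a push-down under simplifying assumptions. It is a toy version of ALP-style encoding (one decimal exponent, no exception handling for values that fail to round-trip), and the function names are hypothetical. If the comparison constant itself round-trips through the encoding, the comparison can run entirely on the integers; otherwise the sketch falls back to decoding.

```python
def alp_encode(values, exponent):
    """Encode floats as integers scaled by 10**exponent (toy ALP, no exceptions)."""
    scale = 10 ** exponent
    encoded = [round(v * scale) for v in values]
    if any(d / scale != v for d, v in zip(encoded, values)):
        raise ValueError("a value does not round-trip at this exponent")
    return encoded


def alp_decode(encoded, exponent):
    scale = 10 ** exponent
    return [d / scale for d in encoded]


def less_than(encoded, exponent, constant):
    scale = 10 ** exponent
    c = round(constant * scale)
    if c / scale == constant:
        # The constant is exactly encodable, so compare in the integer domain.
        return [d < c for d in encoded]
    # Otherwise fall back to decoding and comparing floats.
    return [v < constant for v in alp_decode(encoded, exponent)]


prices = [1.5, 2.25, 3.0, 0.1]
encoded = alp_encode(prices, exponent=2)

print(encoded)
print(less_than(encoded, 2, 2.25))   # integer-domain comparison
print(less_than(encoded, 2, 2.255))  # constant not encodable, decode fallback
```

The integer path is only valid because every encoded value, and the constant, decode back to exactly the float they came from, so integer order and float order agree.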
Layouts
Where arrays represent in-memory data in Vortex, layouts represent out-of-memory
data. The name comes from how layouts determine the partitioning of large arrays
into smaller chunks before they are laid out in a file. But the concept
generalizes regardless of where the underlying byte buffers (called segments)
are stored.
Vortex includes three basic layouts: struct, chunked, and flat. These represent
columnar partitioning, row-wise partitioning, and array-based leaf nodes
respectively. By combining these into a layout tree, Vortex is able to model
many different partitioning strategies of columnar data.
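As a rough model of how these three layouts compose (hypothetical classes, not Vortex's actual types), the same table can be described column-first or row-group-first just by nesting the nodes differently. The `segments_for` helper walks the tree and lists the segments needed for a set of top-level columns.

```python
from dataclasses import dataclass


@dataclass
class FlatLayout:
    segment_id: int  # leaf node: one array stored in one segment
    row_count: int


@dataclass
class ChunkedLayout:
    chunks: list  # row-wise partitioning

    @property
    def row_count(self):
        return sum(chunk.row_count for chunk in self.chunks)


@dataclass
class StructLayout:
    fields: dict  # columnar partitioning: column name -> child layout

    @property
    def row_count(self):
        return next(iter(self.fields.values())).row_count


def segments_for(layout, columns=None):
    """Segment ids needed to read the given top-level columns (None = all)."""
    if isinstance(layout, FlatLayout):
        return [layout.segment_id]
    if isinstance(layout, ChunkedLayout):
        return [s for chunk in layout.chunks for s in segments_for(chunk, columns)]
    selected = [child for name, child in layout.fields.items()
                if columns is None or name in columns]
    return [s for child in selected for s in segments_for(child)]


# Column-major: each column is chunked independently.
column_major = StructLayout({
    "a": ChunkedLayout([FlatLayout(0, 100), FlatLayout(1, 100)]),
    "b": ChunkedLayout([FlatLayout(2, 200)]),
})

# Row-group style: chunks of rows, each holding every column.
row_groups = ChunkedLayout([
    StructLayout({"a": FlatLayout(0, 100), "b": FlatLayout(1, 100)}),
    StructLayout({"a": FlatLayout(2, 100), "b": FlatLayout(3, 100)}),
])

print(segments_for(column_major, {"a"}), segments_for(row_groups, {"a"}))
```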
By abstracting layouts from the Vortex file format itself, it's actually
possible to leverage dtypes, compute push-down, late materialization, and many
other Vortex features without persisting data to a file at all. For example, one
of our early adopters ParadeDB is investigating storing Vortex layouts in
Postgres block storage. Similarly, this allows us to easily plug different
strategies for segment caching, including in-memory, on-disk, and even remote
caches.
Scanning
The most common operation we perform in Vortex is scanning. This involves
reading a layout tree into a stream of array chunks using a filter expression to
select rows.
Almost all implementations of scanning that you find in the wild will accept
some form of filter expression. This is known as predicate push-down and often
enables you to prune (avoid reading) large portions of data based on statistics
alone.
Python
import pyarrow.parquet as pq
pq.read_table("test.parquet", filters=[("a", "<", 10)])
Most implementations of scanning allow you to select which columns to return
in the result stream. This is known as projection push-down and allows file
formats to prune yet more data, fetching only the requested columns.
Python
pq.read_table("test.parquet", columns=["a", "b"], filters=[("a", "<", 10)])
Vortex takes this further and supports what we are calling expression
push-down. Instead of a projection mask, the Vortex scan takes an arbitrary
projection expression.
This allows Vortex to leverage push-down compute kernels to perform as much
scalar compute as possible in compressed space, before returning data to the SQL
engine in its native form.
Evaluation Nodes
Vortex is able to achieve very fast scan performance in part because it models a
scan as a tree of evaluation nodes. This is conceptually similar to a physical
plan in a regular SQL engine, except it is designed to better support
late-materialization.
At the beginning of a scan, the layout tree is mapped into an equivalent tree of
evaluation nodes. Each node is given the filter or projection expression as well
as a row range and is able to transform the expression before passing it to its
children.
This concept allows us to implement lots of interesting things!
The first of these is our "filter evaluation" node. This node splits the
expression into conjunctive normal form and evaluates each conjunct one by one.
After each evaluation, the selectivity of the conjunct is measured and the order
of evaluation is updated. This allows Vortex to dynamically optimize the
evaluation order of predicates, often allowing us to entirely short-circuit some
evaluations (and therefore some data fetching).
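A simplified sketch of this adaptive ordering follows. The `AdaptiveFilter` class is hypothetical and works on rows of Python dicts rather than compressed arrays, and selectivity is measured only on the rows that reach each conjunct, but it shows the feedback loop: measure how many rows each conjunct keeps, run the most selective one first on the next chunk, and stop as soon as no rows remain.

```python
class AdaptiveFilter:
    """Evaluates a conjunction, reordering conjuncts by observed selectivity."""

    def __init__(self, conjuncts):
        self.conjuncts = conjuncts
        self.passed = [0] * len(conjuncts)  # rows each conjunct has kept
        self.seen = [0] * len(conjuncts)    # rows each conjunct has evaluated

    def _selectivity(self, i):
        # Fraction of rows kept so far; 0.5 is an arbitrary prior before any data.
        return self.passed[i] / self.seen[i] if self.seen[i] else 0.5

    def order(self):
        # Most selective (smallest kept fraction) first.
        return sorted(range(len(self.conjuncts)), key=self._selectivity)

    def evaluate(self, rows):
        selected = list(range(len(rows)))
        for i in self.order():
            if not selected:
                break  # short-circuit: remaining conjuncts never run
            kept = [r for r in selected if self.conjuncts[i](rows[r])]
            self.seen[i] += len(selected)
            self.passed[i] += len(kept)
            selected = kept
        return selected


conjuncts = [
    lambda row: row["a"] >= 0,  # keeps everything
    lambda row: row["b"] == 7,  # keeps very little
]
scan_filter = AdaptiveFilter(conjuncts)

chunk1 = [{"a": i, "b": i % 10} for i in range(100)]
chunk2 = [{"a": i, "b": 3} for i in range(100)]

first = scan_filter.evaluate(chunk1)   # evaluated in the original order
second = scan_filter.evaluate(chunk2)  # b == 7 runs first and removes every row
print(len(first), len(second), scan_filter.order())
```

In a real scan, skipping a conjunct also means skipping the fetch and decode of the column it refers to, which is where most of the savings come from.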
The second example is our zone map evaluation node. This node prunes large
portions of the row mask by leveraging zone-based statistics. Unlike Parquet and
other columnar formats, the Vortex zone map is stored based on logical 8k
chunks, rather than being aligned to the physical page size. This allows for
efficient intra-chunk pruning even in the absence of a row-group based layout
strategy.
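Zone-map pruning can be sketched as follows. The helper names are hypothetical, and the zone size is shrunk to 4 rows so the example is readable. Each zone stores a min and max, and a predicate such as `a < 10` only needs to look at zones whose minimum is below the constant.

```python
def build_zone_map(values, zone_rows):
    """One (min, max) pair per fixed-size logical zone of rows."""
    zones = []
    for start in range(0, len(values), zone_rows):
        zone = values[start:start + zone_rows]
        zones.append((min(zone), max(zone)))
    return zones


def prune_less_than(zone_map, constant, zone_rows, row_count):
    """Row ranges [start, end) that may contain a value < constant."""
    ranges = []
    for z, (lo, _hi) in enumerate(zone_map):
        if lo < constant:  # otherwise no row in this zone can match
            start = z * zone_rows
            ranges.append((start, min(start + zone_rows, row_count)))
    return ranges


ZONE_ROWS = 4  # tiny for illustration only
a = [1, 3, 2, 4, 12, 15, 11, 19, 9, 30, 25, 40, 50, 60]

zone_map = build_zone_map(a, ZONE_ROWS)
candidates = prune_less_than(zone_map, 10, ZONE_ROWS, len(a))
print(zone_map)
print(candidates)  # only these row ranges need to be fetched and filtered
```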
Vortex Files
Finally, we reach the bit where we actually write some bytes into a file. As we
have now seen, much of Vortex's functionality is implemented over abstract
concepts. This means the Vortex file format can focus on the few things that are
specific to files:
- Reduce round-trips to object storage: most Parquet readers perform 3 round trips to object storage before they can begin reading data. Vortex is designed to almost always perform 1 (and worst-case 2) round trips.
- Efficient reads from object storage: by performing smart pre-fetching of segments that we believe might be needed during the scan, along with coalescing segment reads to avoid small range reads from high latency storage.
- Efficient reads from SSD: by ensuring all segments are correctly aligned within the file, Vortex can make use of memory-mapping, direct I/O, and late materialization to minimize reads from disk. This is made possible by using modern light-weight compression instead of general-purpose block compressors like Snappy or zstd.
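The read coalescing mentioned in the list above can be illustrated with a small helper. This is a generic sketch, not Vortex's I/O code, and the `max_gap` threshold is an assumed tuning parameter: byte ranges that sit close together are merged into one larger request, trading a few wasted bytes for fewer round trips.

```python
def coalesce(ranges, max_gap):
    """Merge half-open byte ranges [start, end) separated by at most max_gap bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


segments = [(5210, 6000), (0, 100), (120, 300), (5000, 5200)]
requests = coalesce(segments, max_gap=64)
print(requests)  # four segment reads become two range requests
```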
Like most of Vortex, writing files is also extremely configurable by swapping
out the write strategy. In fact, we recently accepted a contribution from Martin
Loncaric (author of PCodec) to add a "compact" strategy that optimizes for
compression ratio over read performance. In a few lines of code, Vortex produced
files 32% smaller than Parquet for NY Taxi Data (although the regular BtrBlocks
strategy was already 8% smaller!).
469M fhvhv_tripdata_2023-04.parquet (zstd compressed)
433M fhvhv_tripdata_2023-04_btrblocks.vortex
321M fhvhv_tripdata_2023-04_compact_inf.vortex
Plugins
All of the current compile-time extensible parts of Vortex (dtypes, arrays,
layouts, compute functions, and expressions) will eventually be able to be
registered at runtime from dynamically loaded libraries.
This will allow us to ship first-party extensions such as GeoVortex to provide
specific encodings for geospatial data, along with the related push-down compute
functions.
It also allows us to load plugins from other sources… such as WebAssembly! These
plugins can even be embedded within Vortex files themselves. This could be used
to provide forward compatibility with old readers, or to embed specialized
compute kernels optimized up-front for known read-time access patterns.
Integrations
Vortex has thus far been integrated with the Apache Arrow, Apache DataFusion,
Apache Spark, and DuckDB compute engines, with future work lined up to support
cuDF and Polars.
We also have ongoing work alongside the Microsoft Gray Systems Lab and other
members of the community to integrate Vortex into Apache Iceberg.
I don't want to get too bogged down by benchmarks, given how hard it is to draw
accurate comparisons between such different systems. But to give a rough idea of
performance, Microsoft was able to use Vortex as a drop-in replacement for
Parquet in Iceberg and run TPC-DS SF=1000 on Spark, with a 30% reduction in
runtime and a 20% reduction in storage.
What This All Enables
We are heavily invested in using Vortex within Spiral to build out our
multi-modal data platform. But there are plenty of neat ideas from both
ourselves and others for how Vortex can be used in the wild!
- Columnar scans over abstract storage: ParadeDB have been looking into building a new analytics extension for Postgres that persists Vortex layouts into Postgres block storage.
- Zero-copy LSM compaction: Microsoft have been exploring how Vortex performs against their LSM tree benchmark. With the ability to defer decompression, Vortex can sometimes optimize LSM compaction by simply moving compressed bytes untouched into the compacted file.
- Read-optimized writer strategies: We saw how the compact strategy optimizes Vortex for compression ratio. But serious gains can be realized by optimizing at write-time for other known access patterns.
- Late, late, materialization: Custom arrays could be created to allow for logical deferral of filter operations even across network boundaries. In the extreme, decompression can be deferred from S3, through the query engine, and all the way to the GPU.
- Meta file format: Meta, Google, Snowflake, Firebolt, and likely many others are all known to have created custom columnar file formats. Many of these could have been modelled with Vortex abstractions, while leveraging shared improvements to I/O or other parts of the storage stack.
Why the Linux Foundation?
The composability of Vortex provides an excellent platform for file format
research. But the only way I know to give a project of this scale a chance of
succeeding is to build out a strong open-source community.
If we can solve the file format stagnation problem, we can consolidate work on
future file formats, share advances in research, and avoid the duplicative work
of creating file formats each time from the ground up.
We thus chose to donate Vortex to the Linux Foundation to continue its
development under an open and collaborative governance model.
The Linux Foundation provides:
- Neutral governance that ensures no single vendor controls the format
- Long-term stability for enterprises building critical infrastructure
- Clear IP frameworks that enable confident contributions
- Ecosystem alignment with other critical data infrastructure projects
Getting Started & Contributing
Vortex is available now:
# Rust
cargo add vortex
# Python
pip install vortex-data
# CLI tool for exploring Vortex files
cargo install vortex-tui --locked
vx convert <file.parquet>
vx browse <file.vortex>
- Current Status: Early but functional implementation
- Contributing: We welcome contributions of code, research, and documentation. See CONTRIBUTING.md
- Community: Join the conversation on GitHub, or reach out to hello@vortex.dev if you have an advanced use-case you would like assistance with.
The Road Ahead
As Vortex joins the Linux Foundation, we're excited about what comes next:
- Supporting more languages, compute engines, and frameworks
- Implementing GPU-direct decompression paths
- Support for geospatial data with a GeoVortex plugin
- Building domain-specific encodings through the plugin system
- Exploring novel compression research through our extensible platform
But more than any specific feature, we're excited to see what the community
builds. The extensible architecture means the innovations we haven't imagined
yet are not just possible—they're inevitable.
Vortex is now a Linux Foundation project. For more information about project
governance, see our Technical Charter.