Announcing Spiral
Sep 11, 2025 · by Will Manning · 10 min read
I've been building data systems for long enough to be skeptical of
"revolutionary" claims, and I'm uncomfortable with grandiose statements like
"Built for the AI Era". Nevertheless, AI workloads have tipped us into what I'll
call the Third Age of data systems, and legacy platforms can't meet the moment.
Three eras of data systems
In the beginning, databases had human-scale inputs and human-scale outputs.
Postgres (the king of databases, first released in 1989[1]) is the archetypal
application database. A trivial example of a core Postgres workflow is letting a
user create a profile, view it, and then update the email address. Postgres
needs to support many users doing so at the same time, but it was built for a
world in which the rate of database writes was implicitly limited by humans
taking discrete actions.
Then came the age of "Big Data", when we automated data collection at "web
scale", with much more granular events. Early internet giants scraped every link
on the entire internet and captured every click on their websites. For data
systems, this was the dawn of machine-scale inputs. However, the only way for
a human to engage with this machine-collected data was to distill it down—into a
dashboard, a chart, or even a single number. The inputs to a data system might
have been in petabytes, but the end products were still measurable in kilobytes.
This unprecedented scale of data collection also led to a technological schism:
on one side, we saw the rise of data lakes, massive shared filesystems where we
would dump files and run MapReduce jobs. On the other side were (cloud) data
warehouses, which provided both scalability and ergonomics for simple data types
like dates, numbers, and short text. The two branches eventually converged
into "the Lakehouse", wherein the descendants of Hadoop discovered that
tables were useful all along.[2]
We're now in the middle of another shift: the rise of the "machine consumer". In
addition to machine-scale inputs, future data systems must be able to produce
machine-scale outputs. Editing a few rows or aggregating a few simple columns
is no longer enough. Machines don't want dashboards & summaries—they want
everything.
What machines want
When I say machines want "everything," let me be specific. An NVIDIA H100 has
enough memory bandwidth to consume 4 million 100KiB images per second. A Monte
Carlo tree search might need to perform billions of random reads across your
entire dataset. Machines want to perform fast scans, fast point lookups, and
fast searches over petabyte—or exabyte!—scale data.
This is fundamentally different from the Second Age, when we optimized for
human-friendly aggregations and reports. And here's where current infrastructure
breaks down: for individual objects between roughly 1KB and 25MB, Parquet files
and object storage are both badly suited to the workload. Stored individually and assuming 50ms of S3
latency, reading 4 million individual 100KiB images—enough to saturate the H100
for one second—would accrue 55 hours of aggregate network overhead. Vector
embeddings, small images, large documents—these are exactly what AI systems
need, and exactly what current systems handle poorly.
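The arithmetic behind those figures can be checked with a short sketch. It assumes that 100KiB means 102,400 bytes, that every image is fetched with its own request, and that the 50ms latencies are simply summed with no concurrency. The function names are illustrative only.

```python
KIB = 1024  # bytes per KiB


def required_bandwidth_bytes_per_s(images_per_s: int, image_bytes: int) -> int:
    """Bytes per second needed to deliver the given number of images each second."""
    return images_per_s * image_bytes


def aggregate_latency_hours(num_requests: int, latency_s: float) -> float:
    """Total per-request latency, summed across all requests, in hours."""
    return num_requests * latency_s / 3600


images_per_s = 4_000_000   # figure quoted in the text
image_bytes = 100 * KIB    # 100KiB per image
latency_s = 0.050          # assumed 50ms per S3 request, as in the text

bandwidth = required_bandwidth_bytes_per_s(images_per_s, image_bytes)
hours = aggregate_latency_hours(images_per_s, latency_s)

print(f"{bandwidth / 1e9:.1f} GB/s")         # 409.6 GB/s
print(f"{bandwidth * 8 / 1e12:.2f} Tbit/s")  # 3.28 Tbit/s
print(f"{hours:.1f} hours")                  # 55.6 hours
```

Summed serially, one second's worth of images costs about 55.6 hours of request latency, which is the "55 hours" quoted above. Real clients issue requests concurrently, so this is aggregate overhead rather than wall-clock time.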
Symptoms of the same disease
Second Age tools don't give teams the right abstractions for Third Age
workloads. Forced to assemble systems from low-level primitives, teams pay in
two predictable places:
First, price-performance. Your AI engineers are stuck in a Sisyphean loop:
Read Parquet → Explode to Arrow (10x memory) → Convert to tensors → Cache
intermediate results → (Finally) train → Repeat. Five steps to do what should be
simple: feed data to a GPU. Meanwhile, that H100 capable of consuming 4 million
images per second sits idle ~70% of the time. Your even-more-expensive AI
Engineer is manually shepherding each iteration (and possibly hoping for Zuck to
show up with $1B).
Second, security. Simon Willison recently noted that Supabase's MCP connector
can leak your entire database to anyone who can manipulate prompts.
Teams need to move fast. They need to experiment, iterate, and ship. But when
their foundational needs aren't met, they duct-tape solutions together. Database
credentials get passed to AI agents. S3 bucket permissions get opened too wide.
Audit logs are a fiction.
Security is a performance multiplier. Every hack you ship today is technical
debt you'll pay 10x to fix later. Every permissions model you bypass is a
multitenant feature you can't build. The same missing primitives that force
performance workarounds make security nearly impossible to bolt on later.
Both problems, poor price-performance and weak security, stem from the same
shortfall: missing abstractions. The tragedy is that today's infrastructure
forces teams to choose between speed and safety at all.
We're not the first to notice
Of course, we're not the first to recognize these problems. Smart people have
been trying to bridge this gap.
The "Lakehouse" is the right high-level idea—object storage native is indeed the
future.[3] But it's still duct-taping together a data lake and a data warehouse,
duck-typing files as tables without fundamentally solving the unified storage
problem. You end up managing multiple components with different permission
models, different APIs, and different performance characteristics. As Ali Ghodsi
likes to call it, you're managing a "data estate"—and like many estates, it's
expensive, messy, and full of relics & the occasional skeleton.
WebDataset solved an immediate need for AI teams, but it's essentially
equivalent to CSV files for the deep learning era: convenient for simple cases
but lacking the performance, expressivity, & governance features that production
systems demand.
These are all good Second Age solutions trying to solve Third Age problems. But
Third Age problems need Third Age infrastructure—built from the ground up with
machines as the primary consumer.
When OpenAI processes billions of images or Anthropic trains on massive document
collections, they're not using traditional data warehouses. They've built custom
infrastructure because they had to.
Working around these problems wasn't going to be enough. They required
rethinking the architecture.
Data systems must evolve.
Building for what comes next
We started Spiral to take that next evolutionary step. Adapting legacy systems
wasn't going to cut it; we needed to design for machine consumption from day
one.
First, we created Vortex, a state-of-the-art columnar file format, and donated
it to the Linux Foundation. Microsoft, Snowflake, and Palantir are backing it.
TUM's fabled database group just released their latest paper "Anyblox",
independently calling Vortex the "cutting edge" in file formats.
Vortex achieves Parquet's compression ratios with 10-20x faster scans, 5-10x
faster writes, and 100-200x faster random access reads (1.5 milliseconds vs
Parquet's 200 milliseconds). Depending on the query and the engine, it is no
slower, and often dramatically faster. The longer-term
breakthrough is more architectural: Vortex is designed to decode data directly
from S3 to GPU, skipping the CPU bottleneck.
Spiral is our data platform built on Vortex: object store native from day one;
unified governance across all data types; machine-scale throughput that actually
saturates your GPUs; and one API for embeddings, images, and video. All with
what we call "fearless permissioning": move as fast as you want without
compromising security, because the right primitives are built in.[4]
Remember that uncanny valley between 1KB and 25MB? The sizes aren't the problem.
Second Age systems force you to choose between two bad options: inline the data
(killing performance) or store pointers (breaking governance). Spiral eliminates
this false choice. We store 10KB embeddings directly in Vortex for microsecond
access, intelligently batch 10MB blocks of images for optimal S3 throughput, and
externalize 4GB videos without copying a single byte.
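One way to picture that size-based placement is the sketch below. It is not Spiral's implementation: the `choose_placement` and `batch_into_blocks` helpers are hypothetical, as is the 64KB inline cut-off. The 25MB external cut-off is borrowed from the top of the "uncanny valley" range and the 10MB block target from the figure above, both chosen only to mirror the numbers in the text.

```python
from enum import Enum
from typing import Iterable, List

KB = 1_000
MB = 1_000_000
GB = 1_000_000_000

# Hypothetical cut-offs, chosen only to match the examples in the text.
INLINE_MAX_BYTES = 64 * KB     # small values (e.g. 10KB embeddings) stay inline
EXTERNAL_MIN_BYTES = 25 * MB   # very large values (e.g. 4GB videos) are referenced in place
TARGET_BLOCK_BYTES = 10 * MB   # mid-sized values are packed into roughly 10MB blocks


class Placement(Enum):
    INLINE = "inline"
    BATCHED = "batched"
    EXTERNAL = "external"


def choose_placement(size_bytes: int) -> Placement:
    """Pick a storage strategy from the object's size alone."""
    if size_bytes <= INLINE_MAX_BYTES:
        return Placement.INLINE
    if size_bytes >= EXTERNAL_MIN_BYTES:
        return Placement.EXTERNAL
    return Placement.BATCHED


def batch_into_blocks(sizes: Iterable[int],
                      target: int = TARGET_BLOCK_BYTES) -> List[List[int]]:
    """Greedily pack object sizes, in order, into blocks of at most `target` bytes.

    An object larger than `target` ends up in a block by itself.
    """
    blocks: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for size in sizes:
        # Close the current block if adding this object would overflow it.
        if current and current_bytes + size > target:
            blocks.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        blocks.append(current)
    return blocks


print(choose_placement(10 * KB).value)            # inline
print(choose_placement(100 * KB).value)           # batched
print(choose_placement(4 * GB).value)             # external
print(len(batch_into_blocks([100 * KB] * 250)))   # 3
```

With these assumed thresholds, 250 images of 100KB each pack into three blocks (100, 100, and 50 images), so a reader would issue three large object-store requests instead of 250 small ones.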
When you stop pretending machines are just very fast humans, the entire
architecture inverts. Throughput becomes the critical constraint. Object storage
is the foundation. Security is unified rather than bolted on.
With $22 million in Seed & Series A funding from Amplify Partners & General
Catalyst, we're building the infrastructure the Third Age of Data needs.
What Spiral delivers
In practice:
- That H100 capable of consuming 4 million images per second? With Spiral, it actually can.
- Sharing data without the security nightmare? Solved with time-bounded, audited, granular permissions.
- The five-step data loading dance? A single query.
- Your AI engineers? Actually working on AI.
The future is machine scale
We're tackling how to work with complex data at machine scale. A modern GPU
can consume terabits of data per second, with access patterns that existing
systems aren't built for. Whether you're loading data from object storage into a GPU for
pre-training, doing millions of concurrent point reads for Monte Carlo tree
search, or trying to wrangle data that other systems can't, we're building for
you.
We're working with design partners across computer vision, robotics, and
multimodal AI. If you're spending more than 10% of your time on data
infrastructure instead of model development, we should talk.
I started by saying I'm skeptical of revolutionary claims. But when the
revolution is already here—when your GPUs are starving and you're drowning in
data—skepticism becomes denial.
The future doesn't care if you're ready. But we do.
Join us or reach out (hello at spiraldb dot com).
—
P.S. If you're still managing data in spreadsheets, this post isn't for you.
Yet.
[1] Taylor Swift, Postgres, and I are all roughly the same age. Coincidentally, I don't believe in dates before ~1989.
[2] Tables are apparently like low rise jeans: cool in 2005 and back with a vengeance.
[3] See e.g., Warpstream, Turbopuffer, and SlateDB for great examples of how powerful object storage native architectures can be.
[4] I tried not to mention "written in Rust", but yes, this term is inspired by fearless concurrency.