Announcing Spiral
Sep 11, 2025 · by Will Manning · 10 min read
I've been building data systems for long enough to be skeptical of “revolutionary” claims, and I’m uncomfortable with grandiose statements like “Built for the AI Era”. Nevertheless, AI workloads have tipped us into what I'll call the Third Age of data systems, and legacy platforms can't meet the moment.
Three eras of data systems
In the beginning, databases had human-scale inputs and human-scale outputs. Postgres—the king of databases, first released in 1989[1]—is the archetypal application database. A trivial example of a core Postgres workflow is letting a user create a profile, view it, and then update their email address. Postgres needs to support many users doing so at the same time, but it was built for a world in which the rate of database writes was implicitly limited by humans taking discrete actions.
Then came the age of "Big Data", when we automated data collection at "web scale", with much more granular events. Early internet giants scraped every link on the entire internet and captured every click on their websites. For data systems, this was the dawn of machine-scale inputs. However, the only way for a human to engage with this machine-collected data was to distill it down—into a dashboard, a chart, or even a single number. The inputs to a data system might have been in petabytes, but the end products were still measurable in kilobytes.
This unprecedented scale of data collection also led to a technological schism: on one side, we saw the rise of data lakes, massive shared filesystems where we would dump files and run MapReduce jobs. On the other side were (cloud) data warehouses, which provided both scalability and ergonomics for simple data types like dates, numbers, and short text. This branching eventually converged into "the Lakehouse", wherein the descendants of Hadoop discovered that tables were useful all along.[2]
We're now in the middle of another shift: the rise of the "machine consumer". In addition to machine-scale inputs, future data systems must be able to produce machine-scale outputs. Editing a few rows or aggregating a few simple columns is no longer enough. Machines don't want dashboards & summaries—they want everything.
What machines want
When I say machines want "everything," let me be specific. An NVIDIA H100 has enough memory bandwidth to consume 4 million 100KiB images per second. A Monte Carlo tree search might need to perform billions of random reads across your entire dataset. Machines want to perform fast scans, fast point lookups, and fast searches over petabyte- or even exabyte-scale data.
This is fundamentally different from the Second Age, when we optimized for human-friendly aggregations and reports. And here's where current infrastructure breaks down: objects between roughly 1KB and 25MB fall into an uncanny valley where Parquet files and object storage are both badly suited to the workload. Stored individually and assuming 50ms of S3 latency, reading 4 million individual 100KiB images—enough to saturate the H100 for one second—would accrue 55 hours of aggregate network overhead. Vector embeddings, small images, large documents—these are exactly what AI systems need, and exactly what current systems handle poorly.
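If you want to check the arithmetic behind those two figures, a quick back-of-the-envelope version looks like this (nothing Spiral-specific, just the numbers from above):

```python
# Back-of-the-envelope check of the figures above.
KiB = 1024

images_per_second = 4_000_000          # the H100 figure from above
image_size_bytes = 100 * KiB           # 100KiB per image

# Throughput needed to feed the GPU at that rate.
bytes_per_second = images_per_second * image_size_bytes
print(f"{bytes_per_second / 1e12:.2f} TB/s")                    # ~0.41 TB/s

# Aggregate latency if each image is its own S3 GET at ~50ms per request.
s3_latency_s = 0.050
print(f"{images_per_second * s3_latency_s / 3600:.1f} hours")   # ~55.6 hours
```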
Symptoms of the same disease
Second Age tools don't give teams the right abstractions for Third Age workloads. Forced to assemble systems from low-level primitives, teams pay in two predictable places:
First, price-performance. Your AI engineers are stuck in a Sisyphean loop: Read Parquet → Explode to Arrow (10x memory) → Convert to tensors → Cache intermediate results → (Finally) train → Repeat. Five steps to do what should be simple: feed data to a GPU. Meanwhile, that H100 capable of consuming 4 million images per second sits idle ~70% of the time. Your even-more-expensive AI engineer is manually shepherding each iteration (and possibly hoping for Zuck to show up with $1B).
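For the sake of concreteness, here is a minimal sketch of that loop; the file name, column name, and cache path are assumptions for illustration, not any particular team's pipeline:

```python
# Illustrative sketch of the five-step loop; paths and column names are hypothetical.
import numpy as np
import pyarrow.parquet as pq
import torch

# 1. Read Parquet from disk or object storage.
table = pq.read_table("train_shard_0000.parquet")

# 2. Explode to Arrow in memory (decoded and decompressed, often ~10x the file size).
embeddings = table.column("embedding").to_pylist()

# 3. Convert to tensors.
batch = torch.as_tensor(np.asarray(embeddings, dtype=np.float32))

# 4. Cache the intermediate result so the next epoch can skip steps 1-3.
torch.save(batch, "/tmp/train_shard_0000.pt")

# 5. (Finally) train.
# loss = model.training_step(batch)
```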
Second, security. Simon Willison recently noted that Supabase's MCP connector can leak your entire database to anyone who can manipulate prompts.
Teams need to move fast. They need to experiment, iterate, and ship. But when their foundational needs aren’t met, they duct-tape solutions together. Database credentials get passed to AI agents. S3 bucket permissions get opened too wide. Audit logs are a fiction.
Security is a performance multiplier. Every hack you ship today is technical debt you'll pay 10x to fix later. Every permissions model you bypass is a multitenant feature you can't build. The same missing primitives that force performance workarounds make security nearly impossible to bolt on later.
Both stem from the same shortfall: missing abstractions. The tragedy is that today's infrastructure forces teams to choose between speed and safety at all.
We’re not the first to notice
Of course, we're not the first to recognize these problems. Smart people have been trying to bridge this gap.
The “Lakehouse” is the right high-level idea—object storage native is indeed the future.[3] But it's still duct-taping together a data lake and a data warehouse, duck-typing files as tables without fundamentally solving the unified storage problem. You end up managing multiple components with different permission models, different APIs, and different performance characteristics. As Ali Ghodsi likes to call it, you're managing a "data estate"—and like many estates, it's expensive, messy, and full of relics & the occasional skeleton.
WebDataset solved an immediate need for AI teams, but it's essentially equivalent to CSV files for the deep learning era: convenient for simple cases but lacking the performance, expressivity, & governance features that production systems demand.
These are all good Second Age solutions trying to solve Third Age problems. But Third Age problems need Third Age infrastructure—built from the ground up with machines as the primary consumer.
When OpenAI processes billions of images or Anthropic trains on massive document collections, they're not using traditional data warehouses. They've built custom infrastructure because they had to.
Working around these problems wasn't going to be enough; solving them required rethinking the architecture.
Data systems must evolve.
Building for what comes next
We started Spiral to take that next evolutionary step. Adapting legacy systems wasn't going to cut it; we needed to design for machine consumption from day one.
First, we created Vortex—a state-of-the-art columnar file format—and donated it to the Linux Foundation. Microsoft, Snowflake, and Palantir are backing it. TUM's fabled database group just released their latest paper "AnyBlox", independently calling Vortex the "cutting edge" in file formats.
Vortex achieves Parquet's compression ratios with 10-20x faster scans, 5-10x faster writes, and 100-200x faster random access reads (1.5 milliseconds vs Parquet's 200 milliseconds). Depending on the query and the engine, it is no slower, and often dramatically faster. The longer-term breakthrough is more architectural: Vortex is designed to decode data directly from S3 to GPU, skipping the CPU bottleneck.
Spiral is our database built on Vortex: object store native from day one; unified governance across all data types; machine-scale throughput that actually saturates your GPUs; and one API for embeddings, images, and video. All with what we call "fearless permissioning" — move as fast as you want without compromising security, because the right primitives are built in.[4]
Remember that uncanny valley between 1KB and 25MB? The sizes aren't the problem. Second Age systems force you to choose between two bad options: inline the data (killing performance) or store pointers (breaking governance). Spiral eliminates this false choice. We store 10KB embeddings directly in Vortex for microsecond access, intelligently batch 10MB blocks of images for optimal S3 throughput, and externalize 4GB videos without copying a single byte.
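To make the three tiers concrete, here is a toy illustration of that placement decision; the thresholds are invented for the example and are not Spiral's actual policy:

```python
# Toy illustration of the three placement tiers described above; the thresholds
# are invented for this example and are not Spiral's actual policy.
KB, MB, GB = 1_024, 1_024**2, 1_024**3

def placement(size_bytes: int) -> str:
    if size_bytes < 1 * MB:
        return "inline in the columnar file"    # e.g. a 10KB embedding
    if size_bytes < 25 * MB:
        return "batched into a larger block"    # e.g. a 10MB image
    return "externalized by reference"          # e.g. a 4GB video, never copied

for size in (10 * KB, 10 * MB, 4 * GB):
    print(f"{size:>14,} bytes -> {placement(size)}")
```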
When you stop pretending machines are just very fast humans, the entire architecture inverts. Throughput becomes the critical constraint. Object storage is the foundation. Security is unified rather than bolted on.
With $22 million in Seed & Series A funding from Amplify Partners & General Catalyst, we're building the infrastructure the Third Age of Data needs.
What Spiral delivers
In practice:
- That H100 capable of consuming 4 million images per second? With Spiral, it actually can.
- Sharing data without the security nightmare? Solved with time-bounded, audited, granular permissions.
- The five-step data loading dance? A single query.
- Your AI engineers? Actually working on AI.
The future is machine scale
We're tackling how to work with complex data at machine scale. A modern GPU can consume data at terabits per second, in access patterns that existing systems weren't built for. Whether you're loading data from object storage into a GPU for pre-training, doing millions of concurrent point reads for Monte Carlo tree search, or trying to wrangle data that other systems can't, we're building for you.
We're working with design partners across computer vision, robotics, and multimodal AI. If you're spending more than 10% of your time on data infrastructure instead of model development, we should talk.
I started by saying I'm skeptical of revolutionary claims. But when the revolution is already here—when your GPUs are starving and you’re drowning in data—skepticism becomes denial. The question is whether you'll lead that evolution or be left behind.
The future doesn't care if you're ready. But we do.
Join us or reach out (hello at spiraldb dot com).
—
P.S. If you're still managing data in spreadsheets, this post isn't for you. Yet.
[1] Taylor Swift, Postgres, and I are all roughly the same age. Coincidentally, I don't believe in dates before ~1989.
[2] Tables are apparently like low rise jeans: cool in 2005 and back with a vengeance.
[3] See e.g., Warpstream, Turbopuffer, and SlateDB for great examples of how powerful object storage native architectures can be.
[4] I tried not to mention "written in Rust", but yes, this term is inspired by fearless concurrency.