Machines do not press play.

From temporal compression to model-ready tensors.

Video has spent decades being optimized for one consumer: a human pressing play. That consumer wants a smooth stream. Seek near a keyframe, decode forward, display frames at wall-clock speed. The only real job is to stay ahead of the viewer. Video playback is an unusually tightly scoped engineering target, and the codecs and containers hit it extremely well.

But machines do not press play. A multimodal warehouse — storage built to serve video alongside every other modality a model trains on — touches the same files for a workload that looks nothing like playback.

Inside any modern video file is a queryable structure, the rules for turning stored bytes back into pixels. This deep dive is about treating that structure as data. We take H.264 inside an MP4 container apart at every layer a warehouse has to reason about, follow a frame-level query of your choice from request to finished pixels, and show two ways the warehouse can do better: run a smarter query against the file as it stands, or reshape the file so the query is cheap by construction.

What machines ask of video

Video systems serve a small set of recognizable workloads. Most production data pipelines around video are some combination of:

Despite their differences, all of these access patterns reduce to a query over a set of (file, timestamp, output_shape) tuples, possibly aligned across modalities, possibly batched across many files. The job of a multimodal warehouse is to serve those access patterns efficiently, from compressed bytes all the way to on-device RGB tensors, while keeping storage, compute, and networking costs within reasonable limits.

Playback isn't the antagonist here. It's the workload existing video formats and tools were built for, and the warehouse inherits all of that machinery. The interesting question is how to use the same machinery for a different shape of demand.

Why H.264 in MP4?

This deep dive uses H.264 inside MP4 because it is the mainstream baseline many datasets already contain: common in production video collections, broadly supported by tools, and widely hardware-accelerated through the NVDEC, VideoToolbox, AMD VCN, and Intel Quick Sync decode paths. It is also old enough that its container/codec split is well understood. The point is not that H.264 is the only interesting format. It is that the warehouse problem shows up in the ordinary format people already have.

H.264 also keeps the mechanics legible. The MP4 container exposes the timing and byte-address side; the H.264 bitstream exposes the prediction graph; and that split is close enough to H.265 and AV1 that the same planning vocabulary carries over. Newer codecs mostly make the constants harsher: they spend more decode complexity to save delivery bandwidth, which is the right tradeoff for playback and a dangerous default when a warehouse pays decode again for every sampled frame.

AV2 makes the direction clear. VideoLAN's dav2d note describes AV2 decoding as “roughly five times” AV1 complexity and says software on today's machines will struggle to hit real time without architecture-specific optimization.

Tracing a query

We'll follow the selected access pattern through the rest of the chapter. Here that is an event-aligned clip: 8 frames before and 8 after the event at 10 fps, 16 frames over 1.60 s, anchored at t = 0.00 s.

That query is relatively simple, but it passes through many stages before the model sees a usable tensor. The deep dive walks through each of the following stages in enough detail to build an intuition for the costs involved.

timestamps
  -> presentation frames
  -> compressed samples (PTS -> DTS)
  -> decode closure
  -> byte ranges
  -> decoder bitstream
  -> decoded YUV surfaces
  -> RGB / crop / resize / normalize
  -> tensors

Timestamps are not frame numbers

The first edge in the plan is timestamps -> presentation frames. Sometimes that looks like arithmetic: if the clip is truly fixed-rate and the query lands exactly on the frame grid, t × fps gets you the right displayed frame. The trouble is that a reader cannot assume those conditions. The query arrives in seconds because upstream systems speak in seconds: labels, camera clocks, subtitles, scene changes, and sampling policies. The file answers in frame identities.

A frame number is only meaningful after you say which timeline it belongs to. “The 480th displayed frame” is a presentation index. “The frame at 16 s” is a time query. “All frames every 100 ms” is a sampling policy. Those can line up in a tidy clip, but they are not the same request.

The container gives each presented frame a presentation timestamp (PTS) and a duration, stored as integer ticks in the track's timescale. For constant frame rate video, those intervals usually form a tidy grid. For VFR video, dropped-frame pipelines, and screen recordings, they may not. Either way, the reader has to compare the query timestamp to the actual PTS intervals, not to a rounded fps label.

The selection rule is part of the workload, not something the container decides for you. Some readers pick the nearest PTS. Some pick the frame whose interval contains t. A scene-cut query may ask for the first frame after each cut. The important thing is not that one policy is universally right; it is that the policy is explicit before later stages talk about bytes or tensors.
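To make the difference between selection policies concrete, here is a minimal sketch. The frame map, its timescale, and the function names are invented for illustration; they are not any library's API. The map has one long frame (a dropped-frame gap), which is where the policies disagree.

```python
from bisect import bisect_right
from fractions import Fraction

# Hypothetical frame map: (pts_ticks, duration_ticks) in presentation order.
# Values are made up; the fourth frame is deliberately long.
TIMESCALE = 1000  # ticks per second
FRAMES = [(0, 33), (33, 34), (67, 33), (100, 100), (200, 33)]


def to_ticks(t_seconds, timescale):
    """Convert a query time to ticks exactly, without float rounding."""
    return Fraction(t_seconds) * timescale


def frame_containing(t_seconds, frames=FRAMES, timescale=TIMESCALE):
    """Index of the frame whose [pts, pts + duration) interval contains t, else None."""
    ticks = to_ticks(t_seconds, timescale)
    starts = [pts for pts, _ in frames]
    i = bisect_right(starts, ticks) - 1
    if i < 0:
        return None
    pts, duration = frames[i]
    return i if ticks < pts + duration else None


def frame_nearest(t_seconds, frames=FRAMES, timescale=TIMESCALE):
    """Index of the frame whose PTS is closest to t (ties go to the earlier frame)."""
    ticks = to_ticks(t_seconds, timescale)
    return min(range(len(frames)), key=lambda i: (abs(frames[i][0] - ticks), i))
```

For a query at t = 0.19 s, `frame_containing` returns frame 3 (its interval covers 100 to 200 ticks) while `frame_nearest` returns frame 4 (its PTS is only 10 ticks away). Neither answer is wrong; they answer different questions, which is why the policy has to be stated.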

MP4 answers these questions with index blocks. Some describe timing, some mark random-access frames, some describe codec setup, and some point from samples to bytes. In a classic MP4 most of that metadata sits in the movie index; in a fragmented MP4, smaller metadata runs can appear throughout the file beside the media data.
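As a rough illustration of what the timing blocks hold, the sketch below expands two run-length tables into per-sample timestamps. The box names are our addition, taken from the ISO base media file format rather than from the text above: `stts` stores (sample_count, sample_delta) runs that accumulate into DTS, and `ctts` stores (sample_count, composition_offset) runs that are added to DTS to get PTS. The table values are invented, and parsing the boxes themselves is out of scope.

```python
def expand_runs(runs):
    """Expand run-length (count, value) pairs into one value per sample."""
    out = []
    for count, value in runs:
        out.extend([value] * count)
    return out


def sample_times(stts_runs, ctts_runs=None):
    """Return (dts, pts) tick pairs per sample, in decode order.

    stts_runs: (sample_count, sample_delta) pairs; DTS is the running sum.
    ctts_runs: (sample_count, composition_offset) pairs; PTS = DTS + offset.
    """
    deltas = expand_runs(stts_runs)
    offsets = expand_runs(ctts_runs) if ctts_runs else [0] * len(deltas)
    if len(offsets) != len(deltas):
        raise ValueError("timing tables disagree on sample count")
    times, dts = [], 0
    for delta, offset in zip(deltas, offsets):
        times.append((dts, dts + offset))
        dts += delta
    return times


def display_order(times):
    """Decode-order sample indices sorted by presentation time."""
    return sorted(range(len(times)), key=lambda i: times[i][1])


# Illustrative tables: five samples, constant delta, one reordered sample.
EXAMPLE = sample_times([(5, 1000)], [(2, 1000), (1, 3000), (2, 0)])
```

With these made-up tables the third sample in decode order is presented last, so `display_order(EXAMPLE)` is `[0, 1, 3, 4, 2]`. A cached frame map is essentially this expansion done once and stored.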

This is what tooling caches as a frame map. ffprobe can print a normalized view of frame PTS, duration, keyframe flags, and packet positions. TorchCodec can cache the same kind of timing data as frame_mappings. Those maps make timestamp lookup cheap and deterministic. They identify which presentation frames the query wants. The later sections explain what it costs to fetch and decode those frames.

Compression shifts work to decode

Video compression exploits two kinds of structure. Spatial redundancy is what every image codec starts from: pixels close together in a frame tend to have similar values. Frequency-domain transforms (the DCT in JPEG, an integer approximation of it in H.264, wavelets in JPEG 2000) decorrelate that spatial redundancy into a sparse set of coefficients that quantize and entropy-code well. Temporal redundancy is video-specific: regions of a frame (blocks) look like regions in nearby frames, so the codec stores motion + residual per block instead of another full picture. That second kind is what makes video dramatically smaller than per-frame image compression.

The third major piece is entropy coding. After spatial and temporal prediction have made the signal smaller, the encoder still has motion vectors, residual coefficients, block modes, and other syntax symbols to store. H.264 uses CABAC or CAVLC to spend fewer bits on likely symbols and more bits on unlikely ones. This stage is not another visual prediction step; it is statistical coding over the syntax left behind by prediction and quantization.

Spatial prediction reduces what must be stored inside a frame. Temporal prediction reduces what must be stored across frames. Entropy coding squeezes the remaining syntax into fewer bits. The temporal part is what changes the shape of a read: if frame 101 is described as a change from frame 100, the decoder needs frame 100 before frame 101 makes sense.

The frame graph

If every frame were predicted only from the previous one, reading any frame would mean decoding from the beginning of the file. That would compress well, but seeking would be miserable. Video codecs need reset points because humans also jump around in videos. Those reset points are I-frames: frames coded from their own bytes, without depending on earlier pictures.

Between those reset points, P-frames recover compression by predicting from earlier decoded frames. Then B-frames go one step further: they can interpolate from surrounding decoded frames, often using both an earlier and a later picture. Those choices turn a timeline into a dependency graph. At frame-level granularity, frames are nodes, and each reference is an edge from the dependent frame back to the frame that supplies decoded data.

Figure: frame graph for a five-frame example, in display order (I = I-frame, P = P-frame, B = B-frame).

  display frame   type   DTS   references (by DTS)
  0               I      0     none
  1               P      1     0
  2               B      3     1, 2
  3               B      4     1, 2
  4               P      2     1

Selected: DTS 3. Closure: DTS 0, 1, 2, 3.

The diagram shows an interesting result. The B-frame in playback position 3 references the data from the later P-frame in position 4. This only works if that P-frame is decoded first. That is why decode order diverges from display order, and why compressed video samples carry both DTS (decode timestamp) and PTS (presentation timestamp).

The real bitstream is messier than “P means previous frame, B means previous and next.” H.264 carries reference lists: ordered sets of decoded pictures that a frame is allowed to predict from. A P-frame chooses one reference per predicted block from list 0. A B-frame may choose from list 0, list 1, or combine one from each. The diagrams below collapse that block-level machinery into one node per frame and one edge per frame-level reference.

The decode closure of a frame is the frame itself plus every frame that must be decoded before it. In other words, it is the set reached by following reference edges transitively through the frame graph. For a batch, the combined closure is the union of the closures of all requested frames.

Once the closure is known, the codec question is answered: these are the compressed samples required for the requested frame. The next problem is addressing. The reader still has to find where those samples live in the file.
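The closure computation itself is a plain graph traversal. This is a minimal sketch using the reference edges listed for the five-frame example above, keyed by DTS; the dictionary layout and function name are our own choices, and a real reader would build the graph from parsed H.264 reference lists.

```python
# Five-frame example, keyed by DTS. Each entry lists the DTS values
# of the frames that frame references.
REFS = {0: [], 1: [0], 2: [1], 3: [1, 2], 4: [1, 2]}


def decode_closure(target, refs):
    """Return the sorted DTS list of `target` plus everything it transitively references."""
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(refs[node])
    return sorted(seen)
```

`decode_closure(3, REFS)` gives `[0, 1, 2, 3]`, matching the closure shown for the selected frame. Sorting by DTS also yields a valid feed order, since in this example every reference has a lower DTS than the frame that uses it.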

From samples to byte ranges

At this point the query is no longer abstract. The selected access pattern names output frames, and the frame graph expands those into a closure of compressed samples. The container's job is to turn those sample identities into byte offsets and sizes.

The blue spans are the bytes a closure-aware reader would ask storage for before invoking the decoder. Adjacent samples collapse into one range. Scattered samples stay scattered. The counts in the figure are computed from the selected MP4's real sample table and, where the frame graph is available, its parsed H.264 references.

Object stores make every independent range pay latency and request overhead; SSDs and operating-system caches prefer nearby bytes. Good readers coalesce adjacent sample ranges and deliberately overfetch a little when one larger read is cheaper than many small ones, especially across a batch of clips.
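Coalescing can be sketched in a few lines. The `max_gap` parameter is a hypothetical tuning knob of ours: gaps up to that many bytes are fetched on purpose so that two requests become one. The sample offsets below are invented.

```python
def coalesce(ranges, max_gap=0):
    """Merge (offset, size) byte ranges whose gap is at most max_gap bytes.

    Returns (offset, size) reads sorted by offset. A positive max_gap
    deliberately overfetches the bytes between nearby samples.
    """
    merged = []  # list of [start, end) pairs
    for offset, size in sorted(ranges):
        end = offset + size
        if merged and offset - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


# Illustrative closure samples as (byte offset, size).
SAMPLES = [(1000, 400), (1400, 250), (9000, 300), (9500, 200), (50000, 800)]
EXACT = coalesce(SAMPLES)                # only touching samples merge
RELAXED = coalesce(SAMPLES, max_gap=512) # also bridge small gaps
```

Here `EXACT` needs four reads and `RELAXED` needs three, at the price of 200 extra bytes. Where the right threshold sits depends on the storage backend's per-request latency, which this sketch does not model.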

The decoder boundary

A decoder turns a stream of compressed samples into decoded pictures. Software decoders do that on CPU cores. Hardware decoders — NVDEC on NVIDIA, VideoToolbox on Apple, VCN on AMD, Quick Sync on Intel — use fixed-function video blocks built for this exact bitstream work. NVDEC is not CUDA cores running a kernel; it is separate hardware sitting beside the CUDA cores.

compressed samples
  -> bitstream parse                    # slice headers, NAL units
  -> entropy decode (CABAC / CAVLC)     # serial, stateful
  -> inverse quantization               # parallel within frame
  -> inverse transform                  # parallel within frame
  -> prediction + reconstruction        # uses decoded references
  -> deblocking + loop filters
  -> decoded YUV surface

Entropy decode is the hard serial step. CABAC and CAVLC are adaptive: each symbol depends on context produced by previous symbols. Hardware makes that serial machine fast, but it does not make one bitstream arbitrarily parallel.

Decoding is also stateful across frames: a stream's reference frames live in the decoder's decoded-picture buffer, and later frames depend on that state. A datacenter GPU may expose only a handful of hardware decode engines (roughly five to seven on modern NVIDIA chips), so throughput comes from keeping those engines fed with independent closures, not from splitting one closure across CUDA cores.
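One way to picture "keeping engines fed" is as a load-balancing problem over whole closures. The sketch below is a toy model, not a description of any driver or SDK scheduler: it assumes closure frame count is a fair stand-in for decode cost and assigns closures greedily, largest first, to the least-loaded engine.

```python
import heapq


def assign_closures(closure_sizes, num_engines):
    """Greedy longest-first assignment of independent closures to decode engines.

    closure_sizes maps a closure id to its frame count (a stand-in for cost).
    Returns one list of closure ids per engine. A closure is never split.
    """
    heap = [(0, engine) for engine in range(num_engines)]  # (load, engine index)
    heapq.heapify(heap)
    plan = [[] for _ in range(num_engines)]
    ordered = sorted(closure_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for closure_id, size in ordered:
        load, engine = heapq.heappop(heap)
        plan[engine].append(closure_id)
        heapq.heappush(heap, (load + size, engine))
    return plan


# Hypothetical batch: five closures with made-up frame counts, two engines.
PLAN = assign_closures({"a": 8, "b": 5, "c": 4, "d": 3, "e": 2}, 2)
```

In this example both engines end up with 11 frames of work. Real cost depends on resolution, bitrate, and frame types, so a production scheduler would want measured costs rather than frame counts.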

In our own H100 CUVID runs, hardware decode has become a visible cost: sparse H.264 workloads can drive decoder utilization to the ceiling. Treat that as a warning about where the bottleneck can move, not as a universal throughput law. The exact limit depends on codec profile, resolution, bitrate, GOP shape, batching, scheduler behavior, and how post-decode CUDA work is placed relative to the model. The planner's defensible claim is narrower: fewer decoded closure frames and tighter byte ranges reduce the work handed to whichever decoder path is available.

Video surfaces and color conversion

Decoders do not usually output RGB tensors. They output video surfaces, commonly NV12 for 8-bit 4:2:0 video or P010 for 10-bit 4:2:0 video. Both are YCbCr layouts: luma is stored separately from lower-resolution chroma.

The conversion exists because RGB is a display-oriented representation, not a compression-friendly one. In camera video, the red, green, and blue channels carry a lot of the same brightness structure. A luma/chroma transform pulls that shared brightness signal into Y and leaves color-difference signals in the chroma planes. That separation lets prediction, transform coding, and quantization spend bits according to what each component actually contributes.

The lower chroma resolution is the next economy. Humans are much more sensitive to luma edges and texture than to high-frequency color detail, so most camera video stores fewer chroma samples than luma samples. In 4:2:0, each chroma plane has half the width and half the height of the luma plane: one chroma sample covers a 2×2 block of luma samples. That cuts chroma sample count to a quarter while preserving the brightness detail that carries most visible structure.
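The arithmetic is easy to check. This sketch counts samples for an 8-bit 4:2:0 frame and compares the result with packed 8-bit RGB. It assumes even dimensions and ignores the row padding and alignment that real decoder surfaces often add.

```python
def plane_samples_420(width, height):
    """Sample counts per plane for a 4:2:0 frame with even dimensions."""
    luma = width * height
    chroma_each = (width // 2) * (height // 2)  # one Cb plane or one Cr plane
    return {"Y": luma, "Cb": chroma_each, "Cr": chroma_each}


def bytes_per_frame(width, height):
    """Return (yuv420_bytes, rgb_bytes) at one byte per sample, without padding."""
    planes = plane_samples_420(width, height)
    yuv420 = planes["Y"] + planes["Cb"] + planes["Cr"]
    rgb = width * height * 3
    return yuv420, rgb


PLANES_1080P = plane_samples_420(1920, 1080)
YUV_BYTES, RGB_BYTES = bytes_per_frame(1920, 1080)
```

For 1920×1080, each chroma plane holds a quarter of the luma sample count, and the whole 4:2:0 frame takes 1.5 bytes per pixel against 3 for RGB. The decoded surface is therefore half the size of the RGB tensor it will become.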

YUV/YCbCr is not inherently worse than RGB. The losses usually come from compression, quantization, and chroma subsampling, not from the name of the color transform. But converting a subsampled YCbCr surface into RGB requires chroma upsampling, a color matrix, range handling, and rounding. The exact choice matters when the model sees the result as numeric input.
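To show why the exact choice matters, here is a per-sample conversion sketch in plain Python. It covers the matrix, range, and rounding steps only; chroma is assumed to be already upsampled to the luma position. The function and table names are ours, and the luma coefficients are the standard BT.601 and BT.709 values.

```python
# Luma coefficients (Kr, Kb) for two common matrices.
MATRICES = {"bt601": (0.299, 0.114), "bt709": (0.2126, 0.0722)}


def ycbcr_to_rgb(y, cb, cr, matrix="bt709", full_range=False):
    """Convert one 8-bit YCbCr sample triple to an 8-bit (R, G, B) tuple."""
    kr, kb = MATRICES[matrix]
    kg = 1.0 - kr - kb
    if full_range:
        yn, pb, pr = y / 255.0, (cb - 128) / 255.0, (cr - 128) / 255.0
    else:
        # Limited ("video") range: luma spans 16..235, chroma spans 16..240.
        yn, pb, pr = (y - 16) / 219.0, (cb - 128) / 224.0, (cr - 128) / 224.0
    r = yn + 2.0 * (1.0 - kr) * pr
    b = yn + 2.0 * (1.0 - kb) * pb
    g = (yn - kr * r - kb * b) / kg
    # Rounding and clamping are part of the numeric contract too.
    return tuple(min(255, max(0, round(255.0 * c))) for c in (r, g, b))
```

The same stored triple decodes to different RGB values under BT.601 and BT.709, and a mid-gray sample shifts when limited range is misread as full range. A model trained on one convention and served another sees a systematic color or contrast offset, so the reader should take the matrix and range from the stream's signaled metadata where it exists rather than guess.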

Models want RGB mostly because the surrounding ecosystem does: pretrained image backbones, augmentation libraries, and normalization constants are conventionally defined over RGB image tensors. The video reader therefore still has work after decode: YCbCr to RGB, crop, resize, normalize, and assemble batches. On a GPU pipeline, those steps run on CUDA cores and move memory across host, device, and sometimes PCIe boundaries. If scheduled poorly, post-decode work competes directly with the model.

From timestamps to tensors

Now we come back to the full query flow:

timestamps
  -> presentation frames
  -> compressed samples (PTS -> DTS)
  -> decode closure
  -> byte ranges
  -> decoder bitstream
  -> decoded YUV surfaces
  -> RGB / crop / resize / normalize
  -> tensors

A general library such as TorchCodec or FFmpeg exposes this through a decoder-shaped API. It resolves timestamps or frame indices using demuxer state, or a timing cache such as TorchCodec's frame_mappings, seeks to an appropriate keyframe, feeds compressed samples in decode order, decodes forward until the requested presentation frame appears, and returns the requested frame or tensor. That is the right abstraction for broad compatibility: the decoder owns the hidden work.

The gap is that general-purpose libraries usually pay at the GOP level. Frame mappings make timestamp lookup cheap, but they do not make a P-frame or B-frame independently decodable. The common path still seeks to a sync sample, decodes forward, discards intermediate frames, and treats each request like a small playback.

A good warehouse reader can do better because it owns the batch. It can keep the MP4 sample index and timestamp mapping as cached metadata, parse the H.264 reference graph into a closure index, plan the union of closures for all requested outputs, coalesce the sample byte ranges, feed compressed samples in DTS order, and materialize RGB tensors only for output frames.
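The batch advantage can be sketched with the five-frame example from the frame graph section. The data layout and function names below are illustrative, not Spiral's or any library's interface: each entry maps a DTS to its display position and the DTS values it references.

```python
# Five-frame example keyed by DTS: (display_frame, referenced DTS values).
FRAMES = {
    0: (0, []),
    1: (1, [0]),
    2: (4, [1]),
    3: (2, [1, 2]),
    4: (3, [1, 2]),
}


def closure(dts, frames):
    """Set of DTS values needed to decode the frame at `dts`, itself included."""
    seen, stack = set(), [dts]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(frames[node][1])
    return seen


def plan_batch(display_frames, frames):
    """Return [(dts, emit)] in decode order for a batch of requested display frames.

    Every frame in the union of closures is fed to the decoder once;
    `emit` is True only for frames that should become output tensors.
    """
    by_display = {display: dts for dts, (display, _) in frames.items()}
    wanted = {by_display[d] for d in display_frames}
    needed = set()
    for dts in wanted:
        needed |= closure(dts, frames)
    return [(dts, dts in wanted) for dts in sorted(needed)]


PLAN = plan_batch([2, 3], FRAMES)
```

Requesting display frames 2 and 3 separately would decode four frames each, eight in total. Planned together, the union is five decodes, and only two of them are converted to RGB.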

This is where Spiral's source reader enters the story. On an existing H.264 MP4, it can plan sparse byte ranges around the actual samples required by the requested frames, choose accelerated decode paths such as NVDEC when available, and schedule range fetches, bitstream assembly, decode engines, post-decode CUDA work, and model consumption as one pipeline. The win comes from tighter decode closures and better scheduling, not from a different interpretation of MP4.

There is still a ceiling. A source reader can avoid unnecessary work around the graph it is given, but it cannot make that graph simpler. If the requested frames sit behind long reference chains or scattered samples, the best reader still has to pay those closures. The closure and range counts above make that distinction explicit: better source reads help, but repeated access patterns eventually want a file written for the query.

Why not flatten it?

If video takes this much machinery to read well, the obvious question is why not flatten it into something easier: raw RGB tensors, one JPEG per frame, short-GOP encodes, fixed clips, thumbnails, embeddings. Those are all valid materialized views when they match a frequent query.

They are not replacements for the source. Flattening buys simpler access by spending storage, egress, compute, quality, or generality. The compression ratio is the reason we are willing to do the planning work.

There is also a third move: pre-materialize a derived view. Thumbnails and latents are not alternate encodings of the source so much as cached answers to expected questions. That can be excellent when the query is stable: a fixed thumbnail policy, a fixed embedding model, a fixed scene-cut detector. It decays when the question changes. A new model invalidates latents; a new UI may need middle-frame thumbnails instead of I-frame thumbnails; a scene-search workflow may want cuts rather than fixed posters.

The database question is not whether to flatten everything. It is which views are stable enough to materialize, which queries should read the source directly, and which repeated access patterns justify a new encode.

Designing files for repeated queries

A smart source reader plans around the frame graph it is given. If the warehouse also controls the writer, it can change the graph. For repeated machine views, the first goal is to make the useful frames land on smaller, more bounded decode closures instead of long irregular reference paths.

That imposed structure is not free. The encoder gives up some prediction freedom, so compression ratio can get worse. The bet is the same one databases make with indexes, clustering, and materialized views: spend some storage or encode efficiency so a repeated query becomes predictable and cheap to execute.

Control the frame graph

Start with the retained cadence, such as a 30 fps source read at 10 fps. The encoder can make those retained frames cheap anchors, then fill the gaps with B-frame refinement layers. Exact cadences matter: when the ratio is a whole number, every third frame can belong to the 10 fps retained layer while the intervening frames stay as full-rate refinement.

Exact cadence ladder
Figure: frames in display order (I = I-frame, P = P-frame, B = B-frame).

  display frame   type   DTS   references (by DTS)
  0               I      0     none
  1               B      5     0, 3
  2               B      9     5, 3
  3               B      3     0, 2
  4               B      6     3, 2
  5               B      10    6, 2
  6               B      2     0, 1
  7               B      7     2, 4
  8               B      11    7, 4
  9               B      4     2, 1
  10              B      8     4, 1
  11              B      12    8, 1
  12              P      1     0

Exact-cadence layers are assigned from presentation position, while P-frames act as sparse anchors every 12 display frames. Everything between anchors is a B-frame refinement tree. For 30 -> 10 fps, the retained frames (every third display position) can be reference B-frames; the prefix property is that their closure stays inside the retained cadence.

Selected: frame 6 (view layer 0, DTS 2). Closure: frames 0, 12, 6.

This is more precise than “use shorter GOPs.” The encoder controls which frames are cheap anchors, which frames are enhancement layers, and how much closure cost a down-rate query pays.
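The prefix property is checkable. The sketch below restates the ladder figure's reference edges in terms of display frame numbers (converted by us from the DTS labels) and tests whether every retained frame's closure stays inside the retained set. The helper names are our own.

```python
# Exact-cadence ladder keyed by display frame. Each value lists the
# display frames that frame references (converted from the DTS labels).
REFS = {
    0: [], 1: [0, 3], 2: [1, 3], 3: [0, 6], 4: [3, 6], 5: [4, 6],
    6: [0, 12], 7: [6, 9], 8: [7, 9], 9: [6, 12], 10: [9, 12],
    11: [10, 12], 12: [0],
}


def closure(frame, refs):
    """Set of display frames needed to decode `frame`, itself included."""
    seen, stack = set(), [frame]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(refs[node])
    return seen


def has_prefix_property(refs, step):
    """True if every retained frame (every `step`-th) decodes from retained frames only."""
    retained = {frame for frame in refs if frame % step == 0}
    return all(closure(frame, refs) <= retained for frame in retained)
```

For this graph `has_prefix_property(REFS, 3)` holds: the 10 fps view needs only frames 0, 3, 6, 9, and 12. It fails for `step=2`, because frame 2 references frames 1 and 3. The layout is shaped for one cadence ladder, not for every possible down-rate.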

Repack samples for prefixes

The second knob is physical sample order. Repacking writes the compressed samples so the low-rate ladder appears at the start of the file. A down-rate read can then fetch a prefix range instead of collecting scattered samples across the whole object.

Prefix view (each label is display frame, then view layer: 3v0 is frame 3 in layer 0)

display-order samples

  0v0 1v1 2v1 3v0 4v1 5v1 6v0 7v1 8v1 9v0 10v1 11v1 12v0

prefix-packed sample order (prefix: 5 of 13 samples)

  0v0 3v0 6v0 9v0 12v0 | 1v1 2v1 4v1 5v1 7v1 8v1 10v1 11v1

Repacking does not change decoded pixels. It changes physical sample order so every lower-rate ladder view is a prefix range: read the first N compressed samples, then stop.
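The ordering in the prefix view can be reproduced with a sort by (layer, display position). This is a sketch of the ordering rule only, assuming a two-layer ladder with a fixed step; a real repacker also has to rewrite the container's sample offsets, which is not shown.

```python
def view_layer(display_frame, step=3):
    """Layer 0 holds the retained cadence; every other frame is layer 1."""
    return 0 if display_frame % step == 0 else 1


def prefix_pack(num_frames, step=3):
    """Return (physical order of display frames, length of the layer-0 prefix)."""
    order = sorted(range(num_frames), key=lambda f: (view_layer(f, step), f))
    prefix_len = sum(1 for f in order if view_layer(f, step) == 0)
    return order, prefix_len


ORDER, PREFIX_LEN = prefix_pack(13)
```

For 13 frames this yields the order 0, 3, 6, 9, 12 followed by the remaining frames, with a prefix of 5 samples. A 10 fps read then becomes one contiguous range covering those five samples.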

The same idea is the physical version of the source-reader facts above. Ranges can drop because the retained view is prefix-readable in byte order. Closure size can drop because the GOP is bounded by design. The file is shaped around the access pattern, and the reader can extract the view with minimal work.

The decision framework, expressed as the tradeoff every database makes:

  Access pattern          Layout choice
  Unknown / one-off       Plan against the source file. No transcoding.
  Repeated / shaped       Transcode to a layout that matches the shape.
  Mixed / discovery-led   Keep source; precompute derived modalities.

A footnote on what the codec spec already wants to do. H.264 has a scalability extension, Scalable Video Coding (SVC), defined as Annex G of the H.264 standard, that puts spatial and temporal pyramids inside a single bitstream. One file can carry multiple resolutions and frame rates as nested sub-bitstreams, with the decoder extracting whichever sub-bitstream a query needs. In principle this is exactly what a multimodal warehouse wants. In practice NVDEC and most consumer decoders implement only the non-scalable profiles; an SVC bitstream falls back to software decode and pays back every cycle the hardware was supposed to save. The layout work above ends up living at the warehouse layer rather than the codec layer.

Producing these layouts means a custom encoder that controls the frame graph, slice ordering, and sample byte order to fit a known retention pattern. The output is standard (non-scalable) H.264 — any decoder can read it — but the structure is shaped around the physical plan rather than the playback timeline. Spiral's transcoder is one realization of this.

Video as structured data

Video already has structure. Its physical layout was chosen for compression and playback, while machine workloads ask different questions. This deep dive followed that mismatch down: timestamp selection, MP4 sample addressing, H.264 dependency closures, byte ranges, decoder scheduling, color conversion, tensor materialization.

A warehouse has three ways to respond. Read the source intelligently for one-off access: cache frame maps and sample indexes, plan against the codec graph, coalesce byte ranges, materialize only what was asked for. Shape the compressed layout when access repeats: control the frame graph and sample order so the useful view has smaller closures and more predictable reads. Pre-materialize when the answer is stable and needed soon: cache thumbnails, clips, embeddings, or tensors as answers to expected questions.

These aren't competing philosophies; they're points in the same query plan. The mistake is not that video is unstructured — the mistake is treating compressed video as opaque to the planner.