Lesson 1 of 6

How Video Codecs Work

From raw video to compressed stream - understanding the codec pipeline

Imagine trying to stream the entire Lord of the Rings trilogy in uncompressed 4K quality. At 24 frames per second with 8-bit color, the roughly nine-hour theatrical trilogy works out to around 20 terabytes of data - that's like trying to drink from a fire hose through a straw! Yet Netflix delivers 4K at about 15 megabits per second, or roughly 60 gigabytes for the whole trilogy. This 300:1 compression ratio isn't magic - it's the result of sophisticated algorithms that exploit how humans perceive video and the inherent redundancies in visual data.
TL;DR

Video codecs compress video by removing spatial and temporal redundancies through a pipeline of partitioning, prediction, transformation, quantization, and entropy coding. Understanding this pipeline reveals why modern codecs like H.264, HEVC, and AV1 achieve such impressive compression ratios while maintaining visual quality.

Raw Video Sizes: The Compression Challenge

Before we can appreciate compression, we need to understand just how large uncompressed video really is. At 30 frames per second with 8-bit color (3 bytes per pixel), the raw data rates for common resolutions are:

  • 1080p (1920×1080): ~187 MB/s (~1.5 Gbps), about 0.7 TB per hour
  • 4K (3840×2160): ~746 MB/s (~6 Gbps), about 2.7 TB per hour
  • 8K (7680×4320): ~3 GB/s (~24 Gbps), about 10.7 TB per hour

Professional video often uses 10-bit or 12-bit depth, making these sizes even larger. At these bitrates, even a short video would be impractical to store or transmit without compression.
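These figures are pure arithmetic (width × height × bytes per pixel × frame rate); a short Python sketch reproduces them:

```python
def raw_video_rate(width, height, fps, bytes_per_pixel=3):
    """Uncompressed data rate in bytes per second."""
    return width * height * bytes_per_pixel * fps

for name, (w, h) in {"1080p": (1920, 1080),
                     "4K":    (3840, 2160),
                     "8K":    (7680, 4320)}.items():
    rate = raw_video_rate(w, h, fps=30)          # 8-bit color at 30 fps
    print(f"{name}: {rate / 1e6:,.0f} MB/s "
          f"({rate * 8 / 1e9:.1f} Gbps, "
          f"{rate * 3600 / 1e12:.2f} TB/hour)")
```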

Fun fact: A 2-hour movie in uncompressed 8K at 60 fps would require roughly 43 terabytes of storage - more than 9,000 DVDs or 1,700 Blu-ray discs!

The Video Codec Pipeline: Six Stages of Compression

Whether it's H.264 from 2003 or AV1 from 2018, all modern video codecs follow essentially the same six-stage pipeline. Each stage serves a specific purpose in reducing redundancy while preserving perceptual quality:

  1. Partitioning: Divide frames into blocks for localized processing
  2. Prediction: Estimate block content to reduce what needs encoding
  3. Transform: Convert residual data to frequency domain
  4. Quantization: Reduce precision of transform coefficients (main lossy step)
  5. Entropy Coding: Losslessly compress the quantized data
  6. Reconstruction: Rebuild frames for use in prediction (feedback loop)

Let's walk through each stage with a simple 8×8 pixel block example:

🔲 Stage 1: Partitioning

Modern codecs use quadtree partitioning to adapt block sizes to content complexity. Simple areas use large blocks; detailed areas use small blocks.
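As a rough illustration, here is a minimal Python sketch that splits blocks based on pixel variance. The variance threshold is a made-up stand-in: real encoders decide splits by rate-distortion cost, not raw variance.

```python
import numpy as np

def quadtree_split(block, top, left, min_size=8, threshold=100.0):
    """Recursively split a square block while its pixel variance is high.

    Returns a list of (top, left, size) leaf blocks.
    """
    size = block.shape[0]
    if size <= min_size or np.var(block) < threshold:
        return [(top, left, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            sub = block[dy:dy + half, dx:dx + half]
            leaves += quadtree_split(sub, top + dy, left + dx,
                                     min_size, threshold)
    return leaves

# A flat region stays one big block; a noisy region splits down to 8x8.
flat  = np.full((64, 64), 128.0)
noisy = np.random.default_rng(0).integers(0, 256, (64, 64)).astype(float)
print(len(quadtree_split(flat, 0, 0)))   # 1 leaf
print(len(quadtree_split(noisy, 0, 0)))  # 64 leaves
```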

โžก๏ธ Stage 2: Prediction

Instead of encoding every pixel, we encode motion vectors that describe how blocks move between frames, plus the small prediction error (residual).
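A tiny sketch of the idea on synthetic frames (the shift amount and block coordinates are arbitrary choices for the demo):

```python
import numpy as np

# Reference frame and current frame: the whole image has shifted
# 2 pixels right and 1 pixel down between frames.
rng = np.random.default_rng(1)
prev_frame = rng.integers(0, 256, (32, 32)).astype(np.int16)
cur_frame = np.roll(prev_frame, shift=(1, 2), axis=(0, 1))

# Encode the block at (8, 8) with motion vector (dy=1, dx=2):
mv_y, mv_x = 1, 2
cur_block = cur_frame[8:16, 8:16]
predicted = prev_frame[8 - mv_y:16 - mv_y, 8 - mv_x:16 - mv_x]
residual = cur_block - predicted   # all zeros here: a perfect prediction

print(np.abs(residual).sum())      # 0 -> only the motion vector needs coding
```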

📊 Stage 3: Transform (DCT)

The Discrete Cosine Transform converts spatial pixel data into frequency coefficients. Most visual information is concentrated in low-frequency coefficients (top-left), allowing aggressive compression of high-frequency details.

🔢 Stage 4: Quantization

Quantization reduces the precision of transform coefficients. Higher QP values mean more compression but more visible artifacts. This is the primary lossy step in video compression.
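Putting stages 3 and 4 together, here is a sketch of the 8×8 walk-through promised above, using scipy's DCT. A uniform quantizer step stands in for a real codec's full QP-to-step-size mapping:

```python
import numpy as np
from scipy.fft import dctn, idctn

# A smooth 8x8 luma block (a horizontal gradient), typical of "easy" content.
block = np.tile(np.linspace(50, 120, 8), (8, 1))
coeffs = dctn(block, norm='ortho')              # Stage 3: transform

for qstep in (2, 10, 40):                       # larger step = higher "QP"
    quantized = np.round(coeffs / qstep)        # Stage 4: quantize (lossy)
    rebuilt = idctn(quantized * qstep, norm='ortho')  # Stage 6: reconstruct
    mse = np.mean((block - rebuilt) ** 2)
    print(f"step {qstep:2d}: {np.count_nonzero(quantized):2d}/64 "
          f"nonzero coefficients, MSE {mse:.2f}")
```

As the step size grows, fewer coefficients survive quantization and the reconstruction error rises - exactly the compression/quality trade-off the QP controls.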

Frame Types: I, P, and B Frames Explained

Video codecs use different frame types to balance compression efficiency with random access and error resilience. Understanding these frame types is key to understanding video streaming and video conferencing trade-offs.

  • I-frames (Intra-coded): Encoded independently using only spatial information from within the same frame. Serve as random access points and reference for other frames. Largest size but most resilient.
  • P-frames (Predictive): Encoded using motion compensation from previous I or P frames. Medium size, good compression, require previous frame for decoding.
  • B-frames (Bidirectional): Encoded using motion compensation from both previous and future frames. Smallest size, best compression, but require both directions for decoding and increase latency.
A typical encoding sequence for streaming might look like: I B B P B B P B B P B B I ... This provides excellent compression while allowing efficient seeking (jump to any I-frame) and error recovery (the loss of a frame doesn't propagate indefinitely).

Trade-off: B-frames compress best (~50% smaller than P-frames) but require more memory (to store future frames) and processing (more complex motion compensation).

📼 GOP (Group of Pictures) Structure

The GOP structure defines the pattern of I, P, and B frames. Smaller GOPs improve seeking and error resilience but reduce compression efficiency. Larger GOPs increase compression but make seeking slower and error propagation worse.
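A small sketch of how display order differs from decode order for the I B B P pattern above. B-frames can't be decoded until their future reference arrives, which is where the extra latency comes from; the reordering rule here is simplified (in particular, trailing B-frames would really wait for the next GOP's I-frame):

```python
def gop_display_order(gop_size=12, b_frames=2):
    """Frame types in display order: I, then repeating [B..B, P] groups."""
    types = ['I']
    while len(types) < gop_size:
        types += ['B'] * b_frames + ['P']
    return types[:gop_size]

def decode_order(types):
    """Each forward reference (I/P) is decoded before the B-frames
    that are displayed ahead of it."""
    out, pending_b = [], []
    for i, t in enumerate(types):
        if t == 'B':
            pending_b.append((i, t))
        else:
            out.append((i, t))
            out += pending_b
            pending_b = []
    return out + pending_b   # simplified handling of the GOP tail

types = gop_display_order()
print(' '.join(types))                             # I B B P B B P B B P B B
print(' '.join(f"{t}{i}" for i, t in decode_order(types)))
# I0 P3 B1 B2 P6 B4 B5 P9 B7 B8 B10 B11
```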

Motion Compensation: Beyond Simple Block Matching

Early motion compensation used simple block matching with integer pixel precision. Modern codecs use sophisticated techniques that significantly improve compression efficiency:

  • Sub-pixel precision: half- and quarter-pixel motion vectors, computed with interpolation filters
  • Variable block sizes: separate motion vectors for partitions from large blocks down to 4×4
  • Multiple reference frames: a block can be predicted from any of several previously decoded frames
  • Weighted prediction: reference blocks can be scaled and blended, which helps with fades and dissolves

Common misconception: Motion compensation doesn't track objects - it tracks pixel blocks. A person walking might be represented by dozens of different motion vectors as their clothing, limbs, and background move differently.
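For concreteness, here is the classic integer-pixel baseline the section starts from: an exhaustive search that picks the motion vector minimizing the sum of absolute differences (SAD). Block size, search range, and the synthetic frames are illustrative choices:

```python
import numpy as np

def best_motion_vector(cur_block, ref_frame, top, left, search=8):
    """Full search over +/-search integer offsets, minimizing SAD."""
    n = cur_block.shape[0]
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > ref_frame.shape[0] \
                    or x + n > ref_frame.shape[1]:
                continue                      # candidate falls off the frame
            candidate = ref_frame[y:y + n, x:x + n]
            sad = np.abs(cur_block.astype(int) - candidate.astype(int)).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

# Synthetic test: the current block is the reference patch shifted by (3, -2).
rng = np.random.default_rng(2)
ref = rng.integers(0, 256, (64, 64)).astype(np.uint8)
cur_block = ref[19:27, 14:22]          # patch really lives at (19, 14)
print(best_motion_vector(cur_block, ref, top=16, left=16))
# ((3, -2), 0) -- the search finds the shift exactly, with zero residual
```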

Transform Coding: Why DCT Works So Well

After motion compensation, we're left with prediction residuals - the differences between predicted and actual pixel values. These residuals often have special properties that make them ideal for transform coding:

Energy compaction: Most of the signal energy is concentrated in a few low-frequency coefficients. This lets us aggressively quantize (reduce precision of) high-frequency coefficients that contribute less to perceived quality.

For an N×N block, the 2-D DCT is:

F(u,v) = α(u)α(v) Σ Σ f(x,y) cos[((2x+1)uπ)/(2N)] cos[((2y+1)vπ)/(2N)]

Where α(0) = √(1/N), α(u) = √(2/N) for u > 0, f(x,y) is the input block, and both sums run over x, y = 0, …, N−1.

The DCT is particularly effective because it approximates the Karhunen-Loève Transform (KLT) for first-order Markov processes, which closely models the correlation in natural image data.
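You can check energy compaction directly: transform a smooth block and measure what share of the total coefficient energy lands in the low-frequency corner. The test block here is made up for the demo:

```python
import numpy as np
from scipy.fft import dctn

# A smooth 8x8 block: two low-frequency ramps plus mild noise.
y, x = np.mgrid[0:8, 0:8]
block = 100 + 4 * x + 2 * y + np.random.default_rng(3).normal(0, 1, (8, 8))

coeffs = dctn(block, norm='ortho')
energy = coeffs ** 2
low = energy[:4, :4].sum()      # top-left 4x4 = lowest 16 of 64 frequencies
print(f"{100 * low / energy.sum():.2f}% of energy in the top-left 4x4")
# Typically > 99.9% for a block this smooth
```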

Quantization: Where the Bits Are Saved

Quantization is where the actual compression happens. We reduce the precision of transform coefficients by dividing them by a quantization step size and rounding to integers:

Q(u,v) = round( F(u,v) / QStep(u,v) )

Where QStep(u,v) comes from a quantization matrix that can be adjusted based on perceptual importance (humans are less sensitive to high-frequency loss).
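A sketch of matrix-based quantization, using the well-known JPEG luminance table as the QStep matrix. Video codecs define their own scaling lists, but the shape of the idea - coarser steps toward high frequencies - is the same:

```python
import numpy as np
from scipy.fft import dctn

# Standard JPEG luminance quantization table: larger steps (coarser
# quantization) toward the high-frequency bottom-right corner.
QSTEP = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

y, x = np.mgrid[0:8, 0:8]
block = 128 + 30 * np.cos(x / 2) + 10 * np.cos(y)   # a smooth test block
coeffs = dctn(block - 128, norm='ortho')            # center, then transform
quantized = np.round(coeffs / QSTEP)                # Q(u,v) = round(F/QStep)
print(np.count_nonzero(quantized), "of 64 coefficients survive")
```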

Higher quantization = more compression = more loss
Lower quantization = less compression = better quality

Modern codecs use adaptive quantization, allocating more bits to complex regions (edges, textures) and fewer bits to flat regions (sky, walls) based on psychovisual models of human perception.

After quantization, we apply entropy coding to losslessly compress the quantized coefficients: the block is scanned in a zigzag order that groups the many trailing zeros together, zero runs are collapsed by run-length coding, and the resulting symbols are packed into bits with variable-length or arithmetic coding (CAVLC and CABAC in H.264, for example).
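A sketch of the first two steps. The zigzag order is generated by sorting anti-diagonals, which matches the standard 8×8 scan pattern; real entropy coders follow with CAVLC/CABAC-style bit-level coding:

```python
import numpy as np

def zigzag(block):
    """Scan an NxN block along anti-diagonals, low frequencies first."""
    n = block.shape[0]
    order = sorted(((y, x) for y in range(n) for x in range(n)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return [block[y, x] for y, x in order]

def run_length(seq):
    """(zero_run, value) pairs - long zero runs collapse to single symbols."""
    out, run = [], 0
    for v in seq:
        if v == 0:
            run += 1
        else:
            out.append((run, v))
            run = 0
    return out + [('EOB',)] if run else out   # end-of-block marker

# Quantized coefficients from a smooth block: a few values, then all zeros.
q = np.zeros((8, 8), dtype=int)
q[0, 0], q[0, 1], q[1, 0], q[2, 2] = 25, -3, 2, 1
print(run_length(zigzag(q)))
# [(0, 25), (0, -3), (0, 2), (9, 1), ('EOB',)]
```

Sixty-four coefficients shrink to five symbols here, which is why smooth, heavily quantized blocks cost almost nothing to code.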


In-Loop Filtering: Closing the Quality Gap

Quantization introduces artifacts, particularly at block boundaries. To prevent these artifacts from propagating and accumulating through frames (via the prediction loop), modern codecs apply in-loop filtering: