Lesson 1 of 6

How Video Codecs Work

From raw video to compressed stream - understanding the codec pipeline

Imagine trying to stream the entire Lord of the Rings trilogy in uncompressed 4K quality. At 24 frames per second with 8-bit color, the roughly nine-hour theatrical trilogy works out to around 20 terabytes of data - that's like trying to drink from a fire hose through a straw! Yet Netflix delivers 4K at about 15 megabits per second, or roughly 60 gigabytes for the whole trilogy. This 300:1 compression ratio isn't magic - it's the result of sophisticated algorithms that exploit how humans perceive video and the inherent redundancies in visual data.
TL;DR

Video codecs compress video by removing spatial and temporal redundancies through a pipeline of partitioning, prediction, transformation, quantization, and entropy coding. Understanding this pipeline reveals why modern codecs like H.264, HEVC, and AV1 achieve such impressive compression ratios while maintaining visual quality.

Raw Video Sizes: The Compression Challenge

Before we can appreciate compression, we need to understand just how large uncompressed video really is. At 30 frames per second with 8-bit color (3 bytes per pixel), the raw data rates for common resolutions are:

  • 1080p (1920×1080): ~187 MB/s (~1.5 Gbps), about 0.7 TB per hour
  • 4K (3840×2160): ~746 MB/s (~6 Gbps), about 2.7 TB per hour
  • 8K (7680×4320): ~3 GB/s (~24 Gbps), about 10.7 TB per hour

Professional video often uses 10-bit or 12-bit depth, making these sizes even larger. At these bitrates, even a short video would be impractical to store or transmit without compression.
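These figures are pure arithmetic (width × height × bytes per pixel × frame rate); a short Python sketch reproduces them:

```python
def raw_video_rate(width, height, fps, bytes_per_pixel=3):
    """Uncompressed data rate in bytes per second."""
    return width * height * bytes_per_pixel * fps

for name, (w, h) in {"1080p": (1920, 1080),
                     "4K":    (3840, 2160),
                     "8K":    (7680, 4320)}.items():
    rate = raw_video_rate(w, h, fps=30)          # 8-bit color at 30 fps
    print(f"{name}: {rate / 1e6:,.0f} MB/s "
          f"({rate * 8 / 1e9:.1f} Gbps, "
          f"{rate * 3600 / 1e12:.2f} TB/hour)")
```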

Fun fact: A 2-hour movie in uncompressed 8K at 60 fps would require roughly 43 terabytes of storage - more than 9,000 DVDs or 1,700 Blu-ray discs!

The Video Codec Pipeline: Six Stages of Compression

Whether it's H.264 from 2003 or AV1 from 2018, all modern video codecs follow essentially the same six-stage pipeline. Each stage serves a specific purpose in reducing redundancy while preserving perceptual quality:

  1. Partitioning: Divide frames into blocks for localized processing
  2. Prediction: Estimate block content to reduce what needs encoding
  3. Transform: Convert residual data to frequency domain
  4. Quantization: Reduce precision of transform coefficients (main lossy step)
  5. Entropy Coding: Losslessly compress the quantized data
  6. Reconstruction: Rebuild frames for use in prediction (feedback loop)

Let's walk through each stage with a simple 8×8 pixel block example:

🔲 Stage 1: Partitioning

Modern codecs use quadtree partitioning to adapt block sizes to content complexity. Simple areas use large blocks; detailed areas use small blocks.
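As a rough illustration, here is a minimal Python sketch that splits blocks based on pixel variance. The variance threshold is a made-up stand-in: real encoders decide splits by rate-distortion cost, not raw variance.

```python
import numpy as np

def quadtree_split(block, top, left, min_size=8, threshold=100.0):
    """Recursively split a square block while its pixel variance is high.

    Returns a list of (top, left, size) leaf blocks.
    """
    size = block.shape[0]
    if size <= min_size or np.var(block) < threshold:
        return [(top, left, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            sub = block[dy:dy + half, dx:dx + half]
            leaves += quadtree_split(sub, top + dy, left + dx,
                                     min_size, threshold)
    return leaves

# A flat region stays one big block; a noisy region splits down to 8x8.
flat  = np.full((64, 64), 128.0)
noisy = np.random.default_rng(0).integers(0, 256, (64, 64)).astype(float)
print(len(quadtree_split(flat, 0, 0)))   # 1 leaf
print(len(quadtree_split(noisy, 0, 0)))  # 64 leaves
```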

โžก๏ธ Stage 2: Prediction

Instead of encoding every pixel, we encode motion vectors that describe how blocks move between frames, plus the small prediction error (residual).
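A tiny sketch of the idea on synthetic frames (the shift amount and block coordinates are arbitrary choices for the demo):

```python
import numpy as np

# Reference frame and current frame: the whole image has shifted
# 2 pixels right and 1 pixel down between frames.
rng = np.random.default_rng(1)
prev_frame = rng.integers(0, 256, (32, 32)).astype(np.int16)
cur_frame = np.roll(prev_frame, shift=(1, 2), axis=(0, 1))

# Encode the block at (8, 8) with motion vector (dy=1, dx=2):
mv_y, mv_x = 1, 2
cur_block = cur_frame[8:16, 8:16]
predicted = prev_frame[8 - mv_y:16 - mv_y, 8 - mv_x:16 - mv_x]
residual = cur_block - predicted   # all zeros here: a perfect prediction

print(np.abs(residual).sum())      # 0 -> only the motion vector needs coding
```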

📊 Stage 3: Transform (DCT)

The Discrete Cosine Transform converts spatial pixel data into frequency coefficients. Most visual information is concentrated in low-frequency coefficients (top-left), allowing aggressive compression of high-frequency details.

🔢 Stage 4: Quantization

Quantization reduces the precision of transform coefficients. Higher QP values mean more compression but more visible artifacts. This is the primary lossy step in video compression.
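Putting stages 3 and 4 together, here is a sketch of the 8×8 walk-through promised above, using scipy's DCT. A uniform quantizer step stands in for a real codec's full QP-to-step-size mapping:

```python
import numpy as np
from scipy.fft import dctn, idctn

# A smooth 8x8 luma block (a horizontal gradient), typical of "easy" content.
block = np.tile(np.linspace(50, 120, 8), (8, 1))
coeffs = dctn(block, norm='ortho')              # Stage 3: transform

for qstep in (2, 10, 40):                       # larger step = higher "QP"
    quantized = np.round(coeffs / qstep)        # Stage 4: quantize (lossy)
    rebuilt = idctn(quantized * qstep, norm='ortho')  # Stage 6: reconstruct
    mse = np.mean((block - rebuilt) ** 2)
    print(f"step {qstep:2d}: {np.count_nonzero(quantized):2d}/64 "
          f"nonzero coefficients, MSE {mse:.2f}")
```

As the step size grows, fewer coefficients survive quantization and the reconstruction error rises - exactly the compression/quality trade-off the QP controls.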

Frame Types: I, P, and B Frames Explained

Video codecs use different frame types to balance compression efficiency with random access and error resilience. Understanding these frame types is key to understanding video streaming and video conferencing trade-offs.

  • I-frames (Intra-coded): Encoded independently using only spatial information from within the same frame. Serve as random access points and reference for other frames. Largest size but most resilient.
  • P-frames (Predictive): Encoded using motion compensation from previous I or P frames. Medium size, good compression, require previous frame for decoding.
  • B-frames (Bidirectional): Encoded using motion compensation from both previous and future frames. Smallest size, best compression, but require both directions for decoding and increase latency.
A typical encoding sequence for streaming might look like: I B B P B B P B B P B B I ... This provides excellent compression while allowing efficient seeking (jump to any I-frame) and error recovery (the loss of a frame doesn't propagate indefinitely).

Trade-off: B-frames compress best (~50% smaller than P-frames) but require more memory (to store future frames) and processing (more complex motion compensation).

📼 GOP (Group of Pictures) Structure

The GOP structure defines the pattern of I, P, and B frames. Smaller GOPs improve seeking and error resilience but reduce compression efficiency. Larger GOPs increase compression but make seeking slower and error propagation worse.
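A small sketch of how display order differs from decode order for the I B B P pattern above. B-frames can't be decoded until their future reference arrives, which is where the extra latency comes from; the reordering rule here is simplified (in particular, trailing B-frames would really wait for the next GOP's I-frame):

```python
def gop_display_order(gop_size=12, b_frames=2):
    """Frame types in display order: I, then repeating [B..B, P] groups."""
    types = ['I']
    while len(types) < gop_size:
        types += ['B'] * b_frames + ['P']
    return types[:gop_size]

def decode_order(types):
    """Each forward reference (I/P) is decoded before the B-frames
    that are displayed ahead of it."""
    out, pending_b = [], []
    for i, t in enumerate(types):
        if t == 'B':
            pending_b.append((i, t))
        else:
            out.append((i, t))
            out += pending_b
            pending_b = []
    return out + pending_b   # simplified handling of the GOP tail

types = gop_display_order()
print(' '.join(types))                             # I B B P B B P B B P B B
print(' '.join(f"{t}{i}" for i, t in decode_order(types)))
# I0 P3 B1 B2 P6 B4 B5 P9 B7 B8 B10 B11
```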

Motion Compensation: Beyond Simple Block Matching

Early motion compensation used simple block matching with integer pixel precision. Modern codecs use sophisticated techniques that significantly improve compression efficiency:

  • Sub-pixel precision: half- and quarter-pixel motion vectors, computed with interpolation filters
  • Variable block sizes: separate motion vectors for partitions from large blocks down to 4×4
  • Multiple reference frames: a block can be predicted from any of several previously decoded frames
  • Weighted prediction: reference blocks can be scaled and blended, which helps with fades and dissolves

Common misconception: Motion compensation doesn't track objects - it tracks pixel blocks. A person walking might be represented by dozens of different motion vectors as their clothing, limbs, and background move differently.
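For concreteness, here is the classic integer-pixel baseline the section starts from: an exhaustive search that picks the motion vector minimizing the sum of absolute differences (SAD). Block size, search range, and the synthetic frames are illustrative choices:

```python
import numpy as np

def best_motion_vector(cur_block, ref_frame, top, left, search=8):
    """Full search over +/-search integer offsets, minimizing SAD."""
    n = cur_block.shape[0]
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > ref_frame.shape[0] \
                    or x + n > ref_frame.shape[1]:
                continue                      # candidate falls off the frame
            candidate = ref_frame[y:y + n, x:x + n]
            sad = np.abs(cur_block.astype(int) - candidate.astype(int)).sum()
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad

# Synthetic test: the current block is the reference patch shifted by (3, -2).
rng = np.random.default_rng(2)
ref = rng.integers(0, 256, (64, 64)).astype(np.uint8)
cur_block = ref[19:27, 14:22]          # patch really lives at (19, 14)
print(best_motion_vector(cur_block, ref, top=16, left=16))
# ((3, -2), 0) -- the search finds the shift exactly, with zero residual
```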

Transform Coding: Why DCT Works So Well

After motion compensation, we're left with prediction residuals - the differences between predicted and actual pixel values. These residuals often have special properties that make them ideal for transform coding:

Energy compaction: Most of the signal energy is concentrated in a few low-frequency coefficients. This lets us aggressively quantize (reduce precision of) high-frequency coefficients that contribute less to perceived quality.

For an N×N block, the 2-D DCT is:

F(u,v) = α(u)α(v) Σ Σ f(x,y) cos[((2x+1)uπ)/(2N)] cos[((2y+1)vπ)/(2N)]

Where α(0) = √(1/N), α(u) = √(2/N) for u > 0, f(x,y) is the input block, and both sums run over x, y = 0, …, N−1.

The DCT is particularly effective because it approximates the Karhunen-Loève Transform (KLT) for first-order Markov processes, which closely models the correlation in natural image data.
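You can check energy compaction directly: transform a smooth block and measure what share of the total coefficient energy lands in the low-frequency corner. The test block here is made up for the demo:

```python
import numpy as np
from scipy.fft import dctn

# A smooth 8x8 block: two low-frequency ramps plus mild noise.
y, x = np.mgrid[0:8, 0:8]
block = 100 + 4 * x + 2 * y + np.random.default_rng(3).normal(0, 1, (8, 8))

coeffs = dctn(block, norm='ortho')
energy = coeffs ** 2
low = energy[:4, :4].sum()      # top-left 4x4 = lowest 16 of 64 frequencies
print(f"{100 * low / energy.sum():.2f}% of energy in the top-left 4x4")
# Typically > 99.9% for a block this smooth
```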

Quantization: Where the Bits Are Saved

Quantization is where the actual compression happens. We reduce the precision of transform coefficients by dividing them by a quantization step size and rounding to integers:

Q(u,v) = round( F(u,v) / QStep(u,v) )

Where QStep(u,v) comes from a quantization matrix that can be adjusted based on perceptual importance (humans are less sensitive to high-frequency loss).
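A sketch of matrix-based quantization, using the well-known JPEG luminance table as the QStep matrix. Video codecs define their own scaling lists, but the shape of the idea - coarser steps toward high frequencies - is the same:

```python
import numpy as np
from scipy.fft import dctn

# Standard JPEG luminance quantization table: larger steps (coarser
# quantization) toward the high-frequency bottom-right corner.
QSTEP = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

y, x = np.mgrid[0:8, 0:8]
block = 128 + 30 * np.cos(x / 2) + 10 * np.cos(y)   # a smooth test block
coeffs = dctn(block - 128, norm='ortho')            # center, then transform
quantized = np.round(coeffs / QSTEP)                # Q(u,v) = round(F/QStep)
print(np.count_nonzero(quantized), "of 64 coefficients survive")
```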

Higher quantization = more compression = more loss
Lower quantization = less compression = better quality

Modern codecs use adaptive quantization, allocating more bits to complex regions (edges, textures) and fewer bits to flat regions (sky, walls) based on psychovisual models of human perception.

After quantization, we apply entropy coding to losslessly compress the quantized coefficients: the block is scanned in a zigzag order that groups the many trailing zeros together, zero runs are collapsed by run-length coding, and the resulting symbols are packed into bits with variable-length or arithmetic coding (CAVLC and CABAC in H.264, for example).
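A sketch of the first two steps. The zigzag order is generated by sorting anti-diagonals, which matches the standard 8×8 scan pattern; real entropy coders follow with CAVLC/CABAC-style bit-level coding:

```python
import numpy as np

def zigzag(block):
    """Scan an NxN block along anti-diagonals, low frequencies first."""
    n = block.shape[0]
    order = sorted(((y, x) for y in range(n) for x in range(n)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return [block[y, x] for y, x in order]

def run_length(seq):
    """(zero_run, value) pairs - long zero runs collapse to single symbols."""
    out, run = [], 0
    for v in seq:
        if v == 0:
            run += 1
        else:
            out.append((run, v))
            run = 0
    return out + [('EOB',)] if run else out   # end-of-block marker

# Quantized coefficients from a smooth block: a few values, then all zeros.
q = np.zeros((8, 8), dtype=int)
q[0, 0], q[0, 1], q[1, 0], q[2, 2] = 25, -3, 2, 1
print(run_length(zigzag(q)))
# [(0, 25), (0, -3), (0, 2), (9, 1), ('EOB',)]
```

Sixty-four coefficients shrink to five symbols here, which is why smooth, heavily quantized blocks cost almost nothing to code.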


In-Loop Filtering: Closing the Quality Gap

Quantization introduces artifacts, particularly at block boundaries. To prevent these artifacts from propagating and accumulating through frames (via the prediction loop), modern codecs apply in-loop filtering: