How Video Codecs Work
From raw video to compressed stream - understanding the codec pipeline
Video codecs compress video by removing spatial and temporal redundancies through a pipeline of partitioning, prediction, transformation, quantization, and entropy coding. Understanding this pipeline reveals why modern codecs like H.264, HEVC, and AV1 achieve such impressive compression ratios while maintaining visual quality.
Raw Video Sizes: The Compression Challenge
Before we can appreciate compression, we need to understand just how large uncompressed video really is. Let's break down the numbers for common resolutions:
- SD (640×480): ~28 MB per second at 30 fps
- HD (1280×720): ~83 MB per second at 30 fps
- Full HD (1920×1080): ~187 MB per second at 30 fps
- 4K UHD (3840×2160): ~746 MB per second at 30 fps
- 8K UHD (7680×4320): ~3 GB per second at 30 fps
These numbers assume 8-bit color depth (3 bytes per pixel). Professional video often uses 10-bit or 12-bit depth, making these sizes even larger. At these bitrates, even a short video would be impractical to store or transmit without compression.
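To make the arithmetic concrete, here is a minimal Python sketch that reproduces the figures above; the resolutions and the 3-bytes-per-pixel assumption come straight from the list.

```python
# Raw (uncompressed) video data rates at 8-bit color (3 bytes per pixel).
RESOLUTIONS = {
    "SD": (640, 480),
    "HD": (1280, 720),
    "Full HD": (1920, 1080),
    "4K UHD": (3840, 2160),
    "8K UHD": (7680, 4320),
}

BYTES_PER_PIXEL = 3  # 8 bits for each of the three color components
FPS = 30

for name, (width, height) in RESOLUTIONS.items():
    bytes_per_second = width * height * BYTES_PER_PIXEL * FPS
    print(f"{name:8s} {width}x{height}: {bytes_per_second / 1e6:8.1f} MB/s")
```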
The Video Codec Pipeline: Six Stages of Compression
Whether it's H.264 from 2003 or the latest AV1 codec, all modern video codecs follow essentially the same six-stage pipeline. Each stage serves a specific purpose in reducing redundancy while preserving perceptual quality:
- Partitioning: Divide frames into blocks for localized processing
- Prediction: Estimate block content to reduce what needs encoding
- Transform: Convert residual data to frequency domain
- Quantization: Reduce precision of transform coefficients (main lossy step)
- Entropy Coding: Losslessly compress the quantized data
- Reconstruction: Rebuild frames for use in prediction (feedback loop)
Let's walk through each stage with a simple 8×8 pixel block example, sketched below:
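The following Python sketch (using numpy and scipy) runs one 8×8 block through prediction, transform, quantization, and reconstruction. The flat mean predictor and the single uniform quantizer step are toy choices made purely for illustration; real codecs replace every one of them with far more sophisticated tools.

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)

# Stages 1-2: take one 8x8 block and form a prediction residual.
# A real encoder predicts from neighboring pixels or previous frames;
# here the block mean stands in as a "flat" predictor.
block = rng.integers(100, 140, size=(8, 8)).astype(np.float64)
prediction = np.full((8, 8), block.mean())
residual = block - prediction

# Stage 3: the 2D DCT concentrates the residual's energy in few coefficients.
coeffs = dctn(residual, norm="ortho")

# Stage 4: uniform quantization, the lossy step. QStep = 10 is arbitrary.
qstep = 10.0
quantized = np.round(coeffs / qstep)

# Stage 5 (stand-in): entropy coders thrive on the many zeros produced here.
print("nonzero coefficients:", np.count_nonzero(quantized), "of 64")

# Stage 6: reconstruction, exactly as the decoder would do it.
reconstructed = prediction + idctn(quantized * qstep, norm="ortho")
print("max abs error:", np.abs(block - reconstructed).max())
```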
Frame Types: I, P, and B Frames Explained
Video codecs use different frame types to balance compression efficiency with random access and error resilience. Understanding these frame types is key to understanding video streaming and video conferencing trade-offs.
- I-frames (intra-coded): compressed independently using only spatial prediction, like a standalone image; largest, but provide seek and recovery points
- P-frames (predictive): coded as motion-compensated differences from an earlier reference frame; much smaller than I-frames
- B-frames (bi-directional): predicted from both earlier and later reference frames; typically the smallest of the three
A typical encoding sequence for streaming might look like: I B B P B B P B B P B B I ... This provides excellent compression while allowing efficient seeking (jump to any I-frame) and error recovery (corruption cannot propagate past the next I-frame).
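As a small illustration of why I-frames enable seeking, here is a hypothetical helper that mimics what a player does: to display an arbitrary frame it must decode forward from the nearest preceding I-frame. The GOP pattern matches the example sequence above.

```python
# GOP pattern from the example above (display order).
GOP = list("IBBPBBPBBPBB")

def frames_to_decode(target: int) -> range:
    """Frames that must be decoded to display frame `target`.

    Simplification: assumes each frame depends only on earlier frames,
    so decoding can start at the nearest preceding I-frame. (Real B-frames
    also reference later frames, so decode order differs from display order.)
    """
    frame_types = [GOP[i % len(GOP)] for i in range(target + 1)]
    last_i = max(i for i, t in enumerate(frame_types) if t == "I")
    return range(last_i, target + 1)

print(list(frames_to_decode(14)))  # frames 12-14: the I-frame at 12 restarts decoding
```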
Motion Compensation: Beyond Simple Block Matching
Early motion compensation used simple block matching with integer-pixel precision (sketched after this list). Modern codecs use sophisticated techniques that significantly improve compression efficiency:
- Variable block sizes: From 4×4 to 64×64 partitions, adapting to content
- Fractional pixel accuracy: Quarter-pixel (H.264, HEVC) or eighth-pixel (AV1) precision
- Multiple reference frames: Using more than just the previous frame
- Weighted prediction: Accounting for lighting changes and fade effects
- Motion vector prediction: Predicting motion vectors from neighboring blocks
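Below is a minimal numpy sketch of the classic integer-pixel full-search block matching mentioned before the list: it finds the motion vector minimizing the sum of absolute differences (SAD) within a small search window. The block size and search range are arbitrary illustrative choices.

```python
import numpy as np

def full_search(ref: np.ndarray, cur_block: np.ndarray,
                top: int, left: int, search: int = 8):
    """Integer-pixel full-search block matching.

    Returns the motion vector (dy, dx) into `ref` that minimizes the
    sum of absolute differences (SAD) against `cur_block`, whose
    top-left corner sits at (top, left) in the current frame.
    """
    n = cur_block.shape[0]
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > ref.shape[0] or x + n > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            sad = np.abs(ref[y:y + n, x:x + n] - cur_block).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

# Toy usage: the "current" 16x16 block is the reference shifted by (2, 3).
rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(64, 64)).astype(np.int64)
cur_block = ref[18:34, 19:35]
print(full_search(ref, cur_block, top=16, left=16))  # -> ((2, 3), 0)
```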
Transform Coding: Why DCT Works So Well
After motion compensation, we're left with prediction residuals - the differences between predicted and actual pixel values. These residuals often have special properties that make them ideal for transform coding:
Energy compaction: Most of the signal energy is concentrated in a few low-frequency coefficients. This lets us aggressively quantize (reduce precision of) high-frequency coefficients that contribute less to perceived quality.
The 2D DCT of an N×N block is:

F(u,v) = α(u) α(v) · Σ Σ f(x,y) · cos[(2x+1)uπ / 2N] · cos[(2y+1)vπ / 2N], with both sums running from x = 0 and y = 0 to N−1.

Where α(0) = √(1/N), α(u) = √(2/N) for u > 0, and f(x,y) is the input block.
The DCT is particularly effective because it approximates the Karhunen-Loรจve Transform (KLT) for first-order Markov processes, which closely models the correlation in natural image data.
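To see the energy compaction claim in numbers, this short sketch measures how much of the DCT energy lands in the low-frequency corner; the synthetic smooth gradient block is a stand-in for typical low-detail content.

```python
import numpy as np
from scipy.fft import dctn

# A smooth 8x8 gradient, a stand-in for typical low-detail image content.
x, y = np.meshgrid(np.arange(8), np.arange(8))
block = (10 * x + 6 * y).astype(np.float64)

coeffs = dctn(block, norm="ortho")
energy = coeffs ** 2

# Share of total energy held by the 16 lowest-frequency coefficients
# (the top-left 4x4 corner of the coefficient block).
low_freq_share = energy[:4, :4].sum() / energy.sum()
print(f"energy in top-left 4x4: {low_freq_share:.1%}")  # ~100% for this block
```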
Quantization: Where the Bits Are Saved
Quantization is where the actual compression happens. We reduce the precision of transform coefficients by dividing them by a quantization step size and rounding to integers:
Quantized(u,v) = round( F(u,v) / QStep(u,v) )

Where QStep(u,v) comes from a quantization matrix that can be adjusted based on perceptual importance (humans are less sensitive to high-frequency loss).
Higher quantization = more compression = more loss.
Lower quantization = less compression = better quality.
Modern codecs use adaptive quantization, allocating more bits to complex regions (edges, textures) and fewer bits to flat regions (sky, walls) based on psychovisual models of human perception.
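Here is a small sketch of matrix-based quantization. The matrix below is a hypothetical perceptual matrix (step sizes grow toward high frequencies, in the spirit of the JPEG luminance table), not one taken from any particular standard.

```python
import numpy as np

# Hypothetical perceptual quantization matrix: coarser steps (bigger
# numbers) for higher frequencies, which the eye tolerates losing.
u, v = np.meshgrid(np.arange(8), np.arange(8))
qmatrix = 8.0 + 4.0 * (u + v)  # QStep grows with frequency

def quantize(coeffs: np.ndarray) -> np.ndarray:
    return np.round(coeffs / qmatrix)

def dequantize(levels: np.ndarray) -> np.ndarray:
    return levels * qmatrix

# Usage with made-up coefficients: large DC, small high-frequency term.
coeffs = np.zeros((8, 8))
coeffs[0, 0] = 500.0
coeffs[7, 7] = 20.0
levels = quantize(coeffs)
print(levels[0, 0], levels[7, 7])  # DC survives (62), high frequency is crushed to 0
```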
Entropy Coding: The Final Lossless Stage
After quantization, we apply entropy coding to losslessly compress the quantized coefficients:
- H.264/AVC: CAVLC (Context-Adaptive Variable Length Coding) or CABAC (Context-Adaptive Binary Arithmetic Coding)
- HEVC/H.265: CABAC with improved context modeling
- AV1: Multi-symbol adaptive arithmetic coding (a range coder derived from Daala)
- VVC/H.266: Enhanced CABAC with improved probability estimation
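Entropy coders are fed coefficients in a low-to-high-frequency scan order so that the trailing zeros produced by quantization clump together. This toy sketch shows that zigzag reordering plus a simple run-length pass; real CAVLC/CABAC coding is far more elaborate, so treat this purely as an intuition builder.

```python
import numpy as np

def zigzag_indices(n: int = 8):
    """Traversal order of an n x n block from low to high frequency."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def run_length(levels: np.ndarray):
    """(run of zeros, nonzero value) pairs over the zigzag scan."""
    pairs, run = [], 0
    for r, c in zigzag_indices(levels.shape[0]):
        value = int(levels[r, c])
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs  # trailing zeros are implied by an end-of-block symbol

levels = np.zeros((8, 8), dtype=int)
levels[0, 0], levels[0, 1], levels[2, 0] = 45, -3, 2
print(run_length(levels))  # [(0, 45), (0, -3), (1, 2)] - 64 values, 3 pairs
```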
In-Loop Filtering: Closing the Quality Gap
Quantization introduces artifacts, particularly at block boundaries. To prevent these artifacts from propagating and accumulating through frames (via the prediction loop), modern codecs apply in-loop filtering:
- Deblocking filter: Reduces blocking artifacts at block boundaries
- Sample Adaptive Offset (SAO): HEVC/VVC feature that reduces banding and ringing
- Adaptive Loop Filter (ALF): VVC feature that adapts to local image statistics
These filters are applied during reconstruction, so their effects are included in the reference frames used for prediction - preventing error propagation.
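For a feel of what a deblocking filter does, here is a deliberately simplified 1D sketch: it smooths samples across a block edge only when the discontinuity is small enough to look like a quantization artifact rather than a real image edge. The threshold and tap weights are invented for illustration.

```python
import numpy as np

def deblock_edge(left: np.ndarray, right: np.ndarray, threshold: float = 12.0):
    """Soften a block boundary between two runs of samples.

    left[-1] and right[0] are the pixels touching the boundary. If the
    jump across the edge is small (likely a quantization artifact), blend
    the boundary pixels; if it is large (likely a real edge), leave it.
    """
    a, b = float(left[-1]), float(right[0])
    if abs(b - a) < threshold:
        left[-1] = round(0.75 * a + 0.25 * b)   # pull boundary pixels
        right[0] = round(0.25 * a + 0.75 * b)   # toward each other
    return left, right

# A flat area that quantization split into 100s and 108s across a block edge:
left = np.array([100, 100, 100, 100])
right = np.array([108, 108, 108, 108])
print(deblock_edge(left, right))  # boundary becomes ...100, 102, 106, 108...
```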
From Concepts to Real Codecs
Understanding the pipeline helps explain why different codecs make different trade-offs.
The choice of codec often comes down to:
- Compatibility: Will your target devices support it?
- Complexity: Do you have the encoding/decoding power available?
- Licensing: Are you prepared to pay royalties or deal with patent licensing?
- Encoding time: Is this live streaming (fast encoding needed) or video on demand (can encode slowly)?