Remember flipbooks? Those tiny booklets with a series of images that change gradually from one page to the next? Flip through the pages in quick succession and the images appear to move. Flipbooks (along with their circular predecessor, the phenakistiscope) are among the earliest examples of animation devices.

Modern videos are essentially flipbooks for pixels on a screen. Every page, called a frame, tells the screen the color and intensity that every pixel must take for the duration of that frame.
Let’s say that you are trying to create a digital video by putting together a bunch of digital images and displaying them on screen in rapid succession, much like you would with a flipbook. How much digital storage space would you need to store such a video?
Let’s do some math!
Assuming that we want to create a 1080p video (a very common standard these days), every image you use would need to be 1920 pixels wide and 1080 pixels tall. Assuming a frame rate (the number of frames displayed per second) of 24 (a standard still used in movies today) and a (very reasonable) size of 2MB per image, a mere 1 minute of video constructed this way would weigh in at a whopping 2880MB (24 x 60 x 2), or almost 3GB!
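If you want to sanity-check that figure, here is the back-of-the-envelope calculation as a tiny Python snippet (the 2MB-per-frame number is the same assumption as above):

```python
# Rough storage estimate for a 1 minute, 24 fps video built from ~2MB images
frames_per_second = 24
duration_seconds = 60
megabytes_per_frame = 2

total_megabytes = frames_per_second * duration_seconds * megabytes_per_frame
print(total_megabytes)         # 2880 MB
print(total_megabytes / 1024)  # ~2.8 GB
```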
At such sizes, you'd quickly run out of storage space before you could store even a couple of movies on your PC. Video-heavy platforms such as YouTube, Instagram, or TikTok wouldn't exist, as streaming costs would be downright prohibitive, not to mention the kind of fast internet speeds you'd need to watch such a video streaming over the internet without buffering.
This is where video encoding comes to the rescue! Video codecs ('codec' is a portmanteau of coder-decoder) are software (or hardware) that compress and decompress digital videos.
Before we move ahead, we need to talk about the difference between encoding and compression.
Encoding refers to representing a piece of information in another form. A common encoding standard you’d perhaps be familiar with is ASCII. In ASCII, the letter A is represented by the decimal number 65, which is in turn represented in binary as 1000001. In this example, the letter A went through two separate encoding layers - one to convert it into its corresponding ASCII code (ASCII encoding) and then again to convert the ASCII code into binary (binary encoding). We do this because computers only understand binary. These encoding steps ensure that the letter A can be stored and transmitted in a format that a computer understands.
Decoding is the reverse process. In the example above, the binary 1000001 is first converted back into its ASCII code, 65, by the binary decoder, and 65 is then converted into the human-readable letter A by the ASCII decoder.
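You can see this round trip in a few lines of Python - the letter A becomes 65, 65 becomes binary, and the decoder walks the same path in reverse:

```python
# Encoding: 'A' -> ASCII code -> binary
ascii_code = ord('A')                 # 65
binary = format(ascii_code, 'b')      # '1000001'

# Decoding: binary -> ASCII code -> 'A'
decoded_code = int(binary, 2)         # 65
decoded_letter = chr(decoded_code)    # 'A'
print(binary, decoded_code, decoded_letter)
```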
Video codecs work in a similar fashion. If we imagine videos as flipbooks for pixels on a screen, as we did earlier, then we need a standard for storing this information. A video encoder takes how every pixel on the screen lights up for every frame of a video and converts this information into a standard, well-defined, universally accepted format. A video decoder, on the other hand, understands how its encoder counterpart represents the pixel information for every frame and uses that knowledge to read the encoded file and tell the screen which pixels to light up, when, and how.
Compression, on the other hand, is the art of making things fit more tightly so that they take up less space. In the real world, taking apart your Ikea furniture and stuffing it into a small box is compression; building it back up is decompression. Similarly, in the digital world, compression refers to stuffing data into as few bits as possible, and decompression refers to getting the original data back.
Very often, you’ll hear the terms encoding and compression used interchangeably. This is because compression is a type of encoding. Making something smaller involves converting the original into something different. One of the biggest challenges when dealing with media revolves around making its size smaller.
Hence, throughout this article, you’ll find that I use encoding and compression interchangeably, even though they are subtly different.
To understand how video encoders work, let's imagine a small 1-bit black-and-white TV screen with pixels in a 5x5 arrangement. Being a 1-bit screen, every pixel can take a value of either 0 or 1, denoting whether the pixel is on or off.
As an example, a frame from a video playing on this TV showing a diamond would look like this:
You could encode this in a matrix form like so:
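In code, one possible rendering of that diamond frame could look like this (a sketch - the exact pixel art in the original illustration may differ), with 1 meaning a pixel is on and 0 meaning it is off:

```python
# One possible 5x5, 1-bit frame showing a diamond outline
diamond_frame = [
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
]
```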
Let's assume that the video is simply the center pixel turning on and off once every second. Once again, let's take a look at how this would look on our 1-bit TV screen:
Assuming that the TV screen has a refresh rate of 5 fps (i.e. 5 frames are displayed in sequence every second), you'd have to store one such matrix for every frame. If this video runs for 1 minute, you'd need to store 300 such matrices (5 x 60), one per frame.
This clearly seems wasteful. Not only does the animation repeat itself, the screen only changes once every 5 frames. So how about we optimize the encoder and encode this video in the following way:
Of course, the decoder will need to be taught how to interpret this information, but the instructions are clear and it achieves the same output as the encoding we did previously. Instead of storing 300 individual matrices, we now only need to store 2.
But we can make this even more efficient. As you can see, only one pixel changes every 5 frames. So rather than even storing 2 separate matrices, we can store just the initial matrix and then encode the fact that just the pixel at (2,2) needs to be inverted every 5 frames.
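Here's a small sketch of that idea (the naming and the instruction format are my own, not an actual codec): store one base frame plus a single "invert the pixel at (2,2) every 5 frames" instruction, and let the decoder rebuild all 300 frames from it. I'm assuming the center pixel starts in the off state.

```python
import copy

# The entire encoded video: one 5x5 base frame + one toggle instruction
base_frame = [[0] * 5 for _ in range(5)]   # all pixels off
toggle_pixel = (2, 2)                      # the center pixel
toggle_every = 5                           # invert it every 5 frames
total_frames = 300                         # 5 fps x 60 seconds

def decode(base, pixel, every, n_frames):
    """Rebuilds every frame from the base frame and the toggle instruction."""
    frames = []
    current = copy.deepcopy(base)
    for frame_number in range(n_frames):
        if frame_number > 0 and frame_number % every == 0:
            row, col = pixel
            current[row][col] ^= 1         # invert the pixel
        frames.append(copy.deepcopy(current))
    return frames

frames = decode(base_frame, toggle_pixel, toggle_every, total_frames)
print(len(frames))   # 300 frames, reconstructed from 1 matrix and 1 rule
```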
If we assume that the matrices take up overwhelmingly more space than the instructions, this reduces the size by another 50%.
While this was an extremely simplified example, encoders fundamentally work this way. Encoding research (for any form of information, be it video, images, audio, text, or files) focuses primarily on finding ways to stuff as much data as possible into the smallest space.
One must note, though, that a high level of compression might not always be desirable. Higher compression often means that the transformations applied to the original data are more complex. This means that the encoding process will either cost more compute power, take more time to complete, or both.
More often than not, the decoding process for data that is encoded using complex transformations is also complex, which could lead to the decoder sharing the same drawbacks as the encoder as mentioned above. Moreover, increased decoding time leads to an increase in latency for the receiver.
A quick side note while we're on this topic: a complex encoding process doesn't always equate to a complex decoding process. A common example is public-key cryptography. Here, the public key is used to encrypt a message, and reversing that encryption without the corresponding private key is computationally infeasible (typically because it rests on the difficulty of factoring large composite numbers), whereas decryption with the private key is a comparatively easy process.
Similarly, you can also create a system where the encoding is simple but the decoding is more complex. A Hamming code based error correction system is a good example of this.
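To make that asymmetry concrete, here's a minimal Hamming(7,4) sketch (my own toy implementation, not tied to any particular system): encoding a 4-bit block is just three XORs, while decoding also has to compute a syndrome and repair the corrupted bit.

```python
def hamming_encode(d):
    """Encode 4 data bits into a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming_decode(c):
    """Correct up to one flipped bit, then return the 4 data bits."""
    p1, p2, d1, p4, d2, d3, d4 = c
    s1 = p1 ^ d1 ^ d2 ^ d4
    s2 = p2 ^ d1 ^ d3 ^ d4
    s4 = p4 ^ d2 ^ d3 ^ d4
    error_position = s1 + 2 * s2 + 4 * s4   # 0 means no error detected
    if error_position:
        c = c[:]
        c[error_position - 1] ^= 1          # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]]

codeword = hamming_encode([1, 0, 1, 1])
codeword[5] ^= 1                            # simulate a single-bit error in transit
assert hamming_decode(codeword) == [1, 0, 1, 1]
```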
All the examples we saw above are examples of lossless encoding. A piece of information is said to be losslessly encoded if there exists a decoder that can convert it back into its original form without any loss of data. We call the encodings in our examples lossless because a decoder could read them and reproduce all 300 original frames without a single error in any pixel of any frame.
Lossless encoding is extremely important when it comes to encoding text or most types of files. For example, if I were to encode the text 'John reached the airport at 11 AM' and the decoder's best attempt at reconstructing it changed the time to '9 AM', the message the text conveys would change completely; such an encoding would be pretty useless.
The opposite of a lossless encoder is a lossy encoder. In fact, the text encoding example we took just before is a good example of a lossy encoding. The decoder was able to get most of the original information back, bar the error in the time.
While lossy encodings sound like a bad idea when it comes to encoding textual information, they are essential when it comes to encoding visual media. Let’s dig a little deeper into why this is so.
Images and videos are essentially instructions on how to light up pixels on a screen. Every pixel is a light source that emits a combination of red, green and blue lights, each at a different level of intensity. The final decoded output of every frame of a digital video (or of an image) at the hardware level tells every pixel on the screen how to shine.
Screens these days have millions of pixels. For example, a 1080p screen has over 2 million (1920 x 1080) pixels. A 4K screen has over 8 million pixels.
Now imagine an encoder that, while encoding an image or a video frame, gets 100 pixels completely wrong - turning 100 blue pixels red.
On a 1080p screen with over 2 million pixels, this is an error of less than 0.005%. This error is so small that it will go unnoticed by the human eye.
As another example, if the error was less blatant, for instance, the pixel color was encoded as light blue instead of dark blue, you could easily have, say, 1% of the total pixels encoded incorrectly this way and still have it go unnoticed.
The amount of information your brain has to process from an image is overwhelmingly larger than from, say, a piece of text. This means that while your brain can pick up errors in text easily, spotting them in images is much harder. In fact, the longer you stare at an image, the more of its pixel errors you are likely to find.
A frame of a video can, in fact, carry a higher percentage of pixel errors than a standalone image without you noticing. A typical movie runs at 24 fps, so each frame stays on the screen for only about 1/24th of a second. Your eyes have such a short span of time to spot errors in a frame that you can get away with a whole lot of nasty encoding 'mistakes' (I put mistakes in quotes because you can't really call something a mistake if it is intentional, can you?).
But why would we intentionally introduce errors into our encoding process? Granted, our eyes can tolerate a lot of individual pixel errors when it comes to video, but wouldn't it still make more sense to encode it losslessly? Are there any advantages to lossy encoding?
To understand this, let's take a look at an encoding example that is quite different from the one we discussed earlier.
Say I have the following pairs of numbers: (0,0), (1,1), (2,4), (3,9), (4,16), (5,25), (6,36), (7,49), (8,64), (9,81), (10,100).
Instead of storing all these number pairs individually, I can store the mathematical function y = x², x ∈ ℤ, 0 ≤ x ≤ 10.
As you can see, storing this function takes up way less space than writing down all the pairs of numbers individually. In fact, the more pairs that we need to write down, the more efficient it is to just write the function, as the size of the function remains constant while the number pairs take up more space the more pairs we have.
Converting the number pairs to the function is an example of an encoding. A decoder for this could simply plug in the values for x within the prescribed limits of the function and get back the number pairs that we started with. Additionally, as you may have observed, this is an example of a lossless encoding.
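A decoder for this encoding is essentially a one-liner - plug every allowed x back into the function:

```python
# Lossless decode: recompute every pair from the stored function y = x^2
decoded_pairs = [(x, x ** 2) for x in range(0, 11)]
print(decoded_pairs)   # [(0, 0), (1, 1), (2, 4), ..., (10, 100)] - exactly the original data
```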
Now let's put in a little twist. Let's change 2 of the input pairs, (5,25) and (9,81), to (5,26) and (9,82) respectively. If the decoder now tries to use the function to get the original data back, it will fail to do so accurately, as these two pairs no longer fit the function y = x².
What do we do now? Should we find a new function that fits these 2 changed input pairs? First of all, finding such a function would be difficult, and doing so through brute force might take quite a bit of computational effort. And even if we found one, it would be far more complex than our existing function y = x² and, indeed, quite a bit longer to write down.
What if we decide to use the same function, y = x², as the encoded representation of our input, even with the 2 changed number pairs? Yes, it is not accurate, but the values are close. When x is 5, the function returns 25, while our input pair for 5 is now 26. What if 25, which is very close to 26, was close enough for all practical purposes? Could we then just use our simpler function, y = x², and get huge savings in exchange for some really minor errors?
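A quick sketch of what that trade looks like: the decoder still recomputes y = x², and only two values come back slightly off.

```python
# The (slightly modified) input data, with the changed pairs (5, 26) and (9, 82)
original = {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 26,
            6: 36, 7: 49, 8: 64, 9: 82, 10: 100}

# What the decoder reconstructs if we keep using y = x^2 as the encoding
decoded = {x: x ** 2 for x in original}

errors = {x: original[x] - decoded[x] for x in original if original[x] != decoded[x]}
print(errors)   # {5: 1, 9: 1} - every other value is exact, and these are off by just 1
```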
This is similar to how a lossy encoder works. Remember how a sizable percentage of a frame's pixels can be wrong without our eyes ever registering it? Following the (very simplistic) example above, would it not make sense to approximate some pixels on the screen if it meant that you could save more space and/or reduce the complexity of the encoding/decoding process?
Video encoding research focuses on finding the right balance between quality, space and codec complexity. It is a game of give and take.
Reduce the space and increase the quality, and you end up with a codec of very high complexity. An encoding scheme that is easy to encode and decode and still produces high quality video will have very high space requirements. You get the gist: you cannot optimize for all three at the same time.
While anyone could build their own video codec, there are a small set of standard codecs that are widely used and accepted by a large number of platforms and software today. Standardization of codecs is a humongous task in itself but it is an important task nonetheless. If standardized codecs were not a thing, you’d need to encode every video into potentially dozens of different encodings to meet the needs of every platform or software you want it to be played on.
One such standard that you have definitely used is H.264 - most of the videos that you watch in MP4 format are encoded using this codec.
Video codecs are complex pieces of software (and hardware) that are ever evolving. New algorithms that push the boundary of what's possible in this space pop up every so often, partly fueled by advances in hardware. While it is almost impossible (and arguably unnecessary) to truly understand the nitty-gritty details of every popular video codec out there, understanding how codecs work in general is essential when approaching video engineering.