Streaming is everyday technology now. It's no longer the special new thing; it's the expected norm, and it should no longer be a great mystery. Yet for many, it remains mystical and imposing.
Understanding how things work and why they work is required if you want to know how best to apply technology and plan for its release. But that understanding does not need to come with complexity and confusion.
Recently I was invited to speak at Worldcon76 in San Jose, California. It is the premier science fiction literary convention, where the Hugo Awards are voted on and given out. I had the honor of appearing on a panel discussing how to improve young people's involvement in S.T.E.A.M. (Science, Technology, Engineering, Arts, and Mathematics), alongside folks like Kevin Roche, an advisor and scientist in IBM Almaden's Magnetoelectronics and Spintronics group and the Chair of Worldcon76.
On the panel we dug into why people often avoid technical professions and seem overwhelmed by technology. A core point Roche raised was that those of us in technical professions can at times make them seem far more complex than they are, out of a need to feel validated or simply out of convenience. This comes at the price of understanding, and at the greater price of losing people's interest in these fields.
This got me thinking about transcoding and the document I recently updated for Kaltura, "The Best Practices for Multi-Device Transcoding, 2018 Edition." The goal of the document is to make transcoding easier to understand and cut through some of the mystique. This is the first of three posts in which I will present excerpts from the Best Practices and show that the concepts behind them are ultimately simple to understand.
We'll start with the basic concepts of streaming, ranging from what encoding actually means to bitrates and codecs.
But here is a simple analogy to get things started. Streaming is not much different from animation. A cartoon (or really, any film) is composed of individual drawings or pictures. When these are shown sequentially at a certain frame rate, the illusion of movement and life is created. Your eye no longer sees individual pictures but rather continuous movement. This phenomenon is known as persistence of vision, and it's very similar to how encoding works. The encoder treats each portion of a video as a series of animations called a "Group of Pictures." And just as animation does not draw every micro-movement of a character (forcing your brain to fill in the gaps between movements to make it feel smooth), the encoder does the same thing. Between each main frame, aka Key Frame, the encoder makes guesses, since it cannot represent every frame fully if it is to compress the video enough to make it streamable.
Now you’re ready to dive into the basic concepts and settings of encoding – so let’s take a leap:
General Settings and Concepts of Transcoding
Before we can dive into the specific device groups and their intended settings it is important to understand the essentials of transcoding. This section will detail the most commonly used settings and terminology for transcoding.
First, let's start with the basic idea of video encoding. Video encoding takes a source moving image sequence and compresses it into a format that is then readable, or decodable, by an end player or set of players. The reason I use the heavy phrase "moving image sequence" is that a video or film really is simply a series of still images played back at a certain speed (the frame rate). In order to take, for example, an old-fashioned 35mm film and put it into a computer for editing, the film has to be scanned one frame at a time and the information in each 35mm frame translated into 1s and 0s. That translation is encoding.
"Transcoding," then, is taking a previously encoded piece of video and translating it further into another format or process. Transcoding converts the source file into one or more new, more compressed streams that can then be played in a player on a computer or mobile device, depending on the settings and methodologies used.
USE CASE: A video editor has 4K source footage at 100 Mbps from the camera used to capture the content. But the 4K video is too large in resolution and bitrate for his editing system to handle; he'd prefer to edit in 1080p instead. He could transcode that 4K source footage into a new proxy file at 1080p and a much more compressed 7000 Kbps, with a 1-second GOP or Key Frame Interval. Once done editing, he can relink the original 4K files and output a 4K master.
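The arithmetic behind this use case is simple enough to sketch. Here is a minimal Python illustration; the frame rate and the function names are my own assumptions for the example, not anything specified in the Best Practices document.

```python
def keyframe_interval(fps: float, gop_seconds: float = 1.0) -> int:
    """A 1-second GOP means one key frame every `fps` frames."""
    return round(fps * gop_seconds)


def proxy_savings(source_kbps: float, proxy_kbps: float) -> float:
    """Fraction of data saved by editing against the proxy instead of the source."""
    return 1 - proxy_kbps / source_kbps


# Assuming a 24 fps source for illustration:
print(keyframe_interval(fps=24))              # -> 24 (one key frame per second)
print(f"{proxy_savings(100_000, 7000):.0%}")  # 100 Mbps 4K vs 7000 kbps proxy -> 93%
```

The proxy carries only about 7% of the source's data rate, which is why the editing system that choked on 4K handles the 1080p proxy comfortably.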
A bitrate is a measurement of data speed across a network, often in kilobits per second, or kbps (1000 bits per second). This number correlates with the bandwidth levels a user may experience and should be in balance with the resolution of the stream.
A household with a data plan limited to 10 Mbps generally cannot handle a bitrate over 6500 kbps. You may wonder why, if that household can handle up to 10 Mbps, it could only support 6500 kbps. Why not the full 10 Mbps? Because although the average data rate of the video stream may be 6500 kbps, it will spike 30% to 50% above that average at various points if the content creator has transcoded using a variable bitrate, which is a common method. Additionally, home bandwidth is shared with other devices using it. The player or device displaying the video might also run additional plugins and widgets, such as analytics tools and DRM (Digital Rights Management), that add overhead. Therefore, a buffer must be considered.
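That headroom calculation can be written down directly. The sketch below simply inverts the spike logic described above; the function name and the worst-case 50% spike factor are assumptions for illustration.

```python
def safe_average_bitrate(bandwidth_kbps: float, spike_factor: float = 0.5) -> float:
    """Highest average bitrate whose worst-case VBR spike still fits the connection.

    spike_factor of 0.5 models spikes up to 50% above the average.
    """
    return bandwidth_kbps / (1 + spike_factor)


# A 10 Mbps household, allowing for 50% spikes:
print(round(safe_average_bitrate(10_000)))  # -> 6667 kbps, close to the 6500 above
```

Note this ignores the other overheads mentioned (shared devices, DRM, analytics), which is why a slightly more conservative figure like 6500 kbps is the safer target.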
If you are using Akamai HD HTTP Streaming, bit rate spikes are less of an issue, as it uses client-side caching allowing for a higher threshold. However, if network conditions worsen, those spikes may still present a problem. Again, it is always about a balance between performance and visual quality.
A video encoder can choose between a few different methodologies for managing bitrate. These are:
Constant Bitrate: the video bitrate does not change regardless of the image complexity.
Variable Bitrate: the video bitrate fluctuates depending on the complexity of changes from one frame to another. If there are few changes from frame to frame, those frames can be predicted and compressed more easily, allowing a lower bitrate for those sections. If there are big changes from frame to frame, as in a movie trailer, a higher bitrate may be required to avoid visible quality artifacts.
Average vs Max
The Video Bit rate has two main components: The Average and the Max.
AVERAGE: The average bit rate for video should coincide with the target bandwidth of the end user and should be in balance with the resolution.
Changing the bitrate alone is not sufficient for dealing with bandwidth limitations and is not recommended. Lower bitrates should also scale down in resolution so that the end user gets a good balance between image and playback quality.
MAX BIT RATE: The max bitrate governs the ceiling a variable bitrate may reach and should be in balance with the average and the target connection/device. Max bitrates do not affect progressive download. Some services, like Akamai, have built-in functionality that limits the impact of bitrate spikes. That being said, not everyone uses Akamai, and progressive download is not often ideal for streaming.
Potential performance issues due to High Max Bit Rate include unwanted bit rate switching, stutter, player crash, and buffering.
Industry standards say to calculate the max bitrate by taking the average and adding 50%. I recommend reducing this to 30% in order to create a truly consistent stream while still taking advantage of a variable bitrate.
USE CASE: Say a transcoded stream has an average of 1400 kbps but spikes to 2600 kbps. A user whose bandwidth can only support between 1000 kbps and 2000 kbps may experience one of the performance issues above, which is likely for the many users whose ISPs cap their data plans. When the stream spikes to 2600 kbps, it is outside the threshold that user can handle.
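The max-bitrate guideline and the use case above boil down to two small checks. This is an illustrative sketch with assumed function names; the 30% headroom is the conservative figure recommended earlier.

```python
def max_bitrate(average_kbps: float, headroom: float = 0.3) -> float:
    """Ceiling for a VBR stream: average plus 30% headroom."""
    return average_kbps * (1 + headroom)


def spike_fits(peak_kbps: float, user_bandwidth_kbps: float) -> bool:
    """True if the stream's worst-case spike stays within the user's bandwidth."""
    return peak_kbps <= user_bandwidth_kbps


avg = 1400
print(round(max_bitrate(avg)))   # -> 1820 kbps ceiling under the 30% rule
print(spike_fits(2600, 2000))    # -> False: the 2600 kbps spike exceeds the cap
```

Had the stream been encoded with the 30% ceiling (1820 kbps) instead of spiking to 2600 kbps, it would have stayed inside that user's 2000 kbps limit.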
Bits per Pixel
Bits per pixel is a measurement of how many bits are assigned to each pixel in the encoded stream. The higher the number, the better the image quality and the closer the result is to the source file's color and sharpness. But remember, visual quality is also qualitative and subjective. Our eyes are only capable of seeing so much detail, which is why compression works so well: our eyes fill in the edges and smooth over encoding artifacts such as blocking. Other features used to encode the video, such as noise reduction, may also mask quality issues.
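Bits per pixel is just the bitrate divided among every pixel of every frame. A minimal sketch, with the 1080p30 example figures chosen by me for illustration:

```python
def bits_per_pixel(bitrate_kbps: float, width: int, height: int, fps: float) -> float:
    """Bits available per pixel per frame: bitrate / (pixels per frame * frame rate)."""
    return (bitrate_kbps * 1000) / (width * height * fps)


# 1080p at 30 fps encoded at 4000 kbps:
print(round(bits_per_pixel(4000, 1920, 1080, 30), 3))  # -> 0.064
```

The same formula shows why resolution and bitrate must scale together: doubling the resolution at the same bitrate halves the bits available to each pixel, twice over.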
A big attribute of streaming video is the buffer. The buffer is where data is held until it is needed. In streaming video over an HTTP-type connection, the video and audio arrive at the player inside packets of data. These packets stream into the buffer, and as the player needs them, they are pulled from the buffer and displayed. The rate at which this happens is determined by the encoded stream's buffer settings. It is recommended that the buffer be set to 150% of the bitrate. So if the average video bitrate is 4000 kbps, the buffer should hold at least 6000 kb, or about 1.5 seconds of video, though you can go higher with newer devices, especially smart TVs and connected devices, which may support 2 seconds or more.
Think of the buffer as a cup and the player as a very thirsty runner. If the thirsty runner, or player, goes to drink from the cup and there is no data in it, then the thirsty runner might trip and fall down from lack of good hydration (aka a player crash). Or they might simply stumble (or buffer) a little until more water can be found in the cup. When the cup is empty but the player is thirsty, this is called a Buffer Underrun.
Now you might fill that cup too quickly, and it might start to spill over, losing valuable data, causing buffering or stuttering issues as well – this is called Buffer Overrun. So you want that flow into the cup to be in balance with the needs of the playback and always be at a steady pace.
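Setting aside the cup, the buffer guideline is a one-line multiplication. A small sketch under the 150% rule described above (function names are my own for illustration):

```python
def buffer_size_kb(average_kbps: float, multiplier: float = 1.5) -> float:
    """Buffer sized at 150% of the average bitrate, per the guideline above."""
    return average_kbps * multiplier


def buffer_seconds(buffer_kb: float, average_kbps: float) -> float:
    """How many seconds of video the buffer holds at the average bitrate."""
    return buffer_kb / average_kbps


buf = buffer_size_kb(4000)
print(buf)                        # -> 6000.0 kilobits
print(buffer_seconds(buf, 4000))  # -> 1.5 seconds of video
```

For a smart TV target that tolerates 2 seconds of buffer, you would simply pass `multiplier=2.0`.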
Codec is short for coder/decoder and refers to the methodology and encoding library used to transcode the video or audio. In order for a stream of video or audio to be played back, the receiving player must be able to decode it and present it properly.
For streaming video, the primary codec currently in use is H264 (aka Advanced Video Coding, or AVC). H264 is the descendant of the MPEG codecs used for video recording, DVD, and broadcast. MPEG stands for Moving Picture Experts Group, the consortium of industry professionals who created the MPEG standards and methodologies for distributing digital video. H264 is more specifically referred to as MPEG-4 Part 10. A popular open-source implementation of the H264 standard is x264, a free encoder library that conforms to the specification.
H264 is broken down into three main profiles of complexity: Baseline, Main, and High. Each profile is further divided into levels that govern the intensity of certain functionalities and features and allow larger bitrates and resolutions as the level increases. These settings are crucial for good device playback, as not all devices support all the various levels and combinations. The differences between the profiles come down to the maximum resolution and bitrate supported and certain predictive features, such as allowing B-Frames.
BASELINE: Baseline is the original profile, from when streaming video was first used primarily for video conferencing. It does not support B-Frames and therefore delivers less visual detail than the Main or High profiles, with more visible encoding artifacts.
MAIN: Main is the next profile level up. It includes better motion prediction, B-Frames, and automatic scene detection.
HIGH: High offers even better color accuracy, motion prediction, and detail, at the cost of being harder to decode.
The level utilized determines how high a resolution and bitrate are possible, as well as which other features get turned on as the level increases. Levels range from 1 through 6.2. Here are some nominal settings for profiles and levels as they relate to resolution:
360p Baseline L3
720p Main L3.2
1080p High L4.0
2160p High L5.0
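These nominal pairings are easy to encode as a lookup table in a transcoding pipeline. The sketch below is illustrative only: the dictionary name is mine, and real device support varies, so treat it as a starting point rather than a compatibility guarantee.

```python
# Nominal H264 profile/level pairings from the table above.
NOMINAL_H264_SETTINGS = {
    "360p":  ("Baseline", "3.0"),
    "720p":  ("Main",     "3.2"),
    "1080p": ("High",     "4.0"),
    "2160p": ("High",     "5.0"),
}

profile, level = NOMINAL_H264_SETTINGS["1080p"]
print(f"1080p -> {profile} profile, Level {level}")  # -> 1080p -> High profile, Level 4.0
```

In practice you would cross-check each entry against the target device's published decoder specifications before locking in a transcoding profile.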
Not all devices handle higher levels and profiles, therefore it is important to check device specifications for proper balance and limits of these settings. Some of this will be discussed in the Devices section later in this series.
A Brief History of Encoding
Current encoding methodologies got their start 30 years ago at the advent of digital video in television production. They were at first limited by traditional broadcast standards, with televisions receiving their signals over an antenna and displaying them on a 4×3 interlaced CRT (aka a square television), and by the limited data rate of a Compact Disc (CD), 1.5 Mbps.
It all began with the H261 codec, put into use in 1988 and created by the Video Coding Experts Group. H261 would go on to become the first standard for video conferencing, but resolutions and bitrates were limited (resolutions of 352×288 and 176×144, with bitrates only up to 2 Mbps).
Also in 1988, the Moving Picture Experts Group, or MPEG, formed to tackle digital video standards, a consortium of experts coming together to set the stage for streaming video's explosion 20 years later. MPEG would build on H261 in its first codec, laying the basis for the standards still used today.
Each standard MPEG released improved on the previous one in significant ways, adding adjustable profiles and levels, de-noising filters, and ultimately B-Frames and P-Frames, allowing further compression.
In the next post, we will discuss streaming methodologies, adaptive switching and the current worldwide industry stats such as who currently has the fastest bandwidth. In the final post, I will break down the different device types and accompanying streaming settings.