Implementing Content Aware Encoding (CAE)

Recently, we were delighted to make a change that will improve bandwidth management for our customers. The Kaltura transcoding logic now applies Content Aware Encoding (CAE) to all content ingested into Kaltura (with exceptions made for a few specific customers to account for their special needs).

What’s Content Aware Encoding (CAE)?

Until fairly recently, video was encoded according to fixed “bitrate ladders”, in which each rung pairs a bitrate with a resolution (such as 235 Kbps/320×240, 375 Kbps/384×288, etc.). If too low a resolution is chosen, the picture will appear fuzzy to viewers. If too high a bitrate is chosen, the viewer will experience buffering, while too low a bitrate will result in annoying encoding “artifacts”. The goal is to balance these factors for an optimal viewer experience. The same bitrate ladder was applied to all content to try to achieve an acceptable playback experience and bandwidth usage on average, without taking specific videos’ content into account.
Content Aware Encoding, on the other hand, examines each individual video’s characteristics and optimizes encoding accordingly. The point of CAE is to reduce the playback bandwidth, but still provide the same quality viewing experience. In many cases, the new bandwidth can be as low as half that of ‘non-CAE’ videos.
Basically, this means that the transcoding logic optimizes the encoding procedure for each content source, based on its complexity level. Less complex videos, such as videos with simple animation or uncluttered backgrounds, require lower bitrates to reach a sufficient quality level. More complex videos, such as those with very busy backgrounds or lots of movement, require a higher bitrate. Choosing the lowest acceptable playback bitrate for each individual video means more end users will get a better streaming experience, with less buffering, all the while maintaining playback quality.
Limiting the encoded rendition output for content where high-bitrate renditions are unnecessary yields a substantial gain in bitrate efficiency.
For more details, read on.

Changing the Traditional Transcoding Heuristics

As noted before, the non-CAE encoding process applies the same encoding ladder to all ingested sources. Furthermore, all the assets are forced to the max rendition bitrate. In our case, this meant generating all 6 flavors included in our default flavor set, even if there was no visual difference between the lowest flavor and the highest flavor.
Until quite recently, most companies used this mode. But now, that’s changing.

CAE Gains Popularity

At the beginning of this year, Netflix published an article describing their approach to CAE and their attempts to integrate this technology. Shortly afterwards, Streaming Media published a commentary on the Netflix article, along with their own version of CAE.
In the article, Netflix described the way they worked out their CAE flow: in-depth research of transcoding quality issues and comparison between various combinations of frame size, bitrate, and other parameters. This included both automatic quality measurements and manual inspections of the generated test content. The resultant automatic flow analyzes those combinations to get the optimal encoding configuration for every new ‘title’/content. The results are very good, but the process is very time- and resource-consuming.
In the Streaming Media article, Jan Ozer described a much simpler approach that is based on FFmpeg’s ability to force a specific quality level (via the CRF parameter). Examining encodings with different CRF values made it possible to determine which CRF meets the quality requirements, making the encoding as visually similar to the source as possible at the lowest possible bitrate. This way, the maximal encoding bitrate can be defined for each source. From this, it is possible to derive the rest of the ‘encoding ladder’.
This approach is much simpler than Netflix’s solution, but it is also less precise.
Examining YouTube renditions shows that they do not stick to a rigid encoding ladder: different clips get different rendition/bitrate spreads. This means that YouTube applies CAE as well. This was confirmed a couple of months ago, when YouTube published an article explaining how they used machine learning with Google Brain to implement their Content Aware Encoding.

Kaltura’s Original Transcoding Heuristics

Although it was not called ‘content aware’, Kaltura’s previous transcoding mode had some CAE capabilities.

  • The source’s bitrate was used as the maximal bitrate for the asset transcoding. Therefore, if the source bitrate was low, let’s say 1000Kbps, the renditions that were set with higher bitrates (1500, 2500 and 4000) were not generated.
  • The encoding tools were not forced to produce the maximum bitrate. Instead, they were set to ‘attempt to provide max bitrate’. Quite often, the resulting bitrate was lower than maximum bitrate.
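
The first rule above can be sketched as a simple filter over the flavor ladder. This is only an illustration: the ladder values below are hypothetical (only 1500, 2500 and 4000 appear in the text), and `flavors_for_source` is a made-up name, not a Kaltura function.

```python
# Hypothetical flavor ladder in Kbps. Only 1500, 2500 and 4000 appear in the
# text above; the three lower values are made up for illustration.
DEFAULT_LADDER_KBPS = [400, 600, 900, 1500, 2500, 4000]


def flavors_for_source(source_kbps, ladder=DEFAULT_LADDER_KBPS):
    """Keep only the renditions whose target bitrate does not exceed the source bitrate."""
    return [kbps for kbps in ladder if kbps <= source_kbps]


# A 1000Kbps source skips the 1500, 2500 and 4000 renditions.
print(flavors_for_source(1000))
```

Note that this rule only looks at the source bitrate, which is why a high-bitrate but visually simple source still received every HD flavor.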

Still, this approach was not efficient. In many cases, there was no need for HD flavors even though the source bitrate was very high, because there was no visual difference between the HD assets and the lower bitrate assets. This was typical of webcast videos, lectures, and user-generated content.
So there were definitely gains to be made with Content Aware Encoding. But which approach was best for Kaltura?
The Netflix approach requires extensive processing for every source. Kaltura ingests 10K-60K entries per day, so adding that level of processing wouldn’t be efficient. On the other hand, YouTube’s flow evolved from long machine learning research. We’re not planning on challenging Google’s machine learning expertise any time soon.
After we experimented with many approaches, the one described by Streaming Media looked promising.

Checking the Streaming Media Assumptions

The first phase was to verify that the ‘forced quality’ (CRF) approach can predict the required max bitrate.
This phase involved defining 6 content categories: Film, Simple Animation, Real World Action, Talking Heads, Screencam, and Webcasting. For each category, we examined several samples as follows:

  • We generated each sample with several forced quality levels.
  • We measured source-to-rendition quality reduction with PSNR (peak signal-to-noise ratio).
  • We evaluated visual playback.

These limited-scope tests verified the initial Streaming Media assumptions: we can use a CRF=23 forced-quality rendition to predict the required rendition bitrate. The bitrate of that rendition is the ‘Source Complexity’ value.
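
As a rough sketch of what such a forced-quality encode looks like, the helper below builds an FFmpeg command line with CRF=23. The choice of libx264, the file names, and the helper name are assumptions for illustration; the actual Kaltura command lines are not shown in this post.

```python
import shlex


def crf_probe_command(source_path, output_path, crf=23):
    """Build an ffmpeg command that encodes the video at a fixed CRF (constant quality)."""
    return [
        "ffmpeg", "-y",
        "-i", source_path,
        "-an",               # drop audio: only the video bitrate matters here
        "-c:v", "libx264",   # assumption: H.264 via libx264
        "-crf", str(crf),    # forced quality level instead of a target bitrate
        output_path,
    ]


# Hypothetical file names, printed rather than executed.
print(shlex.join(crf_probe_command("source.mp4", "probe_crf23.mp4")))
```

The video bitrate of the resulting file (its size divided by its duration) is then the value the text calls the Source Complexity.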

Source Complexity Evaluation

The Source Complexity should be determined before any asset conversion can take place. The simplistic way would be to run full source conversion (with CRF=23) in order to get the Source Complexity value. Although it is much simpler and shorter than Netflix’s flow, conversion of an HD source into an HD rendition might last several hours in Kaltura’s current transcoding environment. Applying this to ALL of the sources would double (or at least significantly increase) the entry time-to-ready and would probably overload Kaltura transcoding resources.
Therefore, we decided to limit the Complexity evaluation time.
The method used was to generate 20 samples (1 second each), spread throughout the whole file. We calculated average I and P frame sizes and used them in order to estimate the final bitrate. This kind of processing takes on average around 30 seconds, and for the longest case, approximately 60 seconds.
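
A minimal sketch of the sampling idea follows. The post does not say exactly how the I and P frame sizes are combined, so the estimate below assumes one I frame per GOP with all other frames being P frames (no B frames); the function names and the example numbers are hypothetical.

```python
def sample_offsets(duration_sec, samples=20, sample_len=1.0):
    """Start times (seconds) of `samples` clips of `sample_len` seconds, spread evenly over the file."""
    usable = max(duration_sec - sample_len, 0.0)
    if samples == 1 or usable == 0.0:
        return [0.0]
    step = usable / (samples - 1)
    return [round(i * step, 3) for i in range(samples)]


def estimate_bitrate_kbps(avg_i_bytes, avg_p_bytes, fps, gop_frames):
    """Estimate the bitrate, assuming one I frame per GOP and P frames otherwise."""
    bytes_per_gop = avg_i_bytes + avg_p_bytes * (gop_frames - 1)
    gops_per_sec = fps / gop_frames
    return bytes_per_gop * gops_per_sec * 8 / 1000


# Hypothetical numbers: a 200-second source, 25 fps, one I frame every 50 frames.
print(sample_offsets(200)[:3])
print(estimate_bitrate_kbps(avg_i_bytes=60000, avg_p_bytes=5000, fps=25, gop_frames=50))
```

Because only 20 seconds of video are encoded regardless of source length, the cost stays roughly constant, at the price of the estimation error discussed in the next section.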

Verification of the Source Complexity Evaluation Process

The verification process required a much larger number of samples. We randomly selected several thousand entries from the last 3 months. This list was used for all following proof-of-concept tests.
For the verification tests, ~100 samples were used. In most cases, the sampled complexity evaluation results were ~10-20% higher than the ‘non-sampled’ complexity evaluation. But there were several cases where the sampled results were ~30-40% lower than the ‘non-sampled’ results. Since the sampled complexity value is used to set the max rendition bitrate, the rendition files in those cases would have insufficient quality.
In order to avoid quality reduction issues, the final encoding flow limits the Content Aware Encoding ‘gain’ to at most 50% of the video bitrate set in the transcoding parameters. For example, if the transcoding parameters bitrate is set to 4000Kbps and the source complexity bitrate is 1000Kbps, the CAE logic will set the max bitrate to 2000Kbps, even though the complexity level is 1000Kbps.
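
The gain limit can be expressed as a small clamp. This is a sketch, not the production logic: the function and parameter names are illustrative, and capping the result at the configured bitrate when the complexity is higher is an assumption.

```python
def cae_max_bitrate_kbps(param_kbps, complexity_kbps, max_gain=0.5):
    """Pick the max rendition bitrate, never cutting more than `max_gain` of the configured bitrate."""
    floor_kbps = param_kbps * (1 - max_gain)      # lowest bitrate CAE is allowed to choose
    # Assumption: never exceed the configured bitrate, even for very complex sources.
    return min(param_kbps, max(complexity_kbps, floor_kbps))


# The example from the text: 4000Kbps configured, 1000Kbps complexity.
print(cae_max_bitrate_kbps(4000, 1000))
```

With the values from the example above, the floor is 2000Kbps, so the 1000Kbps complexity estimate is overridden.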

Encoding tests

For roughly 1000 random sample entries, we used the assets’ original command lines to generate the proof-of-concept (POC) files. The video bitrate in those command lines was changed to the higher of two values: the recommended source complexity bitrate (see above) or 50% of the predefined asset bitrate. All the other encoding parameters remained the same.

Quality Tests

Lowering the bitrate causes some quality degradation. The goal of this proof-of-concept phase was to check whether the final quality of the files that were generated in ‘Content Aware’ mode was still sufficient, despite the lower bitrate. PSNR was used as the main quality metric because there is quite a lot of data linking PSNR values to subjective quality perception. For each sample, PSNR was calculated both for the original asset files and for the POC renditions.
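For reference, PSNR is derived from the mean squared error between a reference signal and a test signal, and is expressed in decibels. A minimal sketch for 8-bit sample values follows; the function name is illustrative, and real tools compute this per frame and per color plane rather than on flat lists.

```python
import math


def psnr(reference, test, max_value=255):
    """PSNR in dB between two equal-length sequences of sample values."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no error at all
    return 10 * math.log10(max_value ** 2 / mse)


# Two tiny hypothetical "images" that differ by one level per sample.
print(round(psnr([100, 100], [101, 99]), 2))
```

Higher values mean the test signal is closer to the reference, which is why the thresholds are stated as lower bounds.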
The following are some PSNR values-of-interest:

  • PSNR>45: There is no visual difference between the reference and the tested file. Any attempt to generate renditions with PSNR values higher than 45 is a waste of resources.
  • PSNR=~40: Good conversion quality.
  • PSNR<35: Artifacts are visible.


When comparing the PSNR of a POC file against the PSNR of the original asset, we used the following thresholds:

  • Assets with a PSNR difference smaller than 0.1 can be considered to be ‘GOOD’
  • Assets with a PSNR difference smaller than 0.3 can be considered to be ‘OK’
  • All POC results that are >43 are ‘sufficient’ even if the asset’s PSNR is considerably higher (for example: POC 43, asset 45).
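
The rules above can be sketched as a small classifier. The function name and the ‘CHECK’ fallback label are hypothetical, and the 43 threshold is treated as inclusive here because the text’s own example uses a POC value of exactly 43.

```python
def classify_poc(asset_psnr, poc_psnr):
    """Label a POC rendition by how far its PSNR falls below the original asset's PSNR."""
    delta = asset_psnr - poc_psnr
    if delta < 0.1:
        return "GOOD"
    if delta < 0.3:
        return "OK"
    if poc_psnr >= 43:       # inclusive, to match the "POC 43, asset 45" example
        return "SUFFICIENT"
    return "CHECK"           # hypothetical label: needs a closer (e.g. visual) look


print(classify_poc(asset_psnr=45, poc_psnr=43))
```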

Test Methods

  • FFmpeg/PSNR: This test was used to generate asset and POC file PSNR metrics (vs. source in both cases). This was applied to all POC files.
  • MSU VQMT: In order to verify the FFmpeg results, we ran this test on a limited number of files. The FFmpeg results were very close in most tests. In all cases, the difference between POC and asset results was similar for both MSU VQMT and FFmpeg. The MSU VQMT tool has many limitations that make it impossible to use for large-scale tests.
  • Visual test: POC and asset files proved to have a similar quality.
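
For the FFmpeg measurement, one way to obtain a PSNR value is FFmpeg’s `psnr` filter, which compares two video inputs. The helper below only builds such a command; the file names and the helper name are placeholders, and the exact invocation used in these tests is not given in the post. If the rendition’s frame size differs from the source’s, one of them has to be scaled first, which is not shown here.

```python
import shlex


def psnr_command(rendition_path, source_path):
    """Build an ffmpeg command that compares a rendition against its source with the psnr filter."""
    return [
        "ffmpeg",
        "-i", rendition_path,   # first input: the file under test
        "-i", source_path,      # second input: the reference
        "-lavfi", "psnr",       # compare the two video streams
        "-f", "null", "-",      # discard the output, only the reported stats matter
    ]


# Hypothetical file names, printed rather than executed.
print(shlex.join(psnr_command("poc.mp4", "source.mp4")))
```

Running the same comparison for both the original asset and the POC file against the same source gives the two values whose difference is evaluated above.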

Test results

For high quality sources, the PSNR delta between POC and asset was, in most cases, very small: in the ‘GOOD’ to ‘OK’ range (see above). For low quality sources, the PSNR difference was disturbingly high: up to 1-2 (before the 50% gain limitation, it was up to 4 and sometimes higher). But despite the large PSNR gap, there was no visual quality difference between POC files and the assets.

Content Aware Encoding Integration: Phase I

We activated Content Aware Encoding for several test customers to see whether there were any issues or complaints. In parallel, we tested some of the content (both playback and PSNR) and monitored for customer complaints and issues.
The next step was to activate CAE for one of our ‘content intensive’ customers. The following are some of the resulting stats:

          Samples   AVG Source   AVG Asset
CAE       3921      7602         7952
NON-CAE   6041      7258         9663

This is a comparison of 3 weeks of the content intensive customer’s load with CAE and 3 weeks without it, for sources with a bitrate above 2000Kbps.

  • AVG Source: average source bitrate in that period
  • AVG Asset: average sum of asset bitrates, per entry

Assuming that the content style in both 3-week periods is approximately the same, the ‘AVG Asset’ of the CAE period is ~20% lower than that of the NON-CAE period. The customer had no issues or complaints.
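
The reduction can be checked directly from the table. The second calculation, which normalizes by the average source bitrate of each period, is an extra step added here for illustration and is not part of the original analysis.

```python
# Figures copied from the table above.
cae = {"avg_source": 7602, "avg_asset": 7952}
non_cae = {"avg_source": 7258, "avg_asset": 9663}

# Straight comparison of the two 'AVG Asset' values.
raw_reduction = 1 - cae["avg_asset"] / non_cae["avg_asset"]

# Same comparison after dividing by each period's average source bitrate,
# to account for the slightly different source mix (an added assumption).
cae_ratio = cae["avg_asset"] / cae["avg_source"]
non_cae_ratio = non_cae["avg_asset"] / non_cae["avg_source"]
normalized_reduction = 1 - cae_ratio / non_cae_ratio

print(f"raw: {raw_reduction:.1%}, normalized: {normalized_reduction:.1%}")
```

The raw figure comes out a little under 20% and the normalized one a little over, so both are in line with the estimate quoted above.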

Content Aware Encoding Integration: Phase II

Here we are today. Having thoroughly tested CAE to our satisfaction, we have now officially changed the Kaltura transcoding logic to apply Content Aware Encoding (CAE) to all content ingested into Kaltura. The default ‘ContentAwareness’ value was changed to 0.5, activating CAE as the default for all Kaltura transcodings.
We’re excited to move into this new age of Content Aware Encoding. By reducing the chance of buffering without significantly impacting the crispness of the image, we can offer our clients and their end users a significantly improved overall experience, while reducing the bandwidth load. Enjoy!

Interested in seeing this work for yourself? Sign up for a free trial of MediaSpace.

