Learn audio decoding and rendering with Cavern

Matrix-based audio coding

Not all encodings or compressions require their own separate format or type of audio stream. Some can pack themselves into other streams, either by using some of the inaudibly low bits (like Auro-3D's encoder, called Octopus, does) or by merging channels in a way that can be unmerged later. Matrix encodings are the latter. This means there is a table where each column is a source (input) channel and each row is a target (output) channel. The mathematical term for such a table is a matrix. The matrix that creates the compressed audio is called the encoding matrix, and the one that reverses it is called the decoding matrix.

Encoding matrices

A very basic example of matrix mixing is adding a center channel to a stereo mix and separating it back out. Its encoding matrix looks like the following:

Channel      Left in    Center in    Right in
Left out     1          √2 / 2       0
Right out    0          √2 / 2       1

We mix a 3.0 input into a 2.0 output by adding the center channel to each output. The left output receives the full left input and some of the center, and the right output receives the full right input and some of the center. For this reason, they are sometimes called Left total (Lt) and Right total (Rt). The center is mixed by a factor of √2 / 2, which means a constant power mixing is applied. When you mix one audio source to sound from perfectly between two speakers, this is the amount of mixing that's needed to keep its volume. This phenomenon is better described on the Rendering tab. The output became a stereo mix with the center panned to the middle perfectly, which, acoustically, could be called lossless, because we hear the center from the center, even if it's not there (assuming a perfect calibration of course).
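The encoding step above can be sketched in a few lines of plain Python. This is only an illustration of the matrix from this section, not Cavern's implementation: the names `ENCODING_MATRIX`, `apply_matrix`, and the sample values are assumptions made up for the example.

```python
import math

C_GAIN = math.sqrt(2) / 2  # constant-power gain used for the center channel

# Rows are outputs (Lt, Rt); columns are inputs (left, center, right).
ENCODING_MATRIX = [
    [1.0, C_GAIN, 0.0],
    [0.0, C_GAIN, 1.0],
]

def apply_matrix(matrix, channels):
    """Mix equal-length input channels into one output channel per matrix row."""
    length = len(channels[0])
    return [
        [sum(gain * ch[i] for gain, ch in zip(row, channels)) for i in range(length)]
        for row in matrix
    ]

# Hypothetical test samples: each input is active at a different time.
left = [1.0, 0.0, 0.0, 0.0]
center = [0.0, 1.0, 0.0, 0.5]
right = [0.0, 0.0, 1.0, 0.0]

lt, rt = apply_matrix(ENCODING_MATRIX, [left, center, right])

# Constant power: the squared center gains of both outputs add up to 1.
center_power = C_GAIN ** 2 + C_GAIN ** 2
print(lt, rt, center_power)
```

The center samples show up in both `lt` and `rt` scaled by √2 / 2, while the left and right samples pass through unchanged to their own outputs.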

Decoding matrices

To get back the center, we need to use another matrix mix from 2.0 to 3.0, specifically, this decoding matrix:

Channel       Left in     Right in
Left out      1           -√2 / 2
Center out    √2 / 2      √2 / 2
Right out     -√2 / 2     1

Multiple strange-looking things can be noted here. First, the center is mixed back with a √2 / 2 factor from both the left and right channels. If you calculate it, this is exactly what we want: a resulting factor of 1, which means the volume was not changed. If we mix the center to two outputs with a factor of √2 / 2, then combine those parts with another factor of √2 / 2, we get √2 / 2 encoding * √2 / 2 decoding * 2 channels = 1. The other thing is what looks like subtracting the right channel from the left and the left channel from the right. We need this to remove the center we added to each one. There is one glaring issue with this though: doesn't it remove sounds that were originally panned between the left and right channels? Yes, it does, but it's not a problem. In a perfectly calibrated system, those sounds would have originated from the same virtual location as the center. In a non-perfectly calibrated system, this makes spatial cues better, because it forces these sounds to originate from the center speaker. The issue is an audible distortion if the channels are not time-aligned, which means this method of decoding is both mathematically and acoustically lossy.
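The gain calculation above can be checked by multiplying the two matrices from this page. The snippet below is a minimal sketch in plain Python; the helper `multiply` and the variable names are assumptions made up for this illustration.

```python
import math

G = math.sqrt(2) / 2

# Rows are outputs, columns are inputs, as in the tables of this section.
ENCODING_MATRIX = [[1.0, G, 0.0], [0.0, G, 1.0]]   # left, center, right -> Lt, Rt
DECODING_MATRIX = [[1.0, -G], [G, G], [-G, 1.0]]   # Lt, Rt -> left, center, right

def multiply(a, b):
    """Plain matrix product a x b for matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Encoding followed by decoding: a 3x3 matrix from the original channels
# (columns) to the decoded channels (rows).
round_trip = multiply(DECODING_MATRIX, ENCODING_MATRIX)
center_to_center = round_trip[1][1]

for row in round_trip:
    print([round(value, 4) for value in row])
```

The middle entry, `center_to_center`, is the √2 / 2 * √2 / 2 * 2 = 1 product from the text. The other off-diagonal entries are not zero, so the round trip is not the identity matrix, which is the mathematical side of the lossiness described above.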


Higher order matrix codings

Matrix coding can be extended to any channel layout, because the previously described center separation can be extended to any number of intermediate channels. If you had a quadro signal, that is, 4 corner channels as input, they could be split up into more channels, and with good calibration, acoustically losslessly. If you had 8 channels, with 4 additional ones in the top corners, you could have full spatial sound encoded into just 8 channels. Cavern's renderer can encode anything to these 8 channels, and its NearestUpmixer can try to reverse it as much as possible, even into thousands of channels. This is one way Cavern upmixes 9.1.6 to 64 channel installations.

So why is it so rarely used for upmixing if the previous example shows that we can upmix anything? Because in a non-calibrated system, or if someone doesn't sit in the center, audible distortions appear. The reason Cavern can get away with it is the 9 available channels on the ground layer: distortions become heavily directional, so they aren't affected by seating positions that much, and can be mitigated greatly with calibration. Technically, any matrix decoder can be used for upmixing: they will produce the extra channels from any input, not just encoded inputs. However, you need specifically encoded content for the best results, for example, PlayStation 2 games for experiencing true Pro Logic II. Speaking of Pro Logic II:

Phase-shifted matrix codings

While we can add and subtract audio signals between channels, signals mathematically exist on two planes. We can "rotate" them, so they interfere far less with other signals. This rotation is called a Hilbert transform or 90° phase shift. For engineers and mathematicians, the Hilbert transform is a filter with an impulse response of 1 / (π * t). Because this is a very heavy distortion, until it's reversed, the shifted channel can barely be heard in the encoded signal. If we encoded the center with this method, even the encoding would become lossy. Phase shifting is marked with complex numbers in the matrix: +90° is noted as a multiplication with j and -90° as a multiplication with k. For example, encoding the center channel with a phase shift would be:

Channel      Left in    Center in       Right in
Left out     1          j * √2 / 2      0
Right out    0          k * √2 / 2      1

This is how Dolby Stereo encoded a single surround track before Pro Logic.
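To make the j and k entries of the matrix above more concrete, the following plain Python sketch applies a +90° and a -90° phase shift to a test tone and mixes it into two outputs with the √2 / 2 factor. It is an illustration under stated assumptions, not Cavern's or Dolby's implementation: the function `phase_shift_90` is made up for this example, it uses a naive DFT instead of a real-time Hilbert filter, and it treats the signal as periodic. The left and right inputs are left silent to keep the result easy to read.

```python
import cmath
import math

def phase_shift_90(signal, direction):
    """Shift all frequencies of a real signal by +90 (direction=1) or -90 (direction=-1) degrees.

    Naive O(n^2) DFT for illustration; the signal is treated as periodic.
    """
    n = len(signal)
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    for f in range(n):
        if f == 0 or 2 * f == n:
            spectrum[f] = 0j                 # DC and Nyquist cannot be phase shifted
        elif 2 * f < n:
            spectrum[f] *= 1j * direction    # positive frequencies
        else:
            spectrum[f] *= -1j * direction   # negative frequencies
    return [
        (sum(spectrum[f] * cmath.exp(2j * math.pi * f * t / n) for f in range(n)) / n).real
        for t in range(n)
    ]

G = math.sqrt(2) / 2
n = 64
# Hypothetical test tone: 4 full cosine periods in the phase-shifted input channel.
shifted_input = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]

left_out = [G * s for s in phase_shift_90(shifted_input, 1)]    # the "j" entry
right_out = [G * s for s in phase_shift_90(shifted_input, -1)]  # the "k" entry

total = [a + b for a, b in zip(left_out, right_out)]       # shifted channel cancels
difference = [a - b for a, b in zip(left_out, right_out)]  # shifted channel remains
print(max(abs(v) for v in total), max(abs(v) for v in difference))
```

Because the two outputs carry the channel with opposite phase shifts, their sum cancels it and their difference keeps it, still rotated by 90° relative to the input.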