Learn audio decoding and rendering with Cavern

What's in this topic?

Codec qualities and rendering methods, especially for object-based formats, are frequently discussed online. This is a disambiguation and description of such codecs based on developer documentation and their specific implementation in Cavern. These pages clear up common misunderstandings, explaining, for example, why proper channel placement is not described in degrees, or why the amount of object movement says nothing about content quality.

Basics of encoding: audio streams and tracks

We first have to describe the two things an audio track can mean. The first is an audio track of a movie: a disc or streamed movie can contain different languages, commentaries, or audio tracks for the hearing or visually impaired. While "track" is the word players regularly use for these, the technical term is audio stream. What decoding software actually calls an audio track is a single component of an audio stream. Audio streams are generally split into three parts:

  • Metadata header: A general description of the entire stream, such as language, length, or codec. Present only once, at the beginning of the file, so streams can be listed and selected.
  • Block/data header: Technical information about how to play the stream, such as the number of tracks and how detailed each track is. This is sent periodically (many times a second) to the decoders, like your AVR, so they know how to interpret the data; they don't see the files, only the audio stream you selected.
  • Encoded data: The raw audio data for each track, specially coded and sometimes compressed. Compression means the data is stored in less space than it originally took. This can be done with non-destructive methods, such as finding repetitions and storing them only once; this is called lossless compression. We can also discard less important sounds, like pitches too high to barely hear; this is called lossy compression.
For some formats, like RIFF WAVE, the two kinds of headers are combined, and there is only a single header at the beginning of the file.
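The lossless idea above, finding repetitions and storing them only once, can be sketched with a toy run-length encoder. This is only an illustration of the principle; real audio codecs use far more sophisticated prediction and entropy coding.

```python
def rle_encode(samples):
    """Toy lossless compression: store each run of repeated samples once, with a count."""
    encoded = []
    for s in samples:
        if encoded and encoded[-1][0] == s:
            encoded[-1][1] += 1  # extend the current run
        else:
            encoded.append([s, 1])  # start a new run
    return encoded

def rle_decode(encoded):
    """Reverse the encoding: expand each (sample, count) pair back to the original."""
    return [s for s, count in encoded for _ in range(count)]

data = [0, 0, 0, 5, 5, -3]
packed = rle_encode(data)          # [[0, 3], [5, 2], [-3, 1]] - 3 pairs instead of 6 samples
assert rle_decode(packed) == data  # lossless: the original is reconstructed exactly
```

Because decoding reproduces the input exactly, no information is lost; a lossy codec would instead discard detail it deems inaudible.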

Pulse code modulation (PCM)

What every track of every stream eventually comes down to is its representation of sound, called PCM. Nearly every digital audio device handles sound this way. Sound, physically speaking, is just a slight change in air pressure. Microphones record this air pressure many times a second, and to play it back, we move a speaker's membrane in exactly the same way. In an ideal system, the same movement of a speaker driver creates the same sound. We don't live in a perfect world, and this is why a solution like Cavern QuickEQ is needed to correct audio reproduction systems as much as possible. The change of air pressure over time is called a sound wave or waveform. A measurement of air pressure at a single point in time is called a sample. The number of samples we take each second is called the sampling rate.
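Sampling at a fixed rate can be demonstrated by measuring a mathematical waveform at regular intervals. This sketch uses a pure sine tone as the "air pressure" source; the 48 kHz rate matches the movie example discussed below.

```python
import math

SAMPLE_RATE = 48000  # samples per second (48 kHz, typical for movies)

def sample_sine(frequency_hz, duration_s):
    """Measure a sine wave's 'air pressure' at fixed intervals: this is PCM sampling."""
    num_samples = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * frequency_hz * n / SAMPLE_RATE)
            for n in range(num_samples)]

wave = sample_sine(440, 0.01)  # 10 ms of a 440 Hz tone
len(wave)                      # 480 samples: 48000 per second * 0.01 seconds
```

Each entry of `wave` is one sample; a real recording would come from a microphone instead of `math.sin`, but the fixed-interval structure is the same.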

PCM means we take samples at fixed intervals, i.e. with a fixed sample rate, and each sample is an absolute value representing the actual sound pressure. The precision of a sample is called the bit depth, usually expressed as the number of bits of data each sample takes up. The bitrate of a stream is the total amount of data it needs every second. Let's calculate a simple example:

  • Movies are generally 48 kHz, which means 48000 samples are taken every second.
  • A typical bit depth for content made for home playback is 16 bits per sample.
  • To get the bitrate for a single track, we can multiply the two together: 16 bits * 48000 samples = 768000 bit/s, which divided by 1000 is 768 kbit/s.
  • For stereo content, there are two tracks: one for the left and one for the right speaker. The last step is to multiply the bitrate of a single track by the number of tracks: 2 tracks * 768 kbit/s = 1536 kbit/s. This is the stream's bitrate.
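The calculation above can be condensed into a single formula: bitrate = sample rate * bit depth * track count. A minimal sketch:

```python
def stream_bitrate_kbps(sample_rate, bit_depth, tracks):
    """Bitrate of an uncompressed PCM stream: sample rate * bit depth * tracks, in kbit/s."""
    return sample_rate * bit_depth * tracks / 1000

stream_bitrate_kbps(48000, 16, 1)  # 768.0 kbit/s for a single track
stream_bitrate_kbps(48000, 16, 2)  # 1536.0 kbit/s for the stereo example above
```

Note that this is the uncompressed bitrate; lossless or lossy compression reduces the actual data sent over the wire.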


Channel-based formats

We didn't define exactly what an audio track is, and didn't call them channels, because they can be different things. A track just means a distinct sound; in modern formats, it could even fly around the room. Traditionally, a track was called a channel, and each channel had a different speaker assigned to it by standard. If a stream had two tracks with no metadata next to them, every audio system simply assumed a stereo pair: the first track was played on the left speaker and the second on the right. Additional channels could be added up to a point; there is a standard channel order for 5.1 and 7.1 systems, but not everyone agreed on what it should be. This is when the data headers of most formats, and even simple WAV files, started to list the channels present in the encoded data and their order, to prevent playing the wrong track on the wrong speaker. We call this channel mapping. Channel mapping can be extended to any number of channels: even RIFF WAVE supports 18 channels, 6 of which are height channels, in a format from 1998.

In short, channel-based formats simply contain audio tracks and sometimes metadata about which speaker that track should be played on. A track assigned to a speaker is called a channel.
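A channel mapping can be pictured as a lookup from track position to speaker. The sketch below is illustrative only: real formats such as WAVEFORMATEXTENSIBLE use bit masks rather than name lists, and the layouts shown are just common conventions, not any specific standard.

```python
# Hypothetical channel maps: speaker labels in stream order, keyed by track count.
CHANNEL_MAPS = {
    2: ["Front Left", "Front Right"],
    6: ["Front Left", "Front Right", "Center", "LFE",
        "Back Left", "Back Right"],  # one common 5.1 ordering
}

def speaker_for_track(track_index, track_count):
    """Look up which speaker a given track should be played on."""
    layout = CHANNEL_MAPS.get(track_count)
    if layout is None:
        raise ValueError(f"No known mapping for {track_count} tracks")
    return layout[track_index]

speaker_for_track(0, 2)  # "Front Left": a two-track stream is assumed stereo
speaker_for_track(3, 6)  # "LFE": the fourth track of this 5.1 layout
```

Without such a mapping in the header, a decoder can only guess, which is exactly the historical ambiguity the text describes for layouts beyond stereo.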

Object-based formats

After a point, we can't just keep increasing the number of speakers and assume everyone will buy as many speakers as there are channels in a file. In the early 2010s, a new paradigm was born: what if, instead of channels, we store each sound in a separate track, and just tell where that sound should be coming from? Thus, object-based audio was born: an object is a track with data about where it is in the room. These are the bubbles you see in videos showing off object-based audio, like the one on Cavern's Welcome page. Object-based streams don't strictly contain just objects. They can still have fixed channels, e.g. for stereo music or the bass of the LFE track. The channels contained in an object-based file are called bed channels. Bed channels are handled exactly as if they were in a channel-based format: they are directly assigned to a speaker. If that speaker does not exist, which can happen with files that contain as many as 16 channels, nearby speakers will share its sound. You can read about how exactly this is done on the Rendering tab.
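The idea of nearby speakers sharing a sound can be sketched with the simplest possible case: panning one signal between two speakers on a line. This is not Cavern's actual renderer (the Rendering tab describes that); it is a minimal, assumed linear-gain model to show how position data turns into per-speaker gains.

```python
def pan_gains(object_pos, left_pos, right_pos):
    """Split an object's signal between two speakers by its position between them.

    Positions are on an arbitrary 1D axis; gains sum to 1 (simple linear panning,
    an illustrative assumption, not any codec's specified panning law).
    """
    span = right_pos - left_pos
    t = min(max((object_pos - left_pos) / span, 0.0), 1.0)  # clamp to the segment
    return 1.0 - t, t  # (gain for left speaker, gain for right speaker)

pan_gains(0.25, 0.0, 1.0)  # (0.75, 0.25): the object sits nearer the left speaker
pan_gains(1.0, 0.0, 1.0)   # (0.0, 1.0): exactly at the right speaker's position
```

Real renderers work in three dimensions, typically with more than two contributing speakers and perceptually motivated panning laws, but the principle is the same: an object's position metadata determines how much of its signal each nearby speaker receives.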