Learn audio decoding and rendering with Cavern


Object encoding in home mixes

Glossary

Object: A moving audio source.
Aggregate object: A single object playing the sound of multiple objects for storage and/or performance optimization.
Bed: A non-moving audio source matched with one or more output channels. The bed is generally a 7.1.2 mix of music, ambience, and environmental reverb.

Limitations

Home cinema systems are heavily constrained by bandwidth, so home mixes follow the conventions below to accommodate the weakest link.

While the codecs support up to 64 tracks (like E-AC-3 with JOC), more than 16 have not been observed. The reason is that the object movement/extraction data then only just fits next to a single audio frame, which keeps the stream within the range that heavily bandwidth-limited links such as HDMI ARC can still transfer.
Channels and objects are not mixed together. The tracks are either all objects or all beds, with the exception of the LFE, which is always an object. Although most codecs support mixing the two, this has not been observed in movie or music tracks, so it is considered undefined behavior. Movies are almost exclusively object-only, and music is generally channel-only.
Some object-based codecs support object sizes. However, this is undocumented and has not been observed in any movie or music release. Sounds that expand are instead mixed to multiple channels or objects.

How objects are downmixed

Safe approach

Because object sizing and mixed typing are discouraged, this approach pre-renders the entire mix to a fixed channel layout, which is then exported in the metadata as either completely channel-based, joint channel-based, or object-based content. Cavernize and Enhanced AC-3 Merger practically work this way.

Real-world examples are overwhelmingly joint channel-based: the objects of inactive channels are moved back to the default location (in E-AC-3's case, the front left channel) without any encoded audio, but they are not deactivated. This behavior can be observed using Cavern Driver's Demo Player. In these cases, 7.1.2 is active most of the time because of the aforementioned bed mixing techniques, and other channels up to 9.1.6 occasionally become active. This can lead to the hasty conclusion that the content is a 7.1.2 downmix, which it is not. Cavern's encoder does not move unused channels back to the default position, and some newer movie encoders also behave this way.

Having a completely channel-based mix does not degrade audio quality in any way. Since home setups are currently limited to 9.1.6, a mix with all of those channels available will render correctly on any possible channel arrangement. For smaller rooms, even matrix downmixing is perfectly accurate: it results in exactly the same mix as if the content had been rendered to that specific layout. This is thanks to balance-based rendering, which, when performed repeatedly towards smaller layouts, gives the same result as rendering for the small layout in the first place.
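The equivalence of repeated and direct rendering can be checked on a toy case. The sketch below is not Cavern's renderer: it assumes linear-gain balance panning on a single left-right axis, and the names pan_1d, downmix, FRONT_3 and FRONT_2 are made up for the illustration. An object is rendered to left/center/right and then downmixed to left/right, and the result is compared with rendering it to left/right directly.

```python
def pan_1d(position, speakers):
    """Linear balance panning on one axis (an assumed, simplified model).

    speakers maps a speaker name to its coordinate; returns name -> gain.
    """
    ordered = sorted(speakers.items(), key=lambda item: item[1])
    gains = {name: 0.0 for name in speakers}
    # Sources outside the speaker span stick to the nearest outer speaker.
    if position <= ordered[0][1]:
        gains[ordered[0][0]] = 1.0
        return gains
    if position >= ordered[-1][1]:
        gains[ordered[-1][0]] = 1.0
        return gains
    # Otherwise, balance between the two speakers enclosing the source.
    for (left, left_x), (right, right_x) in zip(ordered, ordered[1:]):
        if left_x <= position <= right_x:
            ratio = (position - left_x) / (right_x - left_x)
            gains[left] = 1.0 - ratio
            gains[right] = ratio
            break
    return gains


def downmix(gains, source_layout, target_layout):
    """Re-render every source speaker feed as a source at that speaker's position."""
    result = {name: 0.0 for name in target_layout}
    for name, gain in gains.items():
        for target, target_gain in pan_1d(source_layout[name], target_layout).items():
            result[target] += gain * target_gain
    return result


# Hypothetical layouts: coordinates on a left-right axis.
FRONT_3 = {"L": -1.0, "C": 0.0, "R": 1.0}
FRONT_2 = {"L": -1.0, "R": 1.0}

direct = pan_1d(0.5, FRONT_2)
two_step = downmix(pan_1d(0.5, FRONT_3), FRONT_3, FRONT_2)
print(direct, two_step)
```

Both paths give the same left and right gains because linear interpolation composes cleanly. A panning law that is not linear in gain would need its own check.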

Graphic approach

To appeal to object visualizers and display some movement while still respecting the limit of 16 objects, one solution is to aggregate groups of objects. Objects close to each other can be mixed together at their averaged position until the object limit is met. This setup will always show object movement when the original mix contained any, but because of the averaging and re-rendering, the method is considered lossy. Of all the downmixing methods, this is the only one that results in degraded audio quality, even at the channel count it targets.
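The aggregation step could look roughly like the following sketch. It is only one possible reading of the description: the greedy closest-pair strategy, the unweighted position average, and the name aggregate_objects are assumptions, not the behavior of any particular encoder. Only positions are handled here; the audio of the members of each group would be summed into the aggregate object's track.

```python
import math


def aggregate_objects(positions, limit):
    """Merge the closest groups until at most `limit` groups remain.

    positions is a list of (x, y, z) tuples for a single moment in time.
    Returns a list of (averaged_position, member_indices) pairs.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    groups = [(tuple(pos), [index]) for index, pos in enumerate(positions)]
    while len(groups) > limit:
        # Find the two groups whose current positions are closest.
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                distance = math.dist(groups[a][0], groups[b][0])
                if best is None or distance < best[0]:
                    best = (distance, a, b)
        _, a, b = best
        members = groups[a][1] + groups[b][1]
        # Place the aggregate at the plain average of its members' positions
        # (assumption: no weighting by loudness).
        merged = tuple(
            sum(positions[i][axis] for i in members) / len(members)
            for axis in range(3)
        )
        groups[a] = (merged, members)
        del groups[b]
    return groups


# Three objects squeezed into two slots: the two neighbors are merged.
result = aggregate_objects([(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (1.0, 1.0, 1.0)], 2)
print(result)
```

The loss described above is visible here: after merging, the two neighboring objects no longer have distinct positions, so they cannot be rendered separately on any layout.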

Combined approach

The recommended and somewhat future-proof method is based on the safe approach. When there is a lot of object movement, falling back to completely channel-based rendering is the only viable way of preserving the entire soundstage and keeping the mix accurate, so it is the only method that should be used in that case. However, when the number of active bed channels and dynamic objects is less than the codec's object limit, it is possible to convey all of them, storing both channels and objects as output tracks that will all be objects in the resulting file. Because it allows occasional use of additional channels on layouts larger than 9.1.6 without any loss of spatial precision, this is the recommended method of storing spatial mixes. The combined approach serves as a middle ground for all use cases, sometimes displaying active and precise object movement while always being spatially accurate. This is the current general practice.
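The decision at the heart of the combined approach can be written as a small sketch. The function name choose_storage, the returned labels, and the default limit of 16 are illustrative assumptions. Counting a mix that exactly fills the limit as fitting is also an assumption.

```python
def choose_storage(active_bed_channels, dynamic_objects, object_limit=16):
    """Pick how to store a mix (or a section of it) under the object limit."""
    needed_tracks = active_bed_channels + dynamic_objects
    # Assumption: a mix that exactly fills the limit still fits.
    if needed_tracks <= object_limit:
        # Everything fits: beds and objects are all stored as objects.
        return "objects"
    # Too much movement: pre-render to a fixed channel layout instead.
    return "channel-based"


print(choose_storage(active_bed_channels=10, dynamic_objects=4))
print(choose_storage(active_bed_channels=10, dynamic_objects=8))
```

The first call fits within 16 tracks and keeps the object movement, while the second exceeds the limit and falls back to the channel-based render of the safe approach.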