R³ – Spatial Audio Formats: A Technical Deep Dive

By Audiocube.app and Noah Feasey-Kemp (2025)

Introduction

Spatial audio refers to advanced sound formats that recreate a three-dimensional sound field, allowing listeners to perceive sound sources in space around them (including above and below). Unlike traditional stereo or mono, spatial audio techniques aim to capture or render sound with realistic directionality and immersion. This white paper provides a deep technical exploration of spatial audio formats – their history, how they store 3D audio information (including binaural techniques), comparisons of major formats, and practical guidance on integrating these formats with Audiocube (a Unity-based 3D audio workstation). We will examine how spatial audio data is encoded compared to standard PCM audio (e.g. WAV), and discuss exporting Audiocube’s 3D audio into formats like Ambisonics, Dolby Atmos, and MPEG-H 3D Audio. The target audience is audio engineers, developers, and industry professionals, so we focus on technical details and standards.

Historical Overview of Spatial Audio Formats

Early Experiments (1880s–1930s): The quest for immersive sound began soon after the invention of audio recording. One of the first binaural audio demonstrations dates back to 1881 with Clément Ader’s Théâtrophone, which used twin telephone lines to send a two-channel (left/right) live opera performance to listeners wearing a telephone earpiece at each ear (dear-reality.com). This gave an early sense of stereo imaging and directional hearing. In 1931, engineer Alan Blumlein pioneered stereo recording techniques, observing that a single cinema loudspeaker could not match sound localization to the on-screen action. Blumlein developed a stereo recording system and filed a patent that laid the groundwork for multi-directional sound reproduction. Around the same time, researchers also experimented with dummy-head setups to capture binaural cues (placing microphones in an artificial head) to replicate human hearing for lifelike playback, though such experiments remained niche.

Emergence of Surround Sound (1940s–1970s): The concept of true surround sound was introduced in cinemas in the mid-20th century. Walt Disney’s 1940 film Fantasia used an experimental system called Fantasound with multiple audio channels to move sounds around the theater, including simulating a bee flying around the audience. The system deployed about five channels and was a technical marvel for its time, though it was too costly and complex to install widely. By the 1950s and 60s, stereo became common in music and film, but the idea of multi-channel audio saw renewed interest by the late 1960s. Notably, the concept of Ambisonics was formulated in the UK in the late 1960s and early 1970s as a method to record and reproduce a full 360-degree sound field. Invented by Michael Gerzon and colleagues, Ambisonics promised true 3D sound capture (including height) using a special encoding of the sound field rather than discrete speaker feeds. However, early Ambisonics faced commercial challenges and required significant processing to decode; it was not widely adopted by manufacturers in that era.

In parallel, the consumer and music industries in the early 1970s saw experiments with quadraphonic sound (4.0 surround), which placed four speakers in a square layout. Quadraphonic formats (from RCA, CBS, and others) allowed music producers to place instruments around the listener for more spatial depth (dear-reality.com). However, competing quadraphonic standards and the expense of extra speakers led to its market failure. Instead, the industry gravitated toward simpler surround formats. By the late 1970s and 1980s, Dolby Laboratories introduced matrix-encoded surround for film: Dolby Stereo (1976) could carry four channels (Left, Center, Right, and a Surround channel) encoded into the two channels of a film soundtrack. This was followed by Dolby Surround and Dolby Pro Logic for home use, which similarly matrix-encoded a rear channel into stereo signals. These provided a rudimentary surround experience but were still limited in spatial realism.

Modern Spatial Audio (1990s–2010s): A major leap came with digital audio and discrete multichannel formats. In the 1990s, Dolby Digital (AC-3) and DTS introduced 5.1 surround sound (five full-range channels – front L/C/R and surround L/R – plus an LFE subwoofer channel) in cinemas and eventually home theater. Discrete 5.1 (and later 7.1) channel audio became standard for DVD, HDTV, and Blu-ray, greatly improving spatial placement in the horizontal plane. However, these systems still lacked vertical placement (no height channels) and were tied to specific speaker layouts.

Interest in true 3D audio resurfaced in the 2010s. Two parallel approaches gained prominence: object-based audio and renewed scene-based (Ambisonic) audio. Dolby Atmos, introduced in 2012, is a landmark object-based format that adds height speakers and treats sounds as independent objects with 3D coordinates, rather than only channel-based feeds (professional.dolby.com). Dolby Atmos for cinema allows up to 128 simultaneous audio tracks, of which 118 can be discrete sound “objects” positioned freely in space, along with 10 channels reserved for the bed mix (a 9.1 bed, equivalent to a 7.1.2 layout with its two overhead arrays). The theater’s playback system can have as many as 64 individually powered speakers (including overhead), and the Atmos renderer calculates how each object’s audio should be distributed to the available speakers in real time, based on metadata. This was a shift from fixed channel feeds to a more dynamic, scalable approach to immersive sound.

Following Atmos, other Next-Generation Audio (NGA) formats emerged. DTS:X (2015) is a similar object-based format from DTS, and Auro-3D (early 2010s) took a channel-based approach to add height layers (e.g. 9.1 or 11.1 layouts). Around the same time, the Moving Picture Experts Group (MPEG) developed an open standard for 3D audio: MPEG-H 3D Audio, published in 2015 (soundguys.com). MPEG-H was designed to be a flexible codec supporting channel-based, object-based, and Ambisonics-based audio, with up to 64 speaker outputs and 128 coded channels for immersive sound. It was adopted as part of the ATSC 3.0 broadcast standard (first deployed in South Korea in 2017) and even underlies newer services like Sony’s 360 Reality Audio music platform.

Meanwhile, Ambisonics found a resurgence in Virtual Reality and 360° video applications. Platforms like YouTube and Facebook adopted First-Order Ambisonics (FOA) as the audio format for 360° videos by the mid-2010s (brucewiggins.co.uk). The ability to rotate an Ambisonic soundfield in software to match a VR viewer’s head movements – then decode it to headphones (binaurally) – made it ideal for immersive VR experiences (en.wikipedia.org) (brucewiggins.co.uk). This “VR audio” revival firmly established Ambisonics and binaural rendering as practical tools, even though these technologies originated decades earlier.

Today’s spatial audio landscape is rich but fragmented. We have legacy channel-based formats (5.1, 7.1, etc.), scene-based formats (Ambisonics of various orders), and object-based systems (Dolby Atmos, DTS:X, MPEG-H). Each has its strengths and specific use cases, but the lack of a single unified format means audio professionals must navigate multiple standards. The next sections will explain how these formats technically store spatial information and compare their features.

Spatial Audio Storage: Channels, Objects, Scenes, and Binaural Techniques

Storing or encoding spatial audio can be approached in different ways. Broadly, we categorize approaches as channel-based, object-based, or scene-based, and we discuss the role of binaural audio in rendering spatial sound. Each approach has different data formats and technical implications, especially when compared to a standard PCM format like WAV.

  • Channel-Based Audio: This is the traditional method where specific audio channels correspond to specific speaker locations. For example, a 5.1 surround WAV file has six channels (in the standard WAVE_FORMAT_EXTENSIBLE order: Front Left, Front Right, Center, LFE, Back Left, Back Right), and each channel is meant for that speaker. The spatial impression is created by appropriate level and timing differences between these channels. A standard WAV file can carry multiple channels (e.g. 2 for stereo, 6 for 5.1, 12 for 7.1.4, and so on), but the format itself doesn’t convey any 3D position beyond the implicit speaker assignment; the layout is fixed during production. Standard PCM (WAV) is just raw sample data plus a header listing the channel count and, in the extensible format, a channel mask – it has no intrinsic spatial metadata. It’s up to the playback system’s configuration to map those channels to space. Thus, compared to spatial formats, a pure WAV is limited: it cannot adapt to different speaker setups or head-tracking without remixing. It is essentially a hard-coded spatial mix. Channel-based approaches (like stereo or 5.1) are straightforward and widely compatible, but they lack flexibility; you get the intended effect only on the matching speaker arrangement.

  • Object-Based Audio: In object-based systems, individual sound sources are not permanently rendered into specific channels at production time. Instead, each sound “object” is stored with its own audio track (usually mono or stereo) and accompanied by metadata describing its position (in 3D coordinates) and other properties (like spread or trajectory over time). At playback, a renderer uses the metadata to synthesize the appropriate signals for the listener’s setup – whether that’s a particular speaker array or headphones. Dolby Atmos is the prime example: it treats audio as a set of objects plus optionally some channel beds (Dolby Atmos Cinema Sound - Dolby Professional). The Atmos production file (e.g., within a Dolby TrueHD or Dolby Digital Plus container) contains audio channels plus metadata that the decoder (in an AVR, soundbar, or device) uses to “steer” sounds to the proper locations (Spatial Audio vs Dolby Atmos or same thing? : r/AppleMusic). In other words, Atmos is not delivered as raw discrete channels for each speaker; it is delivered as a package of audio + metadata instructions. A Reddit description puts it succinctly: “Atmos is a set of instructions (metadata) that tell your AVR how to steer the sound... The difference between Atmos and regular PCM is the metadata that tells the AVR how to steer the sound.” (Spatial Audio vs Dolby Atmos or same thing? : r/AppleMusic). This allows the same Atmos stream to play over many layouts – from a theater with 64 speakers to a 5.1.2 home system or even headphones – the renderer adapts the output. MPEG-H 3D Audio works on a similar principle but as an open standard. MPEG-H can carry traditional channels, objects, and even higher-order Ambisonics in one stream. It was built to be universal in representing all known paradigms (What is MPEG-H? Everything you need to know - SoundGuys). For example, an MPEG-H stream could contain several audio objects each with XYZ coordinates (and can be made interactive, allowing user control like turning up dialogue or choosing an audio angle). The decoder will render the scene to the actual speaker setup or to binaural if listening on headphones. Because object-based formats include metadata and require encoders/decoders, they are often delivered in specialized containers or bitstreams rather than plain WAV. Dolby Atmos in home use is encapsulated in Dolby TrueHD (on Blu-ray) or in Dolby Digital Plus (for streaming) with additional data (known as Dolby MAT or JOC), whereas MPEG-H is a coded bitstream (part of ISO/IEC standards) often in MP4 files or broadcast transport streams.

  • Scene-Based Audio (Ambisonics): This approach stores a representation of the entire sound field independent of any specific speaker layout. Ambisonics does this by encoding the sound field using spherical harmonics. The most common format is Ambisonic B-Format, which for first-order Ambisonics uses four channels: W, X, Y, Z. These channels are not “front-left” or “right-surround” signals, but rather a speaker-independent encoding of the sound field’s pressure and velocity components (Ambisonics - Wikipedia). Specifically, W is an omnidirectional (pressure) component, and X, Y, Z are figure-eight patterns oriented along three axes (representing directional gradient components). Higher-order Ambisonics extends this with more channels representing higher-order spherical harmonic terms for greater spatial resolution. A key point is that B-format can be decoded to any playback setup. The same B-format recording can be rendered over stereo (by converting to binaural), over 5 speakers, 8 speakers, or any number/arrangement by using an appropriate decoder algorithm. In Ambisonics, the producer is effectively working in terms of angles and directions rather than speaker feeds (Ambisonics - Wikipedia). This is why Ambisonics is often called scene-based – it’s a scene description (like a soundfield) that can later be mapped to speakers. One consequence: Ambisonic channels require decoding. On its own, a B-format WAV doesn’t sound correct if played directly over speakers without a decoder; it has to be decoded according to the listener’s context (similar to how an object-based renderer works, but via a mathematical transform rather than following discrete object metadata). Ambisonics was ahead of its time in this regard. As noted on Wikipedia: “Unlike some other multichannel surround formats, [Ambisonics] channels do not carry speaker signals. Instead, they contain a speaker-independent representation of a sound field called B-format, which is then decoded to the listener's speaker setup.” (Ambisonics - Wikipedia). This extra layer gives tremendous flexibility – the listener can use any speaker layout or even rotate the entire soundfield (for example, to match a VR headset orientation) before decoding (YouTube, Ambisonics and VR – The Blog of Bruce). However, Ambisonics historically suffered from incompatible conventions (different normalizations and channel orderings existed), causing confusion until recent times. Modern implementations have converged on a standard: ACN channel ordering with SN3D normalization (often referred to as AmbiX format), which is used by YouTube and many tools (YouTube, Ambisonics and VR – The Blog of Bruce) (Ambisonic data exchange formats - Wikipedia). First-order AmbiX uses 4 channels ordered as W, Y, Z, X (note the ordering difference: Y, Z, X for AmbiX vs the older Furse-Malham which was W, X, Y, Z). Higher orders simply append higher-degree components.

  • Binaural Audio: Binaural audio is somewhat unique in this discussion because it is not a distinct file format or encoding like the above – rather, it’s a method of delivering spatial cues over standard two-channel audio. Binaural recording or rendering uses the concept of HRTFs (Head-Related Transfer Functions) to filter audio so that when played through headphones, the listener perceives it as coming from specific directions in 3D. For storage, binaural audio is typically just a normal stereo audio file (like a WAV or MP3) that has been processed with special filtering. There’s no extra metadata required; the directional information is embedded in the sound via subtle timing (ITD) and volume (ILD) differences and spectral coloring that mimic how our ears receive sound from different directions (What Is Binaural Audio? Immersive Audio Deep Dive | Audiocube). Because of this, binaural content can be distributed in any standard stereo format – “you don’t need any special type of player or audio format – it can all be contained in standard audio formats” (What Is Binaural Audio? Immersive Audio Deep Dive | Audiocube). The downside is that binaural audio is “baked in” for a given head position – if the listener turns their head, the sound does not re-localize (unless the system dynamically changes the audio based on head tracking, which requires an interactive element). Binaural audio can be created in two main ways: recorded binaurally (using a dummy head with mics in the ear positions, capturing real HRTF effects) or rendered binaurally from a spatial mix (using digital HRTF filters on each sound source). In spatial audio workflows, formats like Ambisonics or object audio are often ultimately rendered to binaural for headphone playback. For instance, a VR application might store the audio scene in Ambisonics, then the app or video player decodes that scene to a binaural stereo output in real time, so that as the user rotates, the scene rotates accordingly and the binaural output updates. Binaural audio is incredibly important for headphone-based spatial experiences (VR, AR, personal audio) because it is currently the only way to convey elevation and full 3D through just two transducers. Its limitation is that it’s only optimal for headphones – playing binaural audio over speakers usually yields poor results (since each ear is hearing both channels).
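
To make the rendering side of this concrete, the sketch below shows the core operation of binaural rendering: convolving a mono source with a left/right HRIR pair. It is illustrative only – the “HRIRs” here are crude placeholders that model just an interaural delay and level difference, not measured HRTF data.

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono: np.ndarray, hrir_left: np.ndarray, hrir_right: np.ndarray) -> np.ndarray:
    """Render a mono signal to binaural stereo by convolving with an HRIR pair.

    A real renderer would pick (or interpolate) the HRIR pair from a measured
    HRTF set according to the source's azimuth/elevation; here it is passed in.
    """
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)   # shape: (samples, 2)

# Toy example: placeholder HRIRs modelling only ITD/ILD for a source to the right.
fs = 48_000
itd_samples = 13                                   # ~0.27 ms interaural delay (assumed)
hrir_r = np.zeros(64); hrir_r[0] = 1.0             # near ear: earlier, louder
hrir_l = np.zeros(64); hrir_l[itd_samples] = 0.6   # far ear: delayed, attenuated

mono = np.random.randn(fs)                         # one second of noise as a test signal
stereo = binauralize(mono, hrir_l, hrir_r)
print(stereo.shape)                                # (48063, 2)
```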

To summarize, channel-based audio (like a standard WAV surround file) directly stores speaker feeds – simple but inflexible. Scene-based (Ambisonic) audio stores a complete 3D soundfield in a speaker-independent way – highly flexible but requires decoding and historically had multiple incompatible formats. Object-based audio stores individual sources with metadata – very flexible and interactive, but requires complex encoding/decoding and standards (often proprietary). Binaural audio leverages standard stereo formats to deliver 3D sound to headphones by imitating human ear cues, but it is typically a playback/rendering method rather than a distinct multichannel format. Modern spatial audio systems often combine these approaches (e.g., an MPEG-H stream might have some channel beds, some objects, plus a binaural downmix option, all together). Next, we examine specific mainstream spatial audio formats in detail – Ambisonics, Dolby Atmos, and MPEG-H – and compare their technical characteristics.

Mainstream Spatial Audio Formats: Technical Details and Comparison

Ambisonics (Scene-Based Full-Sphere Audio)

Ambisonics is a spatial audio technique and format family that represents the sound field using a series of spherical harmonics. The most common format, First-Order Ambisonics (FOA), uses four channels (B-format W, X, Y, Z as described earlier) to capture sounds coming from all directions around a point. Higher-order Ambisonics (HOA) uses additional channels to increase spatial resolution (for example, 2nd order has 9 channels, 3rd order 16 channels, etc., following the formula (N+1)^2 for Nth order). Ambisonics can capture a full 360° sphere of sound (horizontal and vertical) around the listener (Ambisonics - Wikipedia). It was initially developed in the 1970s (by Gerzon et al.), with early implementations like the Soundfield microphone to capture B-format.
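
For reference, the channel count, the standard ACN channel index, and the SN3D/N3D normalizations are tied together by simple relations (standard Ambisonic conventions, stated here for convenience):

```latex
K = (N+1)^2 \ \text{channels for order } N \quad (N{=}1 \to 4,\ N{=}2 \to 9,\ N{=}3 \to 16),
\qquad
\mathrm{ACN} = n^2 + n + m \ \text{for the component of order } n,\ m \in [-n, n],
\qquad
Y_{n,m}^{\mathrm{N3D}} = \sqrt{2n+1}\; Y_{n,m}^{\mathrm{SN3D}}.
```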

Technical Encoding: In FOA B-format, the channels correspond roughly to: W = omnidirectional (zeroth-order term), X = front-back bidirectional, Y = left-right bidirectional, Z = up-down bidirectional. In spherical-harmonic terms, W is the zeroth-order harmonic Y₀⁰ (omni), while X, Y, and Z correspond to the first-order harmonics Y₁⁺¹, Y₁⁻¹, and Y₁⁰ respectively (in the common real-valued convention). By combining these in different proportions, one can synthesize the signal of an arbitrarily oriented directional microphone (which is how decoding works). The audio data in Ambisonic B-format is typically stored as multi-channel WAV files. For example, a “4-channel WAV” could be a first-order Ambisonic recording. The format itself is just PCM, but it is understood that the four channels have this special meaning. Modern Ambisonic files often use a .amb extension or a .wav with metadata indicating Ambisonic content. The widely accepted modern convention is ACN (Ambisonic Channel Number) ordering with SN3D normalization (aka AmbiX), as noted, which is used by Google’s VR/YouTube 360 platform (YouTube, Ambisonics and VR – The Blog of Bruce). One challenge historically was that different software/hardware used different orderings (FuMa versus SID versus ACN), requiring careful channel remapping in some cases (YouTube, Ambisonics and VR – The Blog of Bruce). This has been largely resolved by the community standardizing on AmbiX for new content (Ambisonic data exchange formats - Wikipedia) (“most modern applications use ACN and SN3D” (Ambisonic data exchange formats - Wikipedia)).
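
As a minimal illustration of these conventions (a sketch under the SN3D/AmbiX definitions above, not a production encoder), first-order encoding and FuMa-to-AmbiX conversion look like this:

```python
import numpy as np

def encode_foa_ambix(mono: np.ndarray, azimuth_deg: float, elevation_deg: float) -> np.ndarray:
    """Encode a mono signal into first-order Ambisonics (AmbiX: ACN order W, Y, Z, X; SN3D).

    Azimuth is measured counter-clockwise from the front and elevation upward,
    matching the usual Ambisonic convention.
    """
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    gains = np.array([
        1.0,                        # W (omni)
        np.sin(az) * np.cos(el),    # Y (left-right)
        np.sin(el),                 # Z (up-down)
        np.cos(az) * np.cos(el),    # X (front-back)
    ])
    return gains[:, None] * mono[None, :]          # shape: (4, samples)

def fuma_to_ambix(b_format_fuma: np.ndarray) -> np.ndarray:
    """Convert first-order FuMa (W, X, Y, Z with a -3 dB W) to AmbiX (W, Y, Z, X; SN3D)."""
    w, x, y, z = b_format_fuma
    return np.stack([w * np.sqrt(2.0), y, z, x])

# A source 90 degrees to the left, at ear level:
foa = encode_foa_ambix(np.random.randn(48000), azimuth_deg=90, elevation_deg=0)
print(foa.shape)   # (4, 48000)
```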

Ambisonics Pros: Ambisonics provides format-agnostic playback. A single Ambisonic master can be rendered to any speaker layout – stereo, binaural, 5.1, 22.2, you name it – by appropriate decoders. It is inherently 3DoF (three degrees of freedom) immersive, meaning the listener can yaw/pitch/roll rotate their listening orientation and the scene can be correspondingly rotated, which is trivial in the Ambisonic domain (a simple mathematical rotation of the harmonics) (YouTube, Ambisonics and VR – The Blog of Bruce). This makes it great for VR, where head-tracking is needed. Another advantage is that Ambisonics can be recorded directly using special microphones (e.g., Sennheiser AMBEO VR mic, SoundField mic), capturing a real environment in 360 sound, which is useful for ambiance or live VR content. Additionally, Ambisonic mixes can be easily downmixed or upmixed; for example, a first-order file can be decoded to stereo seamlessly, or a higher-order can be decoded to first-order if needed for compatibility. Ambisonics is also an open technique – no proprietary licensing to use the format itself (though certain mic or decoder algorithms might be proprietary, the concept is public domain).
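
The rotation mentioned above really is a small matrix operation. For example, a yaw (horizontal) rotation of an AmbiX first-order stream leaves W and Z untouched and rotates the X/Y pair – a sketch, assuming the same AmbiX channel order as before:

```python
import numpy as np

def rotate_foa_yaw(ambix_foa: np.ndarray, yaw_deg: float) -> np.ndarray:
    """Rotate an AmbiX first-order stream (channels W, Y, Z, X) about the vertical axis.

    A positive yaw rotates the scene counter-clockwise (sources move to the left);
    W (omni) and Z (height) are unaffected by a pure yaw rotation.
    """
    th = np.radians(yaw_deg)
    w, y, z, x = ambix_foa
    x_rot = np.cos(th) * x - np.sin(th) * y
    y_rot = np.sin(th) * x + np.cos(th) * y
    return np.stack([w, y_rot, z, x_rot])
```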

Ambisonics Cons: Ambisonics requires a decoding step and careful calibration to reproduce accurately, which can be technically challenging. Without a proper decoder (or if decoded to the wrong speaker layout improperly), the result will be poor or weird. The sound spatial resolution of first-order is also limited – FOA can produce a general sense of direction but the soundfield is not pin-point sharp, especially for higher frequencies which tend to blur due to the limited spherical harmonic order. To get very sharp imaging, one needs higher-order Ambisonics (third order or above), but that means many more channels (e.g., 16 channels for third-order) and heavier processing. This is a trade-off in storage and CPU. Another con historically noted is compatibility: not all media formats or players natively support Ambisonic files. Support is growing (YouTube, game engines like Unity, Facebook 360, etc., do support FOA audio), but it’s nowhere near as ubiquitous as stereo or 5.1 support (A comparison between Ambisonic and Dolby Atmos - KMR Studios AB). This means custom workflows (e.g., injecting metadata into MP4 for 360 videos) are sometimes needed. Finally, working with Ambisonics in production requires understanding the encoding/decoding process; it is less intuitive than simply panning sounds to speakers or to XYZ coordinates, which can raise the learning curve for sound engineers.

Standardization Issues: As mentioned, multiple Ambisonic conventions historically existed, causing confusion. There were different normalizations (SN3D, N3D, FuMa, etc.) and channel orders. A piece of Ambisonic content had to come with documentation of its format or it might be decoded incorrectly. The community has largely converged, but even today, one must ensure that the Ambisonic files use the expected format for the target platform (e.g., YouTube expects AmbiX FOA W-Y-Z-X ordering (YouTube, Ambisonics and VR – The Blog of Bruce)). Another issue is that there isn’t a single widely adopted consumer delivery format for Ambisonics beyond the niches of VR. For instance, no streaming music or movie service (outside VR) delivers Ambisonics to end users; it’s usually decoded to something (binaural or 5.1) before delivery. So Ambisonics is more of a production/interchange format or a VR-specific distribution format, rather than something an average consumer sees by name.

Dolby Atmos (Object-Based Immersive Audio)

Dolby Atmos is a proprietary immersive audio format introduced by Dolby Laboratories. Unlike channel-bound formats, Atmos uses an object-based approach combined with defined speaker “beds”. In the professional cinema format, an Atmos mix can contain up to 128 discrete tracks: typically a 9.1-channel bed (a traditional 7.1 surround layout extended with two overhead array channels) and up to 118 additional audio objects that can be placed freely in 3D space (Dolby Atmos Cinema Sound - Dolby Professional). These objects are accompanied by metadata that specifies their position (and can vary over time, enabling moving sounds). During playback, the Atmos renderer takes into account the actual speaker layout of the system and renders each object appropriately, ensuring the intended spatial placement is achieved whether there are 12 speakers or 30 speakers.

Technical Encoding: In theaters, Atmos is packaged within a DCP (Digital Cinema Package) as a special track file. For home theater, Atmos can be carried in two main ways: (1) Dolby TrueHD + Atmos (used on Blu-ray discs), which is a lossless audio stream with embedded Atmos metadata, and (2) Dolby Digital Plus with Joint Object Coding (DD+ JOC), a lossy stream used in streaming services to deliver Atmos over limited bandwidth. A newer codec, Dolby AC-4, is also capable of carrying Atmos for broadcast or streaming, and is part of some broadcast standards. It’s important to note, as an industry article clarifies, “‘Dolby Atmos’ is an umbrella term for Dolby’s immersive sound experience... content can be transmitted via various codecs (e.g. Dolby Digital Plus, AC-4)” (MPEG-H Audio vs. "Dolby Atmos" - there is a winner!). The Atmos metadata itself includes the 3D position (azimuth, elevation, distance or gain) of objects, along with possibly size or divergence (for spread of sound), and grouping info (for example, tagging dialog objects so they can be boosted for accessibility). On the playback side, a Dolby Atmos decoder is essentially a piece of software/firmware that knows how to interpret this object metadata and mix the audio objects into the available outputs.
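
To illustrate the kind of information such object metadata carries (this is not Dolby’s actual data format – the coordinate convention, field names, and the example trajectory are invented for illustration), an object can be thought of as an audio stem plus timed position data:

```python
from dataclasses import dataclass, field

@dataclass
class PositionKeyframe:
    time_s: float      # timeline position in seconds
    x: float           # -1 (left)  .. +1 (right)   -- normalized room coordinates (assumed)
    y: float           # -1 (back)  .. +1 (front)
    z: float           #  0 (floor) .. +1 (ceiling)

@dataclass
class AudioObject:
    name: str
    audio_file: str                                  # mono stem carrying the object's signal
    size: float = 0.0                                # 0 = point source, 1 = fully diffuse
    keyframes: list[PositionKeyframe] = field(default_factory=list)

# A helicopter object flying from front-left to directly overhead over five seconds.
heli = AudioObject(
    name="helicopter",
    audio_file="stems/helicopter.wav",
    keyframes=[
        PositionKeyframe(0.0, x=-1.0, y=1.0, z=0.2),
        PositionKeyframe(5.0, x=0.0,  y=0.0, z=1.0),
    ],
)
```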

Dolby Atmos Pros: Dolby Atmos is highly scalable and flexible. A single Atmos mix can play on many setups, from a large cinema with dozens of speakers down to a soundbar with upward-firing drivers, or even to binaural headphone output (as done in Dolby Atmos for headphones on Windows/Xbox, or Apple’s Spatial Audio on AirPods). This scalability is a huge advantage for content creators, as they can create one immersive mix that adapts to each listener’s environment. Atmos is also widely supported: it has become a de facto standard for immersive audio in film and increasingly in music streaming (e.g., Apple Music, Tidal offer music in Atmos). Many devices (AV receivers, TVs, smartphones, headphones) now support decoding Atmos. From a creative standpoint, Atmos allows very precise placement and movement of sounds without being limited to speaker positions. A sound can originate at an exact point in space, giving a realistic experience (e.g., a helicopter truly sounding like it moves over your head). Another pro is that Atmos content can be made interactive to some extent; for example, in gaming, Atmos can tie into game engines such that sounds are emitted as objects relative to the player. Even for movies, the renderer can adjust (within limits) to each theater, so there’s consistency of experience (Dolby highlights that every seat in an Atmos theater gets a nearly optimal mix since the rendering is spatially aware (Dolby Atmos Cinema Sound - Dolby Professional)).

Dolby Atmos Cons: The cost and complexity of Atmos are significant factors. As a proprietary technology, producing Atmos content typically requires Dolby’s tools or certified workflows (e.g., the Dolby Atmos Production Suite in Pro Tools, or use of Dolby’s Rendering and Mastering Unit for theaters) (A comparison between Ambisonic and Dolby Atmos - KMR Studios AB) (A comparison between Ambisonic and Dolby Atmos - KMR Studios AB). Licensing and hardware requirements mean it is not trivial for independent creators to author an Atmos mix without investing in the proper software, and consumers need Atmos-compatible equipment to experience it. Another limitation is that while Atmos is flexible, it is not easily editable once rendered – the post-production flexibility of rotating or re-mixing that Ambisonics offers is not inherent. If you want to significantly change an Atmos mix (say, rotate all sounds 90°), you would typically need to go back into the DAW and adjust object automation, rather than apply a simple transform. In contrast, Ambisonics allows such transforms on the scene after the fact (A comparison between Ambisonic and Dolby Atmos - KMR Studios AB). Atmos also has some constraints: there is a finite limit to the number of simultaneous objects (118 for cinema, and in practice fewer for home applications due to decoder limitations). If a scene tries to use more objects, sounds must be combined (baked) into channel beds, so extremely complex scenes might lose some object benefits if they exceed the count. Standardization and openness are another issue: Atmos, being Dolby-owned, is not an open standard. Competing systems (like MPEG-H) cannot directly create a Dolby Atmos stream without Dolby’s involvement. For broadcasters or streaming services, adopting Atmos means either licensing Dolby technology or sticking to open alternatives. As noted in a comparison: “MPEG-H Audio is an audio codec, while ‘Dolby Atmos’ is an umbrella term... Dolby is more widely known due to cinema presence, but MPEG-H is becoming relevant due to standardization” (MPEG-H Audio vs. "Dolby Atmos" - there is a winner!) (MPEG-H Audio vs. "Dolby Atmos" - there is a winner!). Some industries view Atmos’s dominance critically since it gives one company a lot of control (for instance, in broadcasting standards battles, Dolby’s AC-4 (with Atmos) competed with MPEG-H and sometimes faced resistance (MPEG-H Audio vs. "Dolby Atmos" - there is a winner!)).

Standardization: While Dolby Atmos itself is proprietary, there have been moves to create standardized ways of exchanging object-based mixes. The Audio Definition Model (ADM), an open metadata model (ITU BS.2076), can represent an Atmos mix in a generic way (channels, objects, etc.). Dolby and others have embraced ADM for interchange – a professional Atmos mix can be exported as an ADM BWF file, which is essentially a Broadcast WAV containing all audio channels plus a metadata XML defining the objects and scene. This ADM file can theoretically be rendered by non-Dolby systems (e.g., an MPEG-H renderer) as well. However, in consumer distribution, there isn’t an open equivalent – if a movie says “Dolby Atmos”, under the hood it’s a Dolby bitstream. This is an area where MPEG-H as an open standard is positioned as an alternative.

MPEG-H 3D Audio (Hybrid: Channels, Objects, and Scene-Based)

MPEG-H 3D Audio is an open standard for immersive audio coding, developed by Fraunhofer IIS and the MPEG committee (ISO/IEC). It was designed to be a superset of spatial audio paradigms, meaning it can handle traditional channel-based mixes, object-based audio, and scene-based audio (Ambisonics) within one unified system (What is MPEG-H? Everything you need to know - SoundGuys). Published in 2015, it became part of the ATSC 3.0 and DVB next-generation audio standards for television and is also used in certain streaming and music services (What is MPEG-H? Everything you need to know - SoundGuys). Notably, MPEG-H is the codec behind Sony 360 Reality Audio, a music format that allows placement of instruments around the listener.

Technical Encoding: MPEG-H is an audio compression codec like AAC or Dolby Digital, but much more advanced in capability. It supports sample rates up to 96 kHz and can code up to 128 “signals”. These signals can be channel signals, object signals, or higher-order Ambisonic components. For example, an MPEG-H stream might carry a 7.1.4 bed (12 channels), plus 5 objects, plus a first-order Ambisonic ambient bed, all simultaneously. The MPEG-H metadata (typically in an accompanying Audio Scene Description XML or within the bitstream) describes what each signal is (channel vs object vs HOA) and the object trajectories or HOA orientation. It supports up to 64 speaker output channels for rendering (What is MPEG-H? Everything you need to know - SoundGuys) (far beyond typical home needs, more targeting pro setups or future use). Also, MPEG-H is built for interactivity – the bitstream can contain preset mixes and allow the end-user to tweak audio. For instance, a sports broadcast could carry multiple commentary languages as objects, or allow the user to adjust the level of commentary vs stadium sound. The decoder will then personalize the mix in real-time. The bitrate for MPEG-H content can vary depending on the complexity (it’s a lossy codec typically, though can be near-transparent at high bitrates). In broadcast, it might operate in the range of 384 kbps to 768 kbps for immersive sound, whereas music streaming might use lower if needed.
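
As a rough mental model of such a stream (schematic only – the field names below are invented and this is not the real MHAS/ISO 23008-3 syntax), an MPEG-H audio scene bundles signal groups of different kinds plus interactivity limits:

```python
# Schematic only: this mirrors the *kinds* of information an MPEG-H audio scene
# carries (signal groups, object metadata, user-interactivity ranges); it is not
# the actual bitstream syntax, and all field names here are invented.
mpegh_scene = {
    "channel_bed": {"layout": "7.1+4H", "signals": 12},
    "objects": [
        {"name": "dialogue",       "signals": 1, "interactive_gain_db": (-6.0, +6.0)},
        {"name": "commentary_alt", "signals": 1, "default_on": False},
    ],
    "hoa": {"order": 1, "signals": 4},          # scene-based ambience bed
    "presets": ["default_mix", "dialogue_plus", "stadium_only"],
}

total_signals = (mpegh_scene["channel_bed"]["signals"]
                 + sum(o["signals"] for o in mpegh_scene["objects"])
                 + mpegh_scene["hoa"]["signals"])
print(total_signals)   # 18 coded signals in this hypothetical scene
```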

MPEG-H Pros: The chief advantage of MPEG-H is its versatility and openness. It is a standardized, royalty-managed format (available for licensing through patent pools (Via Licensing Launches MPEG-H 3D Audio Licensing Program)) rather than a single-company proprietary system. This has led some countries to choose MPEG-H for broadcasting to avoid vendor lock-in (e.g., South Korea’s UHD TV audio is MPEG-H, and Brazil recently chose MPEG-H over Dolby’s AC-4 in their TV 3.0 standard after a comprehensive test (MPEG-H Audio vs. "Dolby Atmos" - there is a winner!) (MPEG-H Audio vs. "Dolby Atmos" - there is a winner!)). From a technology perspective, MPEG-H can do essentially everything Dolby Atmos does (object-based immersion with flexible rendering) and more in some cases. It natively supports Higher-Order Ambisonics (HOA) coding – meaning it can efficiently carry an Ambisonic soundfield. This could be useful for VR content or for capturing venue ambience. It also has integrated binaural rendering capability: the standard includes tools for binaural output, allowing a decoder on a headphone device to render the content in 3D audio for listeners without any speakers (What is MPEG-H? Everything you need to know - SoundGuys). The ability for user interactivity (choosing preset mixes, adjusting levels) is a feature Dolby Atmos only introduced in a limited way (Dolby calls them “Presentations” or objects can be tagged as dialog with an adjustable dialog control, but MPEG-H’s approach is more generalized) (MPEG-H Audio vs. "Dolby Atmos" - there is a winner!). Furthermore, MPEG-H being an audio codec means it’s efficient in compression – it can deliver immersive audio in constrained bandwidth scenarios where uncompressed or less efficient formats would not work. It’s also designed to fold down to stereo or 5.1 gracefully if needed, ensuring backward compatibility.

MPEG-H Cons: Despite being technically robust, MPEG-H has had relatively limited adoption in consumer content compared to Dolby Atmos. Outside of certain regions (South Korean broadcast, some trials in Europe, and specialized music services), you won’t find much MPEG-H content yet. This means device support is not as widespread – many AV receivers and TVs might not recognize MPEG-H streams (though support is increasing through standards compliance). Another con is complexity: because MPEG-H is a do-it-all format, the tools to produce MPEG-H content are not yet as commonly integrated into digital audio workstations as Dolby’s tools. However, Fraunhofer has released an MPEG-H Authoring Suite to facilitate production, including plugins for DAWs like Pro Tools and Reaper (Out now: The Fraunhofer MPEG-H Authoring Suite). Still, the learning curve for sound engineers to use MPEG-H features (such as setting up interactivity) is a consideration. In terms of performance, any object-based system including MPEG-H will introduce some processing latency in decoding (since it must render objects and perhaps perform binaural HRIR convolution), but on modern hardware this is usually manageable.

Standardization: MPEG-H’s big selling point is that it is the standard (an official ISO standard). It has been adopted in 3GPP for audio in mobile streaming, in ATSC 3.0 (an option alongside AC-4), and in DVB. As such, it avoids the ambiguity of what “Atmos” means (Atmos can ride on AC-4 or E-AC3, etc. – the Vrtonung article noted confusion where consumers don’t know if Atmos content is using the “old” E-AC3 or new AC-4 codec (MPEG-H Audio vs. "Dolby Atmos" - there is a winner!)). MPEG-H is more straightforward: if it says MPEG-H, the decoder knows what to do. The format war between Dolby and MPEG-H in broadcasting is ongoing, but MPEG-H recently won significant victories (like Brazil’s adoption, as mentioned) (MPEG-H Audio vs. "Dolby Atmos" - there is a winner!). Over time, we may see a dual support scenario where devices handle both Atmos and MPEG-H, but currently Atmos has more momentum in home theater and streaming, whereas MPEG-H is establishing itself in broadcast and niche music.

Other Formats (Briefly): For completeness, note that DTS:X is another object-based format similar to Dolby Atmos (DTS’s entry into immersive audio, also allowing flexible speaker layouts). It hasn’t seen as much uptake in theatrical releases but is supported on some Blu-rays and AV receivers. Auro-3D is a channel-based 3D audio format (up to 13.1 channels) focusing on adding a height layer and the so-called “Voice of God” ceiling channel. Auro-3D was popular for certain music releases and some cinemas, but has been overshadowed by Atmos. These formats have their own pros/cons but are beyond our main scope. We focus on Ambisonics, Atmos, and MPEG-H as they represent the major paradigms (scene, object, and hybrid) that Audiocube might interface with. The table below summarizes key comparisons:

Comparison of Spatial Audio Formats

Ambisonics (B-format)
Type: Scene-based (spherical sound field).
Channels/Content: Variable channel count (FOA = 4 channels; higher orders 9, 16, or more), carrying spherical harmonic components rather than speaker feeds.
Pros:
  • Full 360° capture (3D sound including height)
  • Playback on any speaker layout via decoding
  • Easy to rotate/manipulate the scene after recording
  • Open format; direct microphone recordings possible
Cons:
  • Requires an encoding/decoding pipeline
  • Limited native support in consumer devices
  • First order has limited spatial resolution (imaging blur)
  • Historically many incompatible channel conventions
Standardization: Open approach (developed in the 1970s) with no single company owner; AmbiX (ACN/SN3D) is the de facto standard for FOA exchange, used by YouTube/Google VR. Widely used in VR, but not a mainstream consumer delivery format.

Dolby Atmos
Type: Object-based plus bed channels.
Channels/Content: Up to 128 tracks, e.g. a 7.1.2 or 9.1 “bed” plus dozens of audio objects with 3D metadata; rendered to up to 64 speakers (cinema) or to a headphone output.
Pros:
  • Precision: objects can be placed anywhere in 3D space
  • Scalable to different setups (theater, home, headphones)
  • Widespread adoption (films, music, streaming)
  • Many tools and workflow integrations available
Cons:
  • Proprietary (Dolby licensing needed)
  • Expensive to implement professionally (hardware, licenses)
  • Limited post-processing flexibility once the mix is rendered
  • Consumer confusion over the format (the Atmos label covers multiple codecs)
Standardization: Dolby proprietary format (2012). De facto industry standard in cinema; delivered on Blu-ray and streaming via Dolby TrueHD and Dolby Digital Plus. Broadcast Atmos usually rides on Dolby AC-4 (standardized in ATSC/DVB). Requires a Dolby decoder for playback.

MPEG-H 3D Audio
Type: Hybrid – channels, objects, and scene-based (HOA).
Channels/Content: Configurable: supports traditional channel sets (up to 7.1.4 and beyond) and up to 128 coded signals, which can include audio objects (with XYZ metadata) and Higher-Order Ambisonics components. The renderer outputs to 2–64 speaker channels, with a binaural rendering option.
Pros:
  • Very flexible and future-proof (channels + objects + Ambisonics in one stream)
  • User interactivity (dialogue level, alternate mixes) supported
  • Open standard (ISO/IEC), not locked to one vendor
  • Efficient compression for delivery (suitable for broadcast and mobile)
Cons:
  • Lower market penetration so far (less content available)
  • Limited support on consumer devices at present (growing in TVs and soundbars)
  • Production tools not as ubiquitous as Dolby’s (learning curve for new workflows)
  • Still requires licensing (patent pool) for encoders/decoders, though more accessible than Atmos
Standardization: International standard (ISO/IEC 23008-3:2015). Adopted in ATSC 3.0 (as an option) and DVB. Used in South Korean UHD broadcast, Brazil’s TV 3.0, and Sony 360 Reality Audio. Open bitstream with ADM-based production workflows. Competes with Dolby Atmos/AC-4 in broadcast standards.

(Table: A comparison of Ambisonics, Dolby Atmos, and MPEG-H 3D Audio spatial formats, highlighting their type, content structure, advantages, disadvantages, and standardization/adoption status.)

Audiocube in Unity: Integrating and Exporting Spatial Audio Formats

Audiocube is a Unity-based 3D Digital Audio Workstation that allows creators to design sound in a virtual environment. Under the hood, Audiocube does not rely on Unity’s built-in audio engine for spatialization; instead, it uses a custom spatializer and acoustics engine developed by its creator (Show HN: Audiocube – A 3D DAW for Spatial Audio). This custom engine simulates 3D audio propagation (including reflections, occlusion, etc.) beyond Unity’s default capabilities. The current version of Audiocube focuses on binaural output – it renders the 3D scene to a stereo binaural mix for headphone listening in real time. In fact, as of now Audiocube “only exports/plays back in binaural stereo” (Show HN: Audiocube – A 3D DAW for Spatial Audio), and multichannel output is under development. To integrate with mainstream spatial formats, Audiocube will need to capture its 3D audio scene and convert or export it into those formats. Here we explore how Audiocube could interface with Ambisonics, Dolby Atmos, and MPEG-H, respectively, and what recommendations can be made for capturing its spatial audio output.

Exporting Ambisonic (B-Format) Audio from Audiocube

One natural path for Audiocube to support a spatial format is through Ambisonics. Unity itself has some support for Ambisonic playback – for example, Unity can import B-format audio clips and decode them to the listener’s speakers or to the selected spatializer. However, by default Unity’s AudioListener (which records the mix) will downmix everything to stereo in built-in recordings (Getting Spatial Audio from Unity Session Recorders – VRAASP Project). In fact, Unity’s documentation confirms that the audio output is always down-mixed stereo unless a specific multichannel mode is used, and it won’t produce Ambisonics out-of-the-box (Getting Spatial Audio from Unity Session Recorders – VRAASP Project). Since Audiocube bypasses Unity’s audio engine, it has more freedom to implement a custom solution.

To export Ambisonics, Audiocube could implement an Ambisonic encoder that takes all the virtual audio objects in the scene and mixes them down to a set of B-format channels. This would involve summing contributions of each sound source into the W, X, Y, Z channels (for FOA) according to the direction of each source relative to the listener. Essentially, Audiocube’s audio engine would perform the mathematical encoding (using the spherical harmonic formulas) in real-time or offline. There are libraries and algorithms available (e.g., Google’s Resonance Audio SDK had ambisonic encoding features, and there are open-source implementations of Ambisonic panners). By integrating such an encoder, Audiocube could output a 4-channel WAV that represents the 3D audio scene in first-order Ambisonics. If higher fidelity is needed, higher-order encoding (with more channels) could be offered as an option, though FOA is the most universally supported (YouTube, etc.). The resulting Ambisonic file could then be used in various ways: for instance, merging with 360° video content (with the proper metadata) for YouTube VR, or decoded in other applications for playback on speaker arrays, or further post-processed.
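
A simplified offline version of that mixdown loop might look like the sketch below. It assumes Audiocube can supply each source’s dry signal and its direction relative to the listener; the positions are static here, and the file-writing step (via the soundfile package) is noted as an optional assumption.

```python
import numpy as np

def encode_scene_to_foa(sources, n_samples: int) -> np.ndarray:
    """Mix (mono_signal, azimuth_deg, elevation_deg) tuples into one AmbiX
    first-order B-format buffer (channel order W, Y, Z, X; SN3D normalization).

    Static positions only -- a real exporter would update the gains per audio
    block as objects move, and could apply distance attenuation beforehand.
    """
    foa = np.zeros((4, n_samples))
    for mono, az_deg, el_deg in sources:
        az, el = np.radians(az_deg), np.radians(el_deg)
        gains = np.array([
            1.0,                      # W
            np.sin(az) * np.cos(el),  # Y
            np.sin(el),               # Z
            np.cos(az) * np.cos(el),  # X
        ])
        foa += gains[:, None] * mono[None, :n_samples]
    return foa

# Hypothetical usage: one source 30 degrees to the right, one directly overhead.
n = 48000
scene = [(np.random.randn(n), -30, 0), (np.random.randn(n), 0, 90)]
foa_mix = encode_scene_to_foa(scene, n)

# The 4-channel result could then be written out as an AmbiX .wav, e.g. with the
# soundfile package (assumed available):
#   import soundfile as sf
#   sf.write("audiocube_scene_ambix.wav", foa_mix.T, 48000, subtype="FLOAT")
```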

An alternative way to get Ambisonic output (if not implemented internally) is to use an external spatial audio plugin. Unity developers sometimes use the Google VR/Resonance Audio toolkit or Facebook’s 360 Audio tools, which can capture a Unity scene to Ambisonics. For example, Resonance Audio provided a feature where you could place a “Recorder” in the scene that outputs a first-order Ambisonic file instead of stereo. Audiocube’s custom engine might not directly use Resonance’s pipeline, but conceptually it could provide a similar feature.

Recommendation: Provide a “Spatial Audio Recorder” within Audiocube that can be toggled to capture the session’s audio into Ambisonic B-format. This recorder would generate a multi-channel file (FOA .wav) in AmbiX format (W, Y, Z, X ordering, SN3D normalization), which is the industry standard (YouTube, Ambisonics and VR – The Blog of Bruce). The user can then take this file into external workflows – for example, load it into Reaper or Pro Tools with an Ambisonic decoder to output to speakers, or upload it as part of 360° video content. This would immediately give Audiocube users a way to get a full 3D audio recording that is not locked to stereo. It is a relatively straightforward addition given Audiocube’s control over its audio pipeline, and it leverages the open nature of Ambisonics (no licensing issues to worry about). It is also a stepping stone to supporting other formats, since both Atmos and MPEG-H could potentially ingest Ambisonic content in some form if needed (MPEG-H supports HOA natively, and Dolby Atmos tools can accept Ambisonics for certain uses like spatial ambient beds).

Additionally, Ambisonics could be used within Audiocube as an interchange format – e.g., to import Ambisonic soundfields recorded elsewhere and include them in the scene. Unity already supports importing Ambisonic audio and decoding to binaural (with appropriate plugins), so Audiocube could likewise allow an ambisonic audio asset to be a sound source that emits a full 3D ambience.

Audiocube to Dolby Atmos Workflow

Supporting a direct Dolby Atmos export from Audiocube is more challenging, given Atmos’s proprietary nature and complex toolchain. However, Audiocube can facilitate an intermediate export that can be brought into a Dolby Atmos production environment. There are a couple of approaches:

1. Multichannel Bed Export: The simplest (though somewhat limiting) approach is to export Audiocube’s mix as a fixed 7.1.4 channel layout (the most common speaker layout used as a bed in home Atmos). This would be a 12-channel WAV (or 12 separate WAV files) corresponding to a standard Atmos bed configuration (7 ear-level speakers, 4 overheads, 1 LFE). Audiocube could internally render the scene to this 7.1.4 configuration by defining a virtual speaker arrangement and panning/summing each source into those feeds (a naive panning sketch is given after this list). Once you have a 7.1.4 mix, that can serve as the base bed for a Dolby Atmos mix. The user or an engineer could then import that into a Dolby Atmos authoring tool (for example, into Pro Tools with the Dolby Atmos Production Suite) and use it directly as the bed. However, what about the individual objects? This method flattens everything into channels, so it loses the discrete object advantage – effectively it would be like delivering a “Dolby Atmos bed-only” mix. It would still work and be playable as Atmos (Atmos allows just a bed without separate objects), but it doesn’t fully realize what Atmos can do.

2. Stems and Metadata for Objects: A more advanced approach is to export each sound source (or groups of sources) from Audiocube separately along with an animation of their positions over time. For instance, Audiocube could output a set of mono audio files, each representing one object’s audio content (e.g., an instrument or a sound emitter), and a metadata file (possibly using the ADM format or a Dolby Atmos ADM BWF) that contains the trajectory (position keyframes) of these objects. Essentially this would be generating an ADM file with object metadata, which is a format that Dolby’s Renderer can ingest. ADM (Audio Definition Model) is an XML schema standardized by ITU to describe audio objects, channels, etc., in a scene (MPEG-H Audio vs. "Dolby Atmos" - there is a winner!). In practice, a Dolby Atmos workflow can import an ADM BWF created elsewhere. So if Audiocube could package its project audio into an ADM file, that might be the most seamless handoff: the user could take that ADM file and open it in a Dolby Atmos mastering workstation, and it would appear as the set of objects and beds ready to be rendered to a true Atmos deliverable.

Implementing ADM export would require Audiocube to generate an XML describing all audio “objects” with their properties. Since Audiocube internally knows every sound’s position and the audio signal, it’s quite feasible: for each sound, treat it as an audio object, capture its dry audio (not binaural-processed, but the raw source or maybe including distance attenuation but not HRTF since Atmos doesn’t need HRTF at source) and record its position timeline. This could then be written into an ADM file referencing the audio files. The format is complex but there are libraries and examples (the EBU ADM toolkit, etc.). There may also be simpler subset approaches if not all Atmos features are needed (e.g., maybe treat everything as objects, or designate some as part of a bed).

3. Real-time Pipeline via External Renderer: Another possibility is to integrate a third-party audio middleware like Audiokinetic Wwise or Dolby’s own game audio plugin if available, which can render directly to Atmos. Wwise, for example, has support for Atmos where it can output object metadata if used with a Dolby Atmos enabled pipeline. If Audiocube could send its audio to Wwise or another Atmos encoding runtime, it could capture an Atmos master file. However, this is likely overkill for a workstation like Audiocube and would complicate usage.
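
The naive bed render referenced in option 1 above could start from something like the sketch below: nominal 7.1.4 speaker directions (assumed placements), with each source spread across nearby speakers by cosine-weighted, energy-normalized gains. This is deliberately crude – it is neither VBAP nor Dolby’s renderer – but it shows the shape of the problem.

```python
import numpy as np

# Nominal 7.1.4 speaker directions (azimuth, elevation) in degrees -- assumed
# placements, not a calibrated layout. LFE is handled separately (no panning).
SPEAKERS_714 = {
    "L": (30, 0),    "R": (-30, 0),   "C": (0, 0),
    "Lss": (90, 0),  "Rss": (-90, 0), "Lrs": (135, 0), "Rrs": (-135, 0),
    "Ltf": (45, 45), "Rtf": (-45, 45), "Ltr": (135, 45), "Rtr": (-135, 45),
}

def unit_vector(az_deg: float, el_deg: float) -> np.ndarray:
    az, el = np.radians(az_deg), np.radians(el_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def pan_to_714(mono: np.ndarray, az_deg: float, el_deg: float) -> dict[str, np.ndarray]:
    """Crude cosine-weighted panner: speakers facing the source direction get
    more signal; gains are normalized to preserve energy. Not VBAP."""
    src = unit_vector(az_deg, el_deg)
    weights = {name: max(0.0, float(np.dot(src, unit_vector(*pos)))) ** 2
               for name, pos in SPEAKERS_714.items()}
    norm = np.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {name: (w / norm) * mono for name, w in weights.items()}

feeds = pan_to_714(np.random.randn(48000), az_deg=20, el_deg=30)  # per-speaker signals
```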

Given the above, the recommended approach is to implement an ADM BWF export function. This way, Audiocube doesn’t need to have a Dolby encoder per se; it just needs to output an open description of the audio scene (like an interchange format). The user can then use Dolby’s free or paid tools to convert that into the final packaged Atmos for distribution. This keeps Audiocube’s solution open (not tied to Dolby internals, except relying on ADM which is open).

Concretely, the workflow could be: The user performs their sound design in Audiocube’s 3D space. When ready to export spatial audio, they choose “Export to Atmos/MPEG-H (ADM)”. Audiocube then runs through the timeline, records each object’s track (maybe each track in Audiocube corresponds to an object or a group if too many sources), writes out an ADM XML with those object definitions and their XYZ positions over time, and outputs audio files for each. The user then opens this in a DAW with Dolby Atmos Production Suite – since Dolby’s tools do support importing ADM files to create an Atmos mix session (The Audio Definition Model and Delivery of next) – and from there, they can refine or directly export to a deliverable Atmos master (e.g., a .atmos file for theaters or .mp4 with DD+JOC for streaming, etc.).
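
To make the export step tangible, here is a heavily simplified, ADM-flavoured sketch of writing two objects with timed positions. It is schematic only: a real BS.2076 ADM BWF has a much richer element set (audioProgramme, audioContent, audioTrackUID references, timing formats, and the audio embedded in the BWF itself) and would normally be produced with a dedicated library such as the EBU ADM toolkit; the element and attribute names below are abbreviated for illustration, and the stem paths are hypothetical.

```python
import xml.etree.ElementTree as ET

# Each object: (name, wav stem, list of (start_s, azimuth_deg, elevation_deg, distance)).
objects = [
    ("synth_pad", "stems/synth_pad.wav", [(0.0, -45.0, 10.0, 1.0), (8.0, 45.0, 10.0, 1.0)]),
    ("perc_loop", "stems/perc_loop.wav", [(0.0, 120.0, 0.0, 2.0)]),
]

root = ET.Element("audioScene")            # simplified stand-in for an ADM document
for name, wav, blocks in objects:
    obj = ET.SubElement(root, "audioObject", {"name": name, "audioFile": wav})
    for start, az, el, dist in blocks:
        blk = ET.SubElement(obj, "audioBlockFormat", {"rtime": f"{start:.3f}"})
        ET.SubElement(blk, "position", {"coordinate": "azimuth"}).text = f"{az:.1f}"
        ET.SubElement(blk, "position", {"coordinate": "elevation"}).text = f"{el:.1f}"
        ET.SubElement(blk, "position", {"coordinate": "distance"}).text = f"{dist:.2f}"

ET.ElementTree(root).write("audiocube_scene_adm_sketch.xml",
                           encoding="utf-8", xml_declaration=True)
```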

It’s worth noting that currently Audiocube is focused on music/spatial sound design rather than syncing with picture, so an Atmos export would likely be aimed at music or VR soundscapes rather than film mixes. This is fine, as Atmos Music is an emerging area (supported by Apple, Amazon, Tidal). Atmos music typically uses fewer objects (often the bed plus a dozen or so objects for instruments). Audiocube could actually be a very intuitive front-end for creating such mixes, if it allowed placement of instruments in a 3D space and then export to Atmos.

Finally, for headphone users, one might ask: why bother with Atmos if Audiocube already makes binaural? The reason is compatibility – if someone creates a spatial audio piece in Audiocube and wants it available on Apple Music or played on an Atmos sound system, doing an Atmos export is necessary. For example, Apple’s Spatial Audio with AirPods is effectively Dolby Atmos under the hood; to get Audiocube content on there, you’d need to go through an Atmos workflow. So this could be a valuable feature.

Audiocube to MPEG-H 3D Audio Workflow

Exporting to MPEG-H could be approached in a similar vein to Atmos, given that MPEG-H also supports an ADM-based workflow. In fact, the ADM model was in part developed to ensure interoperability between systems like MPEG-H and Atmos. An ADM file from Audiocube could potentially be fed either into a Dolby workflow or into an MPEG-H workflow. Fraunhofer provides the MPEG-H Authoring Suite (MAS), which includes an ADM import tool and an MPEG-H encoder (Learn - MPEG-H Audio). Audiocube users could use this to create an MPEG-H bitstream from the same export. Alternatively, Audiocube could integrate directly with Fraunhofer’s MPEG-H encoder libraries (if available) to create an MPEG-H file.

Let’s outline the integration: Audiocube, by exporting an ADM, already covers the description of channels/objects. MPEG-H specifically can also carry Higher Order Ambisonics, so another route is that Audiocube could export a higher-order Ambisonic mix (if Audiocube’s engine can synthesize HOA). But probably sticking to objects via ADM is simpler and more general.

If the aim is to produce, say, a 360 Reality Audio file (which is an application of MPEG-H for music), Audiocube could facilitate that. 360 Reality Audio uses MPEG-H to allow headphone playback of music where instruments are placed around the listener. That system expects perhaps an MPEG-H MP4 file with the audio objects. Audiocube’s spatial interface is quite fitting for authoring such content. So an export to MPEG-H feature could target generating an .MP4 file with an MPEG-H audio bitstream. This could possibly be done by invoking the Fraunhofer encoder (the MPEG-H Authoring Suite includes a command-line encoder that could be automated, subject to licensing).

Recommendation: Use the same ADM export from the Atmos section to double-serve for MPEG-H. Ensure the ADM includes all necessary info (likely it will, as ADM is quite exhaustive). Then provide guidance or integration for encoding: for instance, Audiocube could detect if the Fraunhofer encoder is installed and call it, or simply output instructions: “Use the MPEG-H Authoring Tool to import the ADM and generate MPEG-H.” Perhaps Audiocube could even bundle a simplified encoder if allowed. At minimum, documentation for users on how to take the ADM export into MPEG-H MAS would be useful.

Another approach: Audiocube could output an HOA Ambisonic track as intermediate, since MPEG-H can take HOA input. If, for example, Audiocube finds it easier to just record a 3rd order Ambisonic mix of the scene (maybe easier than handling dozens of objects), it could do that. A 3rd order mix (16 channels) capturing the whole scene could be encoded in MPEG-H as scene-based audio. The advantage here is that it captures the entire sound field in one go. The disadvantage is you lose the ability to edit individual elements or have interactive control after the fact (it becomes like a “live recording”). For certain use cases (like a static soundscape) this might be fine and simpler. However, since Audiocube’s strength is in interactive objects and moving elements, preserving them as objects is likely better.

One consideration is the difference in metadata between MPEG-H and Atmos. While both can be described by ADM, MPEG-H has features such as “presets” (alternative mixes the listener can select) that have no direct equivalent in Atmos. Audiocube probably does not need to support those advanced features initially; getting the basic objects and beds out is enough.

Capturing Audiocube’s 3D Audio – General Recommendations

In summary, to capture and output Audiocube’s 3D audio in the mentioned formats, the following recommendations emerge:

  • Implement Multi-Channel Capture in Audiocube: Upgrade Audiocube’s export options beyond stereo. At minimum, support recording a multichannel WAV of the scene. Ideally, support common layouts (quad, 5.1, 7.1, etc.) and especially first-order Ambisonics as an export option. This aligns with the developer’s note that multichannel output is being worked on (Nuxt HN | Show HN: Audiocube – A 3D DAW for Spatial Audio). Focus on Ambisonics first, as it covers full 3D and can feed into other pipelines (and it is immediately useful for VR/360 videos).

  • Provide Open Scene Description Export (ADM): Add an export feature (possibly named “Export Spatial Mix”) that generates an ADM BWF. This involves writing either a single polyphonic WAV containing all channels or multiple WAV files, plus the metadata XML; a simplified sketch of the object metadata ADM carries appears after this list. The export could allow choosing a target configuration, e.g. “Export as objects (ADM)” or “Export as 7.1.4 bed”. ADM gives flexibility and ensures Audiocube is not locked into one format’s proprietary structure.

  • Leverage External Tools for Final Encoding: Since direct encoding to Dolby Atmos or MPEG-H from within Audiocube could be burdensome, the strategy should be to integrate with external authoring tools. For Atmos, the Dolby Atmos Production Suite (in DAWs) or the Dolby Atmos Renderer could be the next step. For MPEG-H, the Fraunhofer MPEG-H Authoring Suite would be used. Audiocube’s documentation or UI can guide users: e.g., after export, instruct “To create a Dolby Atmos file, import the exported ADM.wav into Dolby Atmos Renderer.” Possibly, Audiocube could automate launching these if installed, but that might not be necessary.

  • Testing and Calibration: When implementing Ambisonic or channel-based export, calibration is key. For Ambisonics, verify that Audiocube’s output matches the expected convention (for instance, place a known sound due front and ensure it appears in the appropriate combination of W, X, Y and Z, as in the first-order encoding sketch earlier). For channel layouts, ensure panning in Audiocube corresponds to the correct channel balances in the exported layout. Unity’s own limitation (its audio listener downmixes to stereo) taught us that capturing spatial mixes isn’t trivial (Getting Spatial Audio from Unity Session Recorders – VRAASP Project) – one must bypass or override the standard audio pipeline. Audiocube’s custom engine already bypasses Unity, so it likely manages individual source signals; the capture could be done by routing those signals into a recording buffer.

  • Real-Time vs Offline Export: Decide whether export is offline (faster-than-real-time rendering without audible playback, like bouncing a mix) or real-time (recording as the scene plays). Offline is preferable for accuracy and speed: Audiocube could simulate the physics and audio for each frame of the timeline and sum the result to the output channels without sending anything to the DAC, ensuring a clean capture unaffected by real-time hiccups.

  • Binaural Rendering Remains for Monitoring: Even after adding these formats, Audiocube will likely keep binaural stereo as the primary monitoring output (since most users will wear headphones while designing in the app). The spatial-format export is an additional step for distribution or further mixing. One might also consider adding a binaural render of the Atmos mix as a monitoring option: if a user plans an Atmos output, they may want to monitor a binaural approximation of the speaker layout, and Dolby provides an HRTF-rendered preview of Atmos mixes in its own tools. Audiocube could adopt a generic HRTF approach to simulate, say, a room of speakers during design. But this may be more than is needed at first; it is enough to make clear that the binaural output is for monitoring and quick demos, whereas the exported spatial format is for professional delivery.

  • Occlusion/Reflection in Exports: Audiocube’s engine simulates environmental effects (reflections, occlusion, etc.). How should these be captured? If exporting object-based audio, the reflections are part of each object’s audio (or may become separate objects if the engine generates echo objects). If exporting Ambisonics, the reflections and reverb are naturally summed into the sound-field capture. That should be fine; just ensure the spatial character of the reverb is preserved (for example, Audiocube’s reverb could be rendered natively in Ambisonics, using an Ambisonic reverb or convolution, so that it truly surrounds the listener).

  • Validation: Validate the exported files by re-importing them into known-good reference environments. For Ambisonics, play the exported .amb file through a trusted Ambisonic-to-binaural decoder and check that it matches Audiocube’s own binaural output. For ADM, import it into a DAW with Dolby Atmos support and check that the spatial positions of objects match where they were in Audiocube. This will flush out any coordinate-system mismatches; for instance, Audiocube’s coordinate axes may need mapping onto ADM’s coordinate definitions (what Audiocube considers “forward” should map to 0° azimuth in ADM, and so on), as in the conversion sketch below.
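
As a concrete illustration of that mapping, here is a minimal sketch that converts a Unity-style position (x right, y up, z forward; an assumption about Audiocube’s internal convention, since it is built on Unity) into the polar coordinates ADM uses (azimuth 0° at the front, positive toward the left; elevation positive upward). The function name is hypothetical.

```python
import math

def unity_to_adm_polar(x: float, y: float, z: float):
    """Map a Unity-style position (x right, y up, z forward) to ADM polar coordinates.

    ADM (ITU-R BS.2076) expresses object positions as azimuth (0 deg at the front,
    positive toward the left), elevation (positive upward) and distance. The
    Unity-style axes are an assumption about Audiocube's internal convention.
    """
    distance = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(-x, z))                 # +x (right) -> negative azimuth
    elevation = math.degrees(math.atan2(y, math.hypot(x, z)))
    return azimuth, elevation, distance

# A source one metre to the listener's left should come out near +90 deg azimuth.
print(unity_to_adm_polar(-1.0, 0.0, 0.0))  # (90.0, 0.0, 1.0)
```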

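For the ADM metadata itself (the “Export Spatial Mix” idea above), the following sketch builds the kind of element a single object contributes: an audioChannelFormat of type Objects holding one audioBlockFormat with a polar position. Element and attribute names follow ITU-R BS.2076, but this is a simplified illustration rather than a complete ADM document, which also requires audioProgramme, audioContent, audioObject, audioPackFormat and audioTrackUID elements plus the chna chunk in the BWF file; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

def object_channel_format(name: str, azimuth: float, elevation: float, distance: float) -> ET.Element:
    """Build a simplified ADM audioChannelFormat (type Objects) with one block format.

    Element and attribute names follow ITU-R BS.2076, but a complete ADM document
    needs many more elements (audioProgramme, audioObject, audioTrackUID, ...)
    plus the chna chunk linking tracks in the BWF to this metadata.
    """
    acf = ET.Element("audioChannelFormat", {
        "audioChannelFormatID": "AC_00031001",
        "audioChannelFormatName": name,
        "typeLabel": "0003",
        "typeDefinition": "Objects",
    })
    block = ET.SubElement(acf, "audioBlockFormat", {
        "audioBlockFormatID": "AB_00031001_00000001",
        "rtime": "00:00:00.00000",
        "duration": "00:00:10.00000",
    })
    for coord, value in (("azimuth", azimuth), ("elevation", elevation), ("distance", distance)):
        pos = ET.SubElement(block, "position", {"coordinate": coord})
        pos.text = f"{value:.2f}"
    return acf

print(ET.tostring(object_channel_format("Object_1", 90.0, 0.0, 1.0), encoding="unicode"))
```
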
By following these recommendations, Audiocube could become one of the first 3D DAWs that not only creates spatial audio content easily, but also outputs it in the formats required for distribution – bridging the gap between creative sound design and industry-standard delivery.

Conclusion

Spatial audio formats have evolved from experimental beginnings (stereo, surround) into sophisticated modern systems that can envelop listeners in three-dimensional sound. We reviewed how these formats developed historically – from early binaural demos in the 19th century, to Ambisonics in the 1970s, to today’s object-based standards like Dolby Atmos and MPEG-H. We then delved into the technical underpinnings of spatial audio storage: the differences between channel-based mixes, object-based audio with metadata, scene-based Ambisonic representations, and the unique role of binaural audio for headphone playback. A comparative analysis of Ambisonics, Atmos, and MPEG-H highlighted that each format has distinct advantages (flexibility, precision, openness) and trade-offs (complexity, compatibility, ownership).

For a tool like Audiocube – a cutting-edge 3D audio workstation – supporting these formats is the next logical step to empower creators. We explored how Audiocube, built on Unity with a custom audio engine, can integrate with spatial formats. The key lies in capturing the rich 3D audio scene Audiocube generates and converting it into standard spatial audio files. Ambisonics offers an immediate route to record a full 360° soundfield from Audiocube, which can be used for VR content or further decoding. Dolby Atmos and MPEG-H, while more complex, can be reached through an open ADM workflow, allowing Audiocube to hand off its spatial sound scene to those ecosystems without locking itself into proprietary constraints. By implementing multichannel and metadata exports, Audiocube can ensure that the amazing interactive soundscapes users create can live outside the application – whether it’s in a VR video on YouTube (with Ambisonic audio), on a streaming service with Dolby Atmos, or in next-gen broadcasts with MPEG-H.

In closing, spatial audio is a rapidly growing field, and bridging creative tools with delivery formats is crucial. With the recommendations outlined – Ambisonic capture and ADM-based export – Audiocube could become a pivotal tool for audio engineers and developers, allowing them to prototype, design, and export immersive audio in a streamlined workflow. This helps push the industry toward more accessible spatial audio production, ultimately leading to richer and more immersive listening experiences for end users, regardless of platform or device.

Next: R³ – Real-Time Audio Bridging Between Audiocube and External DAWs