- Merlin Sunley
Stereo, Surround and Binaural Audio in Games: A short history with case studies
This article details key developments in the history of spatial audio in games and provides
three case studies on the use of spatial audio technology within a game audio context. I also examine a selection of spatial psychoacoustic phenomena, their use and potential to drive narrative and amplify a sense of immersion.
Two of the most revolutionary technologies being used ingames, binaural/HRTF audio and the Ambisonics format went unutilized for decades before their adoption into the consumer market. A brief introduction to their history and usage are
Binaural refers to hearing using two ears, but the term has evolved to include all spatial cues
from the ears, body and head of the listener (Roginska & Geluso, 2017). In 1881 the first
binaural transmission system was developed by Clement Ader. It consisted of four pairs of
microphones connected to binaural telephone receivers. These were located in four remote
listening rooms at the Paris Grand Opera. One patron noted “In listening with both ears at
the two telephones, the sound takes a special character of relief and localization which a
single receiver cannot produce” (Torick, 1998). Binaural cues are received by the eardrum
via diffraction of sound which is described by the HRTF or Head Related Transfer Function
(Vorländer, 2004). This is essentially the impulse response of a human head, ears and body.
Capturing the auditory spatial effects of the pinnae in a manner similar to the way
convolution reverb captures the impulse of a room, allowing it to be overlaid over an
unrelated audio source, colouring the sound with its unique spatialized imprint (Farnell,
Developed in the 1970’s by a team from Oxford headed by Michael Gerzon, ambisonics is
considered one of the most realistic and advanced spatial audio systems available (Murphy
& Neff, 2010). The system relies on using particular microphone patterns or specially built
Soundfield Microphones to capture audio in a way compatible with the format. Recent
advances in VR have precipitated a new interest in ambisonic audio. According to Schütze
et al (2018) this is because ambisonics is not restricted to a specific number of channels. A
larger number of channels simply provides higher directional resolution with full spherical
directionality (including elevation) being achievable with only four channels.
Spatial Audio in Video Games
Surround sound within computer games first came into use with the release of the Super
Nintendo in 1990 (Rees-Jones, 2018). The SNES utilized Dolby Surround, which meant that
games produced using Dolby encoded soundtracks could be played back with the addition of a Dolby Surround decoder (Hull, 2004). According to Mike Clarke, an audio programmer at Psygnosis, this worked as follows: With each of the SNES’s 8 channels it was possible to set the L and R output independently. If you set L to positive and R to the same value but
negative or vice versa, the phase inversion would be detected by the surround decoder and
the sound would be sent to the rear speaker. This was, according to Clarke, a “total waste of
time”. Due to phase cancellation inherent in a standard stereo mix, random noise would
often be output from the rear speaker and if the SPC700 reverb was enabled, comb filtering
from the delays would send most of the reverb to the rear. Ultimately at the time most
players experienced SNES audio through a mono TV speaker meaning that “if you
intentionally put a sound through a Dolby Surround rear channel you were also intentionally
guaranteeing that most people would hear silence instead of that sound” (Sunley, 2020).
There were some notable examples of surround sound in SNES games. The additional
surround channel was often used to play ambient sounds or music, thus adding an additional
level of immersion to the experience. An example of this is the game Secret of Mana 2.
During certain cutscenes wind and fire envelop the player, whilst during battles the sound
effects are mapped to the front speakers allowing the soundtrack to continue playing through the rear (Lara, 2013).
While earlier examples of surround audio in games are numerous, (see appendix) usage is
often limited to increasing immersion through soundscape rather than performing a useful
ludic function. Perhaps the first such examples of games to use spatial audio as a core
gameplay element are from the FPS genre. Within 3D games, spatial audio provides the
ability to localize a sound source in 3 dimensions around a player giving instantaneous and
instinctual feedback of unseen and ever changing game-states (Collins, 2008).
A key development in true spatial audio for games was Aureal’s Vortex sound card and A3D
API in 1997. The sound card used hardware acceleration to model the three-dimensional
sound environment using HRTF (Chase, 2016). Around the same time Microsoft shipped
DirectX, a series of multimedia APIs designed to improve the speed that sound and graphics
cards could communicate, granting much greater control of real-time mixing and output
(Collins, 2008). This allowed the type of signal processing required to position sound objects
within a virtual space such as distance based attenuation and doppler shift to gain
widespread usage (Collins, 2008).
This marks somewhat of a golden age for hardware accelerated 3D audio. Following the
release of Windows Vista in 2006, Microsoft discontinued the DirectSound3D API (Schmidt,
n.d.) and while it was commonly believed that hardware acceleration was no longer needed,
true spatial audio in games suffered greatly (Chase, 2016). Software implementations in
games lacked the computational power needed to create truly 3D audio. But while hardware
accelerated 3D audio declined during this period as lower quality software alternatives filled
the void, channel based surround sound began to achieve standardization within the
Standardized use of 5.1 surround in video games arrived with the Xbox in 2001. Microsoft
used the DDL (Dolby Digital Live) codec in the console, which allowed for real-time encoding and decoding of 5.1 audio (Horowitz, 2014). One of the flagship games to take advantage of this advancement was Halo: Combat Evolved. According to Marty O'Donnell the 5.1 Surround in Halo was the aspect of the audio he was most pleased with. It was not without its issues however. The primary challenge encountered involved relinquishing control of channel sends to the audio engine based on camera position. In order to get character
animations to lip sync to the relevant sound and utilize room DSP, the developers attached
sound tags to the character model whose dialogue is currently playing. This caused sound to
transmit from the appropriate speaker based on the NPC’s relative position to the player
camera. This meant that speech files would jump suddenly from front to rear during close up
shots (O’Donnell, 2002).
Flash forward to 2020 and with the increased commercial viability of VR leading to a
resurgence in the use of binaural and HRTF technology, not to mention 20 years of
standardized 5.1 sound in games, what effect have these advances had on the design of
games and what opportunities did they create for game-sound designers? In the following
section a breakdown of the use of spatial sound will be discussed in three case-studies.
These games have been chosen specifically due to praise they received for exceptional use
of stereo, 5.1 and 3D audio.
Case Study 1 - Stereo: Inside (Playdead Software)
Inside is a 2.5D puzzle platformer noted for its intense atmosphere and audio-driven
gameplay. Absent of any dialogue and with a purposefully obscurantist narrative the game
presents a series of environments in which sound is closely integrated with the process of
puzzle solving thus close attention to auditory cues in the environment is critical to success
Upon starting the game a strong connection between the player and protagonist is
established via what the developers termed the character’s Voice (Schmid, 2016). This
consists of meticulously programmed footsteps and breathing sounds presented primarily in
mono. Early in the game a heartbeat sound is slowly pushed into the stereo field just prior to
the first encounter with the “marching husks” the player is subtly propelled to a heightened
sense of tension and connection with the protagonist. This is possibly due to the subversion
of expectation caused by a gradual expansion of character sound from the centre to sides of
the stereo field. The use of a stereo heartbeat also serves a ludic function, that of
persistently looping audio providing a metronome that guides the player through a battery of timing based puzzles (a mechanic used numerous times throughout Inside) (Arnold, 2018).
The game often makes use of spatialization in order to define a sense of contrast, an
example of this is during an early puzzle in which the player must lure some chicks into a
thresher in order to knock a piece of hay off a wooden beam. Small environmental details i.e.
the chicks are a mono point source that move around the stereo field, while the room tone
and reverbs occupy a much wider space.
In fact, hard panning of sound sources is never used trivially and is often tied to a puzzle
mechanic. This further enhances the synchronisation between gameplay mechanics and
narrative, greatly deepening the sense of immersion (Ash, 2016).
Additionally, the game rarely uses non-diegetic sound, so when musical stingers are played
Andersen makes highly effective use of the stereo field, the stingers take on an almost
architectural function, defining the space the boy is inhabiting. A great example of this is not
long after the boy discovers an underwater submersible. Under the water the audio is
narrowed significantly in the mix giving a claustrophobic feel to the scene. Upon breaking
through a wooden barrier a gigantic pad sound erupts from nowhere as the wooden planks
explode and sink slowly into a colossal underwater area. This sudden switch between mono
and stereo is used to powerfully convey a sense of space.
These sorts of panning and mix decisions are too numerous to describe as throughout the
game the localization of sounds in the stereo field is purposeful. One notable example
includes the sonification and panning of moving light sources found in a number of locations.
It has already been mentioned that Andersen believed that a surround mix would actually be
detrimental to player immersion and the way in which he makes full and often highly
calculated use of the stereo field to convey relevant ludic information without breaking player immersion presents a masterclass in meaningful game audio design.
Case Study 2 - 7.1 Surround: Alien Isolation (Creative Assembly)
Alien: Isolation is a survival-horror FPS set 15 years after the events of the original Ridley
Scott film Alien. According to sound designers from Creative Assembly, audio was intended
as a core aspect of the gameplay from the project’s inception. designer Sam Cooper stated
that the entire soundtrack was designed to be “contextually emotive throughout” (Ramsay,
Based on a critical listening session it is possible to detect the use of a number of
psychoacoustic phenomena including conditioning, signal listening, the cocktail party effect
and auditory pareidolia. These are combined in various ways to manipulate the player into
undergoing an extremely tense, immersive experience.
More commonly understood as a visual phenomenon, pareidolia is the perception of patterns in randomness where none exists. For example seeing faces in inanimate objects or hearing obscured messages in music (Jaekl, 2017). This effect is used in a deliberate way in the game. This is achieved with subtle layering of alien vocalizations with the sound of closing doors or corridor groans (Broomhall, 2015). These sounds are highly localized allowing the player to accurately judge the distance and location of the sound while maintaining ambiguity.
Signal listening occurs when the brain is trained on an expected sound event (Farnell, 2010).
The sheer frequency of large ambiguous sounds such as the aforementioned corridor
groans and stochastic bursts of environmental sounds conditions the player to be constantly
vigilant. Combined with the effects of pareidolia whether intentionally implemented or not,
induces a constant state of anxiety in the player.
The Cocktail Party Effect
An area where the game truly shines is the underlying system used to power emergent
moments of tension. This comes in the form of a dynamic system designed to replicate the
ability of the ear to filter out unwanted or unexpected sounds (Broomhall, 2015). An example
of this is during moments when the player is forced to hide from the Xenomorph. Real time
data is used to drive RTPC’s within Wwise, highlighting the player foley and the alien and
attenuating ambient sounds bringing the tension into sharp focus (Broomhall, 2015).
It was a stated intention of the creators of Alien: Isolation to wield diverse psychoacoustic
phenomena in service of inducing terror in players and in this regard they are successful.
The use of responsive audio systems to replicate these low-level psychophysiological effects
is a field of audio design at a stage of relative infancy. However, even 6 years after it’s
release, Alien: Isolation presents an extremely high bar for spatialization in game audio.
Case Study 3 - Binaural: Half-Life: Alyx
Valve’s flagship VR game, Half-Life: Alyx utilizes Valve’s Steam Audio SDK (Steam DB
Team, 2020). Representing, alongside Oculus Audio and Google Resonance, a popular
resurrection of two long neglected technologies: Ambisonics and HRTF filtered positional
audio. According to Dave Feise, sound designer at Valve, one of the biggest challenges
involved was getting sonic elements from Half-Life and Half-Life 2 to fit into the VR world in a
realistic manner. Many of the sounds were simple in terms of frequency content and as such
lacked a sense of localization. In order to spatialize better with HRTF the sounds needed to
be “dirtied up” by adding more frequency content (Walden, 2020).
Two creative hurdles experienced according to designer Roland Shaw included creating a
sense of scale and detail (Walden, 2020). This works well with large set piece sounds e.g.
during the beginning of the game where a “Strider” walks across the rooftops destroying
masonry as it makes its way through City 17. It gives an impressive sense of scale as it
approaches and attenuates very realistically as it moves further away.
Half-Life Alyx - Construction Strider 14:54
The same cannot be said for the minor handling details (of which there are many). The
weight of handling environmental based objects appears to lack a sense of physicality.
There is no heft to them as if in the real world. Due to synchresis, borrowing film-sound
terminology (Chion, 1994) it is impossible to decouple the sound from its source. This is due
to the fact that the box does what you would broadly expect a box to do when you pick it up
and throw it. You know what a box is supposed to sound like and what you hear
approximates that sound. But from the authors perspective, the immersion is broken due to
the fact it is a procedurally generated, approximation of the sound of a box. This effect
relates to Gestalt principles of expectancy (McClean, 2005) indicating potential future
avenues of research into the psychological relationships between sound and vision in VR
Realism enhancing aspects of this apparently simple example could be quite
computationally wasteful. For example recreating the unique properties of the reverb
emanating from within the cardboard box. Impact resonances scalable with the size and
material composition of the impacted medium. This is already achievable with more realistic
procedural models or high resolution audio ray tracing (NVIDIA, 2019), and both
technologies are now gaining more traction as computational technology improves.
Alternatively, simple eq and saturation on the physics based sounds could be used and a
priority assigned to these effects in order that they are dialled down until idle scenes or
moments where little action or music is occurring. Ultimately though, novelty aside, these
sorts of considerations are easily overlooked. Why spend tens or hundreds of man hours
designing a system only a small percentage of players will appreciate or experience?
Half-Life: Alyx Physics Interactions
Overall, the sound of Alyx is exceptional, however it does highlight the current limitations of
game audio technology. Arguably the discontinuation of DirectSound3D in the early 00’s set
truly immersive 3D audio back by almost a decade. Games such as Alyx represent a new
era in truly spatialized game audio. Along with Google’s Resonance, Sony's inclusion of
HRTF technology in the PS5, Oculus’s Native Spatializer Plugin (ONSP) all amongst others
mark a long awaited resurrection of the technology into more common usage.