Merlin Sunley

Stereo, Surround and Binaural Audio in Games: A short history with case studies

This article details key developments in the history of spatial audio in games and provides three case studies on the use of spatial audio technology in a game audio context. I also examine a selection of spatial psychoacoustic phenomena, their use, and their potential to drive narrative and amplify a sense of immersion.


Two of the most revolutionary technologies used in games, binaural/HRTF audio and the Ambisonics format, went unutilized for decades before their adoption into the consumer market. A brief introduction to their history and usage is outlined below.


Binaural/HRTF


Binaural refers to hearing using two ears, but the term has evolved to include all spatial cues

from the ears, body and head of the listener (Roginska & Geluso, 2017). In 1881 the first

binaural transmission system was developed by Clement Ader. It consisted of four pairs of

microphones connected to binaural telephone receivers. These were located in four remote

listening rooms at the Paris Grand Opera. One patron noted “In listening with both ears at

the two telephones, the sound takes a special character of relief and localization which a

single receiver cannot produce” (Torick, 1998). Binaural cues are received by the eardrum via the diffraction of sound, which is described by the HRTF, or Head-Related Transfer Function (Vorländer, 2004). This is essentially the impulse response of a human head, ears and body. An HRTF captures the auditory spatial effects of the pinnae in a manner similar to the way convolution reverb captures the impulse response of a room, allowing it to be overlaid on an unrelated audio source and colouring that sound with its unique spatialized imprint (Farnell, 2013).
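As a minimal sketch of this idea, binauralizing a mono source amounts to one convolution per ear with a measured HRIR pair (the impulse responses below are crude synthetic placeholders, not real measurements):

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono, hrir_left, hrir_right):
    """Convolve a mono signal with a left/right HRIR pair,
    returning a two-channel (stereo) binaural signal."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)

# Placeholder HRIRs for a source off to the listener's left; in practice
# these would come from a measured HRTF database, not stubs like these.
fs = 44100
hrir_l = np.zeros(128); hrir_l[0] = 1.0    # near ear: earlier, louder
hrir_r = np.zeros(128); hrir_r[30] = 0.6   # far ear: ~0.7 ms later, quieter

mono = np.random.randn(fs)                 # one second of noise
stereo = binauralize(mono, hrir_l, hrir_r)
print(stereo.shape)                        # (44227, 2)
```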



Ambisonics


Developed in the 1970s by a team from Oxford headed by Michael Gerzon, Ambisonics is considered one of the most realistic and advanced spatial audio systems available (Murphy & Neff, 2010). The system relies on particular microphone patterns or specially built Soundfield microphones to capture audio in a way compatible with the format. Recent advances in VR have precipitated renewed interest in ambisonic audio. According to Schütze et al. (2018), this is because Ambisonics is not restricted to a specific number of channels: a larger number of channels simply provides higher directional resolution, with full spherical directionality (including elevation) achievable with only four channels.
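To illustrate why four channels suffice, here is a minimal sketch of first-order B-format encoding, assuming the traditional W-channel weighting of 1/√2 (the azimuth and elevation values are arbitrary examples):

```python
import numpy as np

def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal into first-order B-format (W, X, Y, Z).
    Four channels carry full-sphere directional information: W is the
    omnidirectional pressure component; X, Y and Z are figure-of-eight
    components along the front, left and up axes."""
    w = signal * (1.0 / np.sqrt(2.0))  # traditional B-format weighting
    x = signal * np.cos(azimuth) * np.cos(elevation)
    y = signal * np.sin(azimuth) * np.cos(elevation)
    z = signal * np.sin(elevation)
    return np.stack([w, x, y, z])

sig = np.random.randn(44100)
# Example: a source 90 degrees to the left, 30 degrees above the horizon.
bformat = encode_foa(sig, np.deg2rad(90), np.deg2rad(30))
print(bformat.shape)  # (4, 44100)
```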



Spatial Audio in Video Games


Surround sound within computer games first came into use with the release of the Super

Nintendo in 1990 (Rees-Jones, 2018). The SNES utilized Dolby Surround, which meant that

games produced with Dolby-encoded soundtracks could be played back with the addition of a Dolby Surround decoder (Hull, 2004). According to Mike Clarke, an audio programmer at Psygnosis, this worked as follows: each of the SNES's eight channels allowed the left and right output levels to be set independently, and if L was set to a positive value and R to the same value but negative (or vice versa), the phase inversion would be detected by the surround decoder and the sound would be sent to the rear speaker. This was, according to Clarke, a “total waste of time”. Due to the out-of-phase content inherent in a standard stereo mix, random noise would often be output from the rear speaker, and if the SPC700 reverb was enabled, comb filtering from its delays would send most of the reverb to the rear. Ultimately, at the time, most players experienced SNES audio through a mono TV speaker, meaning that “if you intentionally put a sound through a Dolby Surround rear channel you were also intentionally guaranteeing that most people would hear silence instead of that sound” (Sunley, 2020).
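A toy sketch of the matrix arithmetic Clarke describes (illustrative only: real Dolby decoders add phase shifts and active steering). It also shows why mono playback yields silence:

```python
import numpy as np

voice = np.random.randn(1000)

# Clarke's trick: equal gain on left and right, but inverted polarity on R.
Lt = +0.5 * voice
Rt = -0.5 * voice

# A passive matrix decoder derives the surround feed from the difference,
# while a mono TV speaker reproduces the sum.
rear = Lt - Rt   # = voice: the sound appears in the rear speaker
mono = Lt + Rt   # = 0: mono listeners hear silence instead of the sound

print(np.max(np.abs(rear)), np.max(np.abs(mono)))  # nonzero, 0.0
```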

There were nevertheless some notable examples of surround sound in SNES games. The additional surround channel was often used to play ambient sounds or music, adding a further level of immersion to the experience. An example is the game Secret of Mana 2: during certain cutscenes, wind and fire envelop the player, whilst during battles the sound effects are mapped to the front speakers, allowing the soundtrack to continue playing through the rear (Lara, 2013).


While earlier examples of surround audio in games are numerous (see appendix), usage is often limited to increasing immersion through soundscape rather than performing a useful ludic function. Perhaps the first games to use spatial audio as a core gameplay element come from the FPS genre. Within 3D games, spatial audio provides the ability to localize a sound source in three dimensions around a player, giving instantaneous and instinctual feedback about unseen and ever-changing game states (Collins, 2008).


A key development in true spatial audio for games was Aureal's Vortex sound card and A3D API in 1997. The sound card used hardware acceleration to model the three-dimensional sound environment using HRTFs (Chase, 2016). Around the same time, Microsoft shipped DirectX, a series of multimedia APIs designed to improve the speed at which sound and graphics cards could communicate, granting much greater control over real-time mixing and output (Collins, 2008). This allowed the kind of signal processing required to position sound objects within a virtual space, such as distance-based attenuation and Doppler shift, to gain widespread usage (Collins, 2008).
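As a rough sketch of those two effects (the reference distance and speed of sound below are generic textbook values, not taken from any particular API):

```python
SPEED_OF_SOUND = 343.0  # m/s in air

def inverse_distance_gain(distance, ref_distance=1.0):
    """Classic inverse-distance attenuation: gain halves (-6 dB)
    each time the distance doubles beyond the reference distance."""
    return ref_distance / max(distance, ref_distance)

def doppler_factor(source_speed_towards_listener):
    """Pitch ratio for a moving source and stationary listener:
    f' = f * c / (c - v), with v positive when approaching."""
    return SPEED_OF_SOUND / (SPEED_OF_SOUND - source_speed_towards_listener)

print(inverse_distance_gain(8.0))  # 0.125 -> about -18 dB at 8 m
print(doppler_factor(20.0))        # ~1.06: approaching source sounds sharper
```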



This marked something of a golden age for hardware-accelerated 3D audio. Following the release of Windows Vista in 2006, Microsoft discontinued the DirectSound3D API (Schmidt, n.d.), and while it was commonly believed that hardware acceleration was no longer needed, true spatial audio in games suffered greatly (Chase, 2016). Software implementations in games lacked the computational power needed to create truly 3D audio. But while hardware-accelerated 3D audio declined during this period and lower-quality software alternatives filled the void, channel-based surround sound began to achieve standardization within the industry.


Standardized use of 5.1 surround in video games arrived with the Xbox in 2001. Microsoft used the DDL (Dolby Digital Live) codec in the console, which allowed real-time encoding and decoding of 5.1 audio (Horowitz, 2014). One of the flagship games to take advantage of this advancement was Halo: Combat Evolved. According to Marty O'Donnell, the 5.1 surround in Halo was the aspect of the audio he was most pleased with. It was not without its issues, however. The primary challenge involved relinquishing control of channel sends to the audio engine based on camera position. In order to get character animations to lip-sync to the relevant sound and utilize room DSP, the developers attached sound tags to the character model whose dialogue was currently playing. This caused sound to be emitted from the appropriate speaker based on the NPC's position relative to the player camera, which meant that speech would jump suddenly from front to rear during close-up shots (O'Donnell, 2002).


Flash forward to 2020: with the increased commercial viability of VR leading to a resurgence in the use of binaural and HRTF technology, not to mention 20 years of standardized 5.1 sound in games, what effect have these advances had on the design of games, and what opportunities did they create for game-sound designers? The following section examines the use of spatial sound in three case studies. These games have been chosen specifically for the praise they received for exceptional use of stereo, 5.1 and 3D audio.


Case Study 1 - Stereo: Inside (Playdead Software)



Inside is a 2.5D puzzle platformer noted for its intense atmosphere and audio-driven gameplay. Devoid of any dialogue and with a purposefully obscurantist narrative, the game presents a series of environments in which sound is closely integrated with the process of puzzle solving; close attention to auditory cues in the environment is thus critical to success (Aghoro, 2019).


Upon starting the game, a strong connection between the player and protagonist is established via what the developers termed the character's Voice (Schmid, 2016). This consists of meticulously programmed footstep and breathing sounds presented primarily in mono. Early in the game, a heartbeat sound is slowly pushed into the stereo field just prior to the first encounter with the “marching husks”, subtly propelling the player towards a heightened sense of tension and connection with the protagonist. This is possibly due to the subversion of expectation caused by a gradual expansion of character sound from the centre to the sides of the stereo field. The use of a stereo heartbeat also serves a ludic function: the persistently looping audio provides a metronome that guides the player through a battery of timing-based puzzles, a mechanic used numerous times throughout Inside (Arnold, 2018).


INSIDE - Heartbeat 5:30


The game often makes use of spatialization to define a sense of contrast. An example occurs during an early puzzle in which the player must lure some chicks into a thresher in order to knock a piece of hay off a wooden beam. Small environmental details such as the chicks are mono point sources that move around the stereo field, while the room tone and reverbs occupy a much wider space.


INSIDE - Barn Machine Puzzle
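A minimal sketch of the underlying technique, assuming a standard equal-power pan law (a generic illustration rather than Playdead's actual implementation):

```python
import numpy as np

def equal_power_pan(mono, pan):
    """Pan a mono point source across the stereo field.
    pan runs from -1 (hard left) to +1 (hard right); equal-power
    gains keep perceived loudness constant as the source moves."""
    angle = (pan + 1.0) * np.pi / 4.0  # map [-1, 1] -> [0, pi/2]
    return np.stack([mono * np.cos(angle), mono * np.sin(angle)], axis=-1)

chick = np.random.randn(44100)        # stand-in for a chick chirp
moving = equal_power_pan(chick, 0.8)  # point source, well to the right
print(moving.shape)                   # (44100, 2)
```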


In fact, hard panning of sound sources is never used trivially and is often tied to a puzzle

mechanic. This further enhances the synchronisation between gameplay mechanics and

narrative, greatly deepening the sense of immersion (Ash, 2016).


Additionally, the game rarely uses non-diegetic sound, so when musical stingers are played, composer and sound designer Martin Stig Andersen makes highly effective use of the stereo field; the stingers take on an almost architectural function, defining the space the boy is inhabiting. A great example occurs not long after the boy discovers an underwater submersible. Under the water, the audio is narrowed significantly in the mix, giving a claustrophobic feel to the scene. Upon breaking through a wooden barrier, a gigantic pad sound erupts from nowhere as the wooden planks explode and sink slowly into a colossal underwater area. This sudden switch between mono and stereo powerfully conveys a sense of space.


INSIDE - Submarine Section


These sorts of panning and mix decisions are too numerous to describe in full; throughout the game, the localization of sounds in the stereo field is purposeful. One notable example is the sonification and panning of the moving light sources found in a number of locations. It has already been mentioned that Andersen believed a surround mix would actually be detrimental to player immersion, and the way in which he makes full and often highly calculated use of the stereo field to convey relevant ludic information without breaking player immersion presents a masterclass in meaningful game audio design.


Case Study 2 - 7.1 Surround: Alien: Isolation (Creative Assembly)



Alien: Isolation is a survival-horror FPS set 15 years after the events of Ridley Scott's original film Alien. According to sound designers from Creative Assembly, audio was intended as a core aspect of the gameplay from the project's inception. Designer Sam Cooper stated that the entire soundtrack was designed to be “contextually emotive throughout” (Ramsay, 2015).


Based on a critical listening session it is possible to detect the use of a number of

psychoacoustic phenomena including conditioning, signal listening, the cocktail party effect

and auditory pareidolia. These are combined in various ways to manipulate the player into

undergoing an extremely tense, immersive experience.


Auditory Pareidolia

More commonly understood as a visual phenomenon, pareidolia is the perception of patterns in randomness where none exist, for example seeing faces in inanimate objects or hearing obscured messages in music (Jaekl, 2017). The effect is used deliberately in the game, achieved by subtly layering alien vocalizations with the sounds of closing doors or corridor groans (Broomhall, 2015). These sounds are highly localized, allowing the player to accurately judge their distance and location while their ambiguity is maintained.


Signal Listening

Signal listening occurs when the brain is trained on an expected sound event (Farnell, 2010). The sheer frequency of large, ambiguous sounds, such as the aforementioned corridor groans and stochastic bursts of environmental sound, conditions the player to be constantly vigilant. Combined with the effects of pareidolia, whether intentionally implemented or not, this induces a constant state of anxiety in the player.


The Cocktail Party Effect

An area where the game truly shines is the underlying system used to power emergent moments of tension. This comes in the form of a dynamic system designed to replicate the ability of the ear to filter out unwanted or unexpected sounds (Broomhall, 2015). An example occurs during moments when the player is forced to hide from the Xenomorph: real-time game data is used to drive RTPCs within Wwise, highlighting the player foley and the alien while attenuating ambient sounds, bringing the tension into sharp focus (Broomhall, 2015).
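Conceptually, the system resembles a parameter-driven crossfade between an ambience bus and a "focus" bus. A toy sketch follows; the parameter names and curve are invented for illustration, and the real system lives in Wwise RTPCs rather than code like this:

```python
def tension_mix(alien_distance, max_distance=30.0):
    """Map the alien's proximity to a 0..1 'tension' value and derive
    bus gains from it: as tension rises, ambience is attenuated and
    player foley / alien sounds are pushed forward, mimicking the
    ear's ability to focus on a threatening source."""
    tension = max(0.0, min(1.0, 1.0 - alien_distance / max_distance))
    ambience_gain = 1.0 - 0.8 * tension  # duck ambience by up to ~-14 dB
    focus_gain = 0.5 + 0.5 * tension     # boost foley and alien layers
    return ambience_gain, focus_gain

print(tension_mix(25.0))  # alien far away: ambience mostly untouched
print(tension_mix(2.0))   # alien close: ambience ducked, focus boosted
```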

It was a stated intention of the creators of Alien: Isolation to wield diverse psychoacoustic phenomena in service of inducing terror in players, and in this regard they were successful. The use of responsive audio systems to replicate these low-level psychophysiological effects is a field of audio design still in its relative infancy. However, even six years after its release, Alien: Isolation sets an extremely high bar for spatialization in game audio.


Case Study 3 - Binaural: Half-Life: Alyx (Valve)



Valve's flagship VR game, Half-Life: Alyx, utilizes Valve's Steam Audio SDK (Steam DB Team, 2020), representing, alongside Oculus Audio and Google Resonance, a popular resurrection of two long-neglected technologies: Ambisonics and HRTF-filtered positional audio. According to Dave Feise, a sound designer at Valve, one of the biggest challenges was getting sonic elements from Half-Life and Half-Life 2 to fit into the VR world in a realistic manner. Many of the sounds were simple in terms of frequency content and as such lacked a sense of localization; in order to spatialize better with HRTF, the sounds needed to be “dirtied up” by adding more frequency content (Walden, 2020).
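As a crude illustration of what "dirtying up" might mean in practice (my own sketch, not Valve's actual process), broadening a tonal sound's spectrum gives the HRTF's direction-dependent peaks and notches something to act on:

```python
import numpy as np

fs = 44100
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)  # spectrally simple: hard to localize

# Add gentle saturation (harmonics) and a touch of broadband noise so the
# HRTF's high-frequency cues become audible once the sound is spatialized.
dirtied = np.tanh(2.0 * tone) + 0.02 * np.random.randn(fs)
dirtied /= np.max(np.abs(dirtied))  # normalize back to full scale
```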


According to designer Roland Shaw, two further creative hurdles were creating a sense of scale and a sense of detail (Walden, 2020). Scale works well with large set-piece sounds, for example near the beginning of the game when a “Strider” walks across the rooftops, destroying masonry as it makes its way through City 17. It gives an impressive sense of scale as it approaches and attenuates very realistically as it moves further away.


Half-Life Alyx - Construction Strider 14:54


The same cannot be said for the minor handling details (of which there are many). The sounds of handling environment-based objects appear to lack a sense of physicality; there is no heft to them as there would be in the real world. Due to synchresis, to borrow film-sound terminology (Chion, 1994), it is impossible to decouple the sound from its source: the box does what you would broadly expect a box to do when you pick it up and throw it, you know what a box is supposed to sound like, and what you hear approximates that sound. But from the author's perspective, the immersion is broken by the fact that it is a procedurally generated approximation of the sound of a box. This effect relates to Gestalt principles of expectancy (McClean, 2005), indicating potential future avenues of research into the psychological relationships between sound and vision in VR environments.


Realism-enhancing aspects of even this apparently simple example could be quite computationally expensive: for example, recreating the unique properties of the reverb emanating from within the cardboard box, or impact resonances that scale with the size and material composition of the impacted medium. This is already achievable with more realistic procedural models or high-resolution audio ray tracing (NVIDIA, 2019), and both technologies are gaining traction as computational power improves. Alternatively, simple EQ and saturation on the physics-based sounds could be used, with a priority assigned to these effects so that they are dialled down except during idle scenes or moments where little action or music is occurring. Ultimately though, novelty aside, these sorts of considerations are easily overlooked: why spend tens or hundreds of hours designing a system only a small percentage of players will appreciate or experience?
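For what it's worth, the budgeting half of that idea is cheap to sketch (everything here, names and thresholds included, is hypothetical):

```python
def physics_fx_level(active_voices, music_playing, budget_voices=32):
    """Return a 0..1 'effort' level for expensive physics-sound DSP.
    Quiet, idle scenes get full-quality resonances and reverb; busy or
    music-heavy moments dial the extra processing down, where it would
    likely be masked anyway."""
    load = active_voices / budget_voices
    effort = max(0.0, 1.0 - load)
    if music_playing:
        effort *= 0.5
    return effort

print(physics_fx_level(4, music_playing=False))   # idle scene: 0.875
print(physics_fx_level(28, music_playing=True))   # busy scene: ~0.06
```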


Half-Life: Alyx Physics Interactions

Overall, the sound of Alyx is exceptional; however, it does highlight the current limitations of game audio technology. Arguably, the discontinuation of DirectSound3D in the early 2000s set truly immersive 3D audio back by almost a decade. Games such as Alyx represent a new era in truly spatialized game audio, and along with Google's Resonance, Sony's inclusion of HRTF technology in the PS5 and Oculus's Native Spatializer Plugin (ONSP), amongst others, they mark a long-awaited return of the technology to common usage.
