Materialising Contexts: Virtual soundscapes for real world exploration


This article presents the results of a study based on a group of participants’ interactions with an experimental sound installation at the National Science and Media Museum in Bradford, UK. The installation uses audio augmented reality to attach virtual sound sources to a vintage radio receiver from the museum’s collection, with a view to understanding the potential of this technology to promote exploration and engagement within museums and galleries in general. We see how contextualised and embodied interaction, along with authentic audio reproduction, can evoke personal memories associated with a museum artefact, and how, in turn, this engages interest in the acquisition of declarative knowledge. Through the adoption of a functional and theoretical aura-based model we present ways in which this could be achieved, and, overall, we demonstrate a material object’s potential role as an interface for engaging users with, and contextualising, immaterial digital audio archival content.

  1. Introduction

The project presented here attempts to directly apply audio augmented reality (AAR) as a means of promoting visitor exploration and engagement with museum artefacts and related archive material. 

Within the context of this article, audio augmented reality (AAR) is considered as a virtual audio augmentation of the physical and visual reality, or the physical artefact. In an approach similar to [1, 2, 3], a virtual audio soundscape currently replaces the ambient acoustic reality of the location, rather than mixing with it. A mixed reality experience is therefore realised through the meeting of physical artefact and virtual audio.

We show how a practice-based, research through design approach has developed a workable object detection and nomadic indoor positioning prototype that extends the capabilities of art and artefacts to advertise their presence to visitors through audio augmentation.

Additionally, we discuss the project’s potential for increasing public engagement with sound archive material, art and artefacts in relation to the system’s ability to extend the communicative potential of the museum and gallery object, and in relation to affording primacy to the sonic, rather than the visual. This is based on related contemporary sound studies literature, historical examples of augmented reality intervention within cultural institutional contexts, and the application of interactive spatialised audio.

Also described is how the project builds on some of the approaches outlined by Zimmerman & Lorenz in relation to the LISTEN system [2], namely the concept of the attractor sound, to develop an approach capable of longer range indoor positioning with a reliance on virtually no background technological infrastructure. 

  • Background

In the academic literature on museums, sound has been identified as having the potential to give exhibitions emotional power [4] and to generate multiplicity of interpretative perspective [5, 6]. The argument, in short, is that sonic exhibitions might help us to break from the truth effects of visual and textual storytelling and all of the asymmetrical power relations that they have been said to produce (especially in Foucauldian critiques of museums), opening the ground for visitors to ‘poach’ what they need from exhibitions, to borrow Boon’s paraphrasing [5] of Michel de Certeau. Museums have enthusiastically embraced the challenge of sound, identifying its potential to produce more entertaining exhibitions, most notably in order to deal with auditory subject matter as in the case of the V&A’s exhibitions ‘David Bowie Is’ and ‘Pink Floyd: Their Mortal Remains’ both of which provided a fully sound-tracked experience on headphones. Also of note is the Wellcome Collection’s less obviously crowd-pleasing 2016 exhibition ‘This is a Voice’ which used installed sound, mainly via contemporary art commissions, to tell the scientific, medical and cultural story of the human voice. This trajectory has established sound as an interpretation tactic in museums. However, there remains a live question about how to approach sound itself as an object of display.

There is also the challenge posed by the curious practice of collecting media technology and media content separately. The national sound archive is now held at the British Library, at one remove from the objects which once created and replayed recorded sound, held largely at the Science Museum and its regional branches, especially the National Science and Media Museum in Bradford. In response to the rapidly deteriorating physical state of British Library sound archive materials and others like them in regional collections, the Library has embarked on an ambitious programme of digitisation known as ‘Unlocking Our Sound Heritage,’ though there remains little sense of what public use will be made of this digital archive once it is made available. From a silenced collection of sound technology hardware to an abundant, even noisy, digital sound archive, there is at present little strategy or consensus about what might be termed ‘sonic engagement’ – the practice of engaging the public in the history of hearing, listening and sound. The question of what sonic engagement should mean and how it should be achieved in the context of museums of science and technology was taken up by the Gallery Listening Sessions project at the National Science and Media Museum.

  • Related Work

In addition to the exhibition based audio experiences outlined within the introduction of this article, there are a number of other related projects that provide useful reference points, particularly in relation to a similar applied use of AAR.

Zimmerman & Lorenz’s LISTEN system [2] provides an excellent example of the capabilities of AAR within the context of a cultural institution. The LISTEN project, which they describe as ‘an attempt to make use of the inherent everyday integration of aural and visual perception’, delivers a personalised and interactive location based audio experience based on an adaptive system model. It does this by tracking aspects of the visitor’s behaviour (which artworks have been visited, how long they were visited for, etc.) to assign the visitor a behavioural model and adjust the delivery of audio content accordingly. The LISTEN system relies on a substantial technical background infrastructure to realise this personalised and invisible technical frontend experience for the visitor, who can wander freely through the exhibition space with just a set of customised headphones. LISTEN also introduces the concept of the attractor sound, which, based on the visitor’s personalised profile model, suggests other nearby artworks that may be of interest to the visitor via spatially located audio prompts. Furthermore, LISTEN characterises many of the key differences between the usual audio guide experience and an interactive, adaptive and immersive approach. These include the dynamic delivery of spatial audio based on the listener’s movement, and the delivery of related audio content based on the listener’s proximity to an exhibit.

Like The Rough Mile [7], Sikora et al’s archaeological AAR experience [3] could be categorised as an example of transformative soundscaping, where virtual audio is used to alter, or to reframe, rather than to directly complement, the context of the locative experience. In the case of Sikora et al’s AAR experience this change of context is from rural to urban. Being an outdoor experience it relies, as does The Rough Mile, on GPS technology for determining the position of the user within the physical landscape. Values from the GPS are translated into coordinates on a virtually authored representation of this landscape based on satellite imagery, onto which virtual sound sources are placed for the user to encounter in the real world. A similar authoring approach is taken by the system presented here, though, being for indoor experiences, it relies on a custom indoor positioning approach rather than GPS for determining the position of the user in virtual and physical space.

Seidenari et al’s work on an automatic context-aware audio museum guide [8] demonstrates how a combination of both context modelling and artwork detection work together to influence the playback of audio descriptions. It also shows how the current object of the visitor’s focus is determined by a wearable camera based object recognition system. Additionally, the inclusion of speech detection within Seidenari et al’s context-aware audio guide suggests a desire for users of such systems to maintain the ability to socially interact with their co-visitors, or rather it tries to ensure that visitors can still talk to other visitors. This ability is maintained in addition to an understanding that personalisation is a key factor in enabling museums to talk with visitors, rather than talking to them.

The project presented here, where an AAR installation environment has been created using vintage radio receivers and contemporaneous archive radio broadcast recordings, has subsequently been found to be similar to one referenced by Bijsterveld [9] and presented in detail by Mortensen & Vestergaard [10] within what they term a listening exhibition, curated at the Media Museum in Odense, Denmark in 2012 and titled You are what you hear.

Through their implementation of their Exaudimus system [10] in the You are what you hear exhibition, Mortensen & Vestergaard propose a way of exhibiting and interfacing with radio heritage which has been enabled by the digitisation of analogue audio archive content by the Danish Broadcasting Corporation. Within this approach we see how, through authorship and embodied visitor interaction, the exhibition demonstrates potential as an accessible and immersive interface to the sound archive itself.

We can imagine the audible output of the two projects to be of a similar nature, given the similar context and type of physical and virtual audio artefacts used. However, the application of different technological solutions within each AAR system and the apparent absence of three-dimensional audio spatialisation within the Exaudimus system, along with differences in the material contextualisation of the audio content (listening situation versus direct augmentation of the sound artefact), mean that the issues around both authorship and user experience are very different.

Whereas the Exaudimus system [10] utilises a multiple fixed camera tracking system, which tracks different coloured lights mounted on top of the users’ headphones in order to determine the position of a specific user within the installation environment, the prototype presented here employs a single handheld mobile camera based tracking system along with Simultaneous Localisation and Mapping (SLAM) to determine the user’s physical location in space in relation to the virtual audio sources.

  • Approach and Methodology

The study employed a practice-based, research through design approach where a series of iterative prototype interactive sound installations were developed through a cyclical process of development, deployment, study, analysis and redevelopment [11]. Ethnographic studies of the deployed prototypes, in which both experts and prospective audiences were invited to participate and interact with the technologies within the installation environment, were undertaken, followed by ethnomethodological analysis [12, 13]. These study and analysis phases then informed the subsequent redevelopment phases. The experts’ and prospective audiences’ actions and interactions were observed, recorded and analysed in accordance with recognised ethnomethodological techniques, including the development of thick descriptions and a detailed understanding of the machinery of interaction [12, 13]. Additional data in relation to the participants’ experiences was obtained from post-participatory questionnaires and interviews.

  • System Description

Fig. 1 System architecture of prototype

The current prototype installation is delivered to listeners through headphones connected to a smartphone running an application authored using Unity [14], FMOD [15] and the Vuforia SDK [16] (see figure 1).

Each sound source either has an audio logic script attached to it, or is attached to an FMOD event, which is provided with the current distance and orientation values of the listener in relation to it, which it uses to control the delivery of the audio source to the listener. This includes its spatial position within the virtual soundscape, based on the listener’s orientation in relation to the virtual sound source and the real-world object, and its attenuation within the virtual soundscape, based on the listener’s distance from the virtual sound source and the real-world object. The spatial position and the attenuation of the sound source within the stereo binaural mix of the virtual soundscape are the primary audio logic parameters which all the sound sources contain in order to place them within, and construct, a convincing and viable interactive and virtual three-dimensional soundscape. Based on these orientation and distance values other audio logic events can be scripted, such as the delivery of different audio files, or sections of an audio file, based on the listener’s position in relation to the source.
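As a rough illustration of the distance and orientation values described above, the following Python sketch computes them for a single source on a flat floor plan and applies a simple linear distance roll-off. The function names, the 2D coordinate convention and the roll-off distances are assumptions made for this example only; in the prototype these values are supplied by the tracking system and shaped within Unity and FMOD, whose actual curves are not given here.

```python
import math


def listener_relative(listener_xy, listener_heading_deg, source_xy):
    """Return (distance, bearing) of a source relative to a listener.

    Assumed convention: headings are measured clockwise from the +y axis.
    The bearing is in degrees in [-180, 180): 0 means straight ahead and
    positive values mean the source is to the listener's right.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)
    # Direction of the source in the world frame, clockwise from +y.
    world_deg = math.degrees(math.atan2(dx, dy))
    # Express that direction relative to where the listener is facing.
    bearing = (world_deg - listener_heading_deg + 180.0) % 360.0 - 180.0
    return distance, bearing


def linear_gain(distance, min_dist=1.0, max_dist=5.0):
    """Hypothetical linear roll-off: full level inside min_dist, silent beyond max_dist."""
    if distance <= min_dist:
        return 1.0
    if distance >= max_dist:
        return 0.0
    return (max_dist - distance) / (max_dist - min_dist)


if __name__ == "__main__":
    # Listener at the origin facing +y, source one metre to their right.
    d, b = listener_relative((0.0, 0.0), 0.0, (1.0, 0.0))
    print(d, b, linear_gain(d))
```

The bearing would drive the spatial position of the source in the binaural mix, while the gain would drive its attenuation.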

Informed and inspired by the artwork detection project presented by Seidenari et al. [8], and a realisation of the need to employ image recognition technology as a means to develop an application that was useable from both an authoring and curatorial perspective in a variety of locations, the Vuforia SDK [16] was adopted as a means to realise this. Along with artwork recognition, the use of image recognition and tracking technology also presented opportunities for the development of an Indoor Positioning System (IPS).

The Vuforia SDK enables the development of mobile augmented reality applications that use computer vision technology to recognise and track image targets and three-dimensional objects in real-time, and is compatible with both the iOS and Android mobile application platforms.

The Vuforia Engine’s camera-based object recognition and tracking capabilities not only facilitate the recognition of the artwork and artefacts to which virtual audio sources can be associated, but also enable the implementation of an IPS where the mobile listener’s angle and distance can be determined in relation to tracked, stationary two or three-dimensional objects.

Through an authoring approach similar to the one presented in the LISTEN system by Zimmerman & Lorenz [2], where a world model is combined with a locative model, we can determine our listener’s position both in the physical and virtual environment of the experience. Additionally, the system is also capable of determining the listener’s current focus by returning the angle and distance of the listener in relation to the tracked object.

An additional and important feature of this camera based IPS is made possible through Vuforia’s Extended Tracking or Simultaneous Localisation and Mapping (SLAM) capability, delivered through either Apple’s ARKit [17] or Google’s ARCore [18], when compiled for delivery as either an iOS or Android application respectively. Vuforia’s extended tracking enables the continued recognition and estimated location of a tracked object outside of the camera’s field-of-view. This fusion based sensing technology extends our ability to determine the location of our physical objects and their associated virtual audio sources in relation to the listener’s position in space. By being able to estimate both the angle and distance of the virtual audio sources around the listener, we can deliver a virtual and interactive three-dimensional soundscape based on the listener’s physical, real-world environment.

Initial prototype designs centred around tracking the objects to which the virtual sound sources were going to be attached, and using these as reference points to determine our listener’s position and orientation, an approach that seemed natural given that these were the objects that we wanted to detect. But through the prototype development stages, once a system had been developed that demonstrated a useable degree of accuracy and reliability, and through the trials and manipulations involved in sculpting the positions and dimensions of the virtual audio sources in physical space, a ‘natural feature’ detection approach emerged. This approach involved providing the object tracking software (Vuforia) with isolated images of unique and static physical features within the experience environment, determining the listener’s position and orientation in relation to these physical features, and in turn determining the position of the user in relation to the object to be augmented with sound.
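The two-step reasoning of the natural feature approach (listener relative to a tracked feature, then listener relative to the augmented object) can be sketched as a simple change of coordinate frame. The sketch below is a 2D simplification written for illustration: the pose format, function names and example coordinates are hypothetical, and Vuforia performs the equivalent tracking in three dimensions.

```python
import math


def to_room(feature_pose, local_xy):
    """Convert a point from a tracked feature's frame into the room frame.

    feature_pose is (x, y, rotation_deg): the authored position of the
    feature in the room and its anticlockwise rotation (assumed format).
    """
    fx, fy, rotation_deg = feature_pose
    t = math.radians(rotation_deg)
    x, y = local_xy
    return (fx + x * math.cos(t) - y * math.sin(t),
            fy + x * math.sin(t) + y * math.cos(t))


def listener_to_object(feature_pose, listener_in_feature, object_in_room):
    """Return (distance, angle_deg) from the listener to the augmented object.

    The listener's position is reported relative to the tracked feature,
    while the object's position is authored in the room frame. The angle
    is measured anticlockwise from the room's +x axis.
    """
    lx, ly = to_room(feature_pose, listener_in_feature)
    dx = object_in_room[0] - lx
    dy = object_in_room[1] - ly
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))


if __name__ == "__main__":
    feature = (2.0, 0.0, 90.0)   # hypothetical feature on a wall
    listener = (1.0, 0.0)        # tracked position in the feature's frame
    radio = (2.0, 3.0)           # authored position of the object
    print(listener_to_object(feature, listener, radio))
```

Because the object itself never needs to be in view, any sufficiently distinctive static feature can serve as the reference point.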

  • Authoring and Development

Naphtali & Rodkin [19] define a core set of components required to construct an AAR system. These include: sensors, control methods, rules and conditions, and a delivery mechanism. Within this particular AAR system we can define our sensor component as being a camera, which provides real-time tracking of our listener’s position and recognition of environmental elements. Our control methods are virtual colliders: authored zones of space in the virtual environment, the position of which in the real world physical environment can be determined by our sensor component. These colliders act as triggers for our rules and conditions, which are essentially the authored logic that determines the audio content delivery. The delivery mechanism, the device with which our listener will interface with the system, comprises a smartphone and headphones, the former capable of realising our core set of system components either via an installed application, or intrinsically via its hardware, the latter capable of delivering personalised, high-fidelity, three-dimensional sound.

In an approach similar to Thielen et al. [20], the mobile application for the Listening Session study was developed using the Unity game engine, the FMOD Studio adaptive game audio authoring tool, and the augmented reality library Vuforia. An image target, in the form of a QR code, was uploaded to Vuforia, where the image feature points were extracted and stored in a database. This image target was included as a game object within the Unity scene, with another game object added as a child of this image target object to represent the virtual audio source. This child object was positioned virtually in relation to its parent image target to reflect the actual required position of the virtual sound source in our real world environment (see figure 2).

Fig. 2 On the left, we see the virtual environment during development, showing the position of the virtual audio source and its collider component in relation to the position of the tracked image. On the right, the position of the tracked image in relation to our radio object in our real world installation environment. The speakers of the radio were situated in the bottom of the main body of the radio unit.

The authored FMOD audio event was attached to this child game object, along with a collider object for triggering it. Key to this authoring approach working in relation to the designed model of spatial interaction (figure 3), and in relation to the previously discussed composite elements of an AAR system, as identified by Naphtali & Rodkin [19], is the use of collider components on both the virtual audio event triggers and on Vuforia’s ARCamera object. The addition of a Rigid Body component on the latter, combined with these collider components, renders our user’s mobile camera position within both the virtual and physical world of our AAR application much the same as a first-person perspective player within a video game and, as such, other similar game orientated authoring approaches can be adopted within FMOD. The approach of commandeering game authoring techniques, specifically collision detection, for spatial augmented experiences, is also utilised and reflected upon by Greenhalgh & Benford [21] in their model of spatial interaction for a remote teleconferencing application.
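The role of the colliders as triggers can be illustrated with a minimal sketch of the underlying logic. In the prototype the overlap test is carried out by Unity's physics engine; the class below is a hypothetical stand-in, assuming a spherical zone, that reports when the camera position enters or leaves the zone around a virtual audio source.

```python
import math


class SphereTrigger:
    """A spherical zone that reports when a tracked point enters or leaves it."""

    def __init__(self, centre, radius):
        self.centre = centre    # (x, y, z) of the virtual audio source
        self.radius = radius    # authored size of the zone, in metres
        self.inside = False     # whether the camera was inside on the last update

    def update(self, camera_pos):
        """Return 'enter', 'exit' or None for the latest camera position."""
        now_inside = math.dist(camera_pos, self.centre) <= self.radius
        event = None
        if now_inside and not self.inside:
            event = "enter"     # e.g. start the audio event
        elif not now_inside and self.inside:
            event = "exit"      # e.g. stop or fade out the audio event
        self.inside = now_inside
        return event


if __name__ == "__main__":
    zone = SphereTrigger(centre=(0.0, 0.0, 0.0), radius=2.0)
    for pos in [(5.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (3.0, 0.0, 0.0)]:
        print(pos, zone.update(pos))
```

Reporting only the transitions, rather than the inside or outside state on every frame, mirrors how a trigger is typically used to start and stop an audio event once.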

  • Spatial Interaction

The appropriation of the VR authoring technique of collision detection, through the placing of collider components around the virtual sound sources and the ARCamera object, begins to realise a model of spatial interaction with similarities to Greenhalgh & Benford’s [21] auras: spatial zones around objects that define their region of interaction with other objects. Similarly, this approach enables these objects to be aware of each other, as indicated by their position and orientation. This awareness can be used to design a model, and author a subsequent experience, that can take advantage of this information to determine a user’s current focus within the system, and also allow an object to determine if it is a current point of focus.

The focal length and width of individual virtual sound sources within the model can be designed through the dimensions of both the source’s range and its associated collider, the shape of its directivity pattern, and also through the attenuation of its signal based on the distance and angle between it and a listener. Again, this echoes [21] and the concept of the nimbus feature of an object as both a focal and advertising determiner.
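One way to picture how range and directivity together shape a source's focal area is the following sketch. The cardioid pattern, the linear distance roll-off and the default range are all assumptions chosen for illustration; the directivity and attenuation curves authored in the actual installation are not specified in this article.

```python
import math


def cardioid_gain(off_axis_deg):
    """Cardioid directivity: 1.0 on the source's axis, 0.0 directly behind it."""
    return 0.5 * (1.0 + math.cos(math.radians(off_axis_deg)))


def source_gain(distance_m, off_axis_deg, range_m=3.0):
    """Combine a linear distance roll-off with the directivity pattern.

    range_m is a hypothetical authored range; extending it lets a source
    'advertise' itself over a larger area.
    """
    if distance_m >= range_m:
        return 0.0
    return (1.0 - distance_m / range_m) * cardioid_gain(off_axis_deg)


if __name__ == "__main__":
    print(source_gain(1.5, 0.0))               # on axis, halfway out
    print(source_gain(1.5, 90.0))              # same distance, off to the side
    print(source_gain(1.5, 0.0, range_m=6.0))  # same position, extended range
```

Narrowing the directivity pattern or shortening the range tightens the focus of a source, while widening either makes it audible from more of the space.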

By adjusting the audible presence of virtual sound sources based on the listener’s proximity and orientation we can design an element of focus into the experience, where individual sound sources and objects can be identified and coherent and curatorially useful soundscapes can be composed.

Additionally, this audible presence could be manipulated, or focal range extended, in order to give specific sources priority, or to enable them to advertise their presence more vocally than other sources within the experience.

It is these points that are of particular interest as they constitute a manipulation of the usual, or expected, attributes of a physical sound source. According to the normal physics of sound, these sound sources would perhaps continue to emanate through the soundscape, with only the altered characteristics virtually attributed to them by the game engine’s audio spatialisation effects, perhaps through means of occlusion, change of environment, volume and position within the game or experience.

We therefore perhaps see here an emergence of a model for spatial audio interaction for use within applied AAR systems, where a considered compromise is brokered between audio reality and a functional and coherent application through spatial interaction.

Fig. 3 The spatial audio interaction design for the Listening Session study.

In figure 3 we see the spatial audio interaction design for the Listening Session study. In the centre is the physical vintage radio artefact, which is represented in the virtual space by the QR code on the floor below it (as shown in figure 2). There are four looped virtual archive radio broadcasts positioned around the QR code image target. These are positioned at 0°, 90°, 180° and -90°, and are indicated by areas A, B, C and D on the diagram respectively.

For the purposes of this initial study, a 1950s television and radio receiver was selected from the museum’s collection, and contemporaneous archival radio broadcast material was obtained from an online archive resource. This material included a science-fiction radio drama, a live concert hall musical performance recording, the narrated introduction to a religious music programme, and an episode from a detective drama serial. All the chosen audio content was historically and geographically accurate in relation to the chosen radio receiver from the museum’s collection. In addition to the recorded archival radio broadcast audio content, various recordings of radio static were obtained by recording the output from an out-of-tune contemporary radio receiver. Figure 4 shows a table of the included audio content and details of its attributes.

Fig. 4 Details of the audio content included in the installation.

The real world positions of these virtual archive radio broadcast transmissions are achieved within the FMOD event authoring environment by cross-fading from the background radio static sound to the appropriate archive recording when the listener is in the relevant position in relation to the tracked QR code. The cross-fading between these two audio sources is extended by 10° in each direction from its centre position, with a further 10° transitionary non-linear cross-fade to allow for a degree of comfortable and smooth transitional listening, so that small body movements do not result in sudden losses of the perceived broadcast signal. The audio sources were positioned around the radio in this fashion to promote 360° exploration of the physical artefact, and so that multiple audio sources could be tuned in to through the embodied interactions of the listener around one augmented source. The fine tuning of the cross-fade angles was a result of trial and error in the authoring process, in an attempt to create a smooth and seamless auditory experience, and to try to emulate the tuning of an analogue radio dial with bodily movement.
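The angular cross-fade described above can be sketched as follows. This is one reading of the description, not the authored FMOD curve: it assumes the broadcast is fully present within 10° of its centre angle and then fades to static over a further 10°, and it uses an equal-power curve as a stand-in for the unspecified non-linear transition. The function and parameter names are hypothetical.

```python
import math

# Centre angles of the four broadcasts around the tracked QR code.
CENTRES = {"A": 0.0, "B": 90.0, "C": 180.0, "D": -90.0}


def angular_offset(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def broadcast_mix(listener_angle_deg, centre_deg, full_deg=10.0, fade_deg=10.0):
    """Return (broadcast_gain, static_gain) for a listener's angular position."""
    offset = angular_offset(listener_angle_deg, centre_deg)
    if offset <= full_deg:
        t = 1.0                                   # fully tuned in
    elif offset >= full_deg + fade_deg:
        t = 0.0                                   # static only
    else:
        t = 1.0 - (offset - full_deg) / fade_deg  # inside the transition band
    # Equal-power curve (assumed): the combined power stays constant.
    return math.sin(t * math.pi / 2.0), math.cos(t * math.pi / 2.0)


if __name__ == "__main__":
    for angle in (90.0, 98.0, 105.0, 112.0):
        print(angle, broadcast_mix(angle, CENTRES["B"]))
```

With these assumed widths, each broadcast occupies a 40° sector in total, leaving bands of pure static between neighbouring broadcasts, much like the gaps between stations on an analogue dial.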

It is the listener’s focus, along with their position in relation to the virtual sound source, which is situated in the same physical location as the audio augmented object, that also determines the delivery of the audio content to the user. Within the context of this study, and the associated interactional model, the listener’s focus is determined by the angle of their iPhone in relation to the tracked image target. It is this, in addition to their bodily position in relation to the tracked image target, that provides a spatial interactional model that encompasses degrees of listener position, proximity and focus.

These three spatial interactional variables (position, proximity and focus) and their associated outcomes in terms of audio content delivery for the respective listener can be illustrated through a closer inspection, and a comparison, of the positions of listener 2 and listener 3 in the interaction design diagram figure 3.

In figure 3, we see Listener 2 at a position of -90° in relation to the radio object and the tracked target, and therefore currently at a position where they can hear broadcast D, in contrast to Listener 3, who is at a position of 90° in relation to the radio and therefore can currently hear broadcast B.

It is these listeners’ proximity to the radio object that determines whether they are currently within hearing range of the broadcasts located at their current positions, and the degree to which the broadcast’s signal is attenuated and mixed with background static. We can see that both Listener 2 and Listener 3 are within range of broadcasts D and B respectively, and therefore are able to hear these broadcasts, though Listener 3’s closer proximity to the object means that its signal will be less attenuated than it is for Listener 2, who is further away.

Our last interactional variable, that of focus, is also illustrated by Listener 2 and Listener 3 within figure 3. The focus variable is determined by the angle of the listener’s device in space (in this case their handheld iPhone) in relation to the position of the tracked object. We can see that Listener 2 is facing away from the radio, with it situated on their immediate right-hand side, and as a result will perceive the spatialised audio content as being emitted from their right-hand side (the direction of the radio). In contrast we see Listener 3 directly facing the radio, who, as a result, will perceive the virtual audio sources as emanating from directly in front of them.

In light of this explanation of the spatial interactional variables of position, proximity and focus, we can determine the differences in the delivery of audio content for all our listeners’ locations in figure 3. Perhaps notable here is Listener 6, who, although directly facing the radio, will hear nothing as they are well outside the range of both static and broadcast. Similarly, we see that the location of Listener 5 determines that, although they are within range of the static, with the radio directly in front of them, they are beyond the range of the broadcast.
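The combined effect of position, proximity and focus on the listeners discussed above can be summarised in a short sketch. The range values, sector width and the coarse left, right, front and behind categories are hypothetical simplifications introduced for this example; the installation itself delivers continuously spatialised and attenuated audio.

```python
# Centre angles of the four broadcasts around the radio.
CENTRES = {"A": 0.0, "B": 90.0, "C": 180.0, "D": -90.0}


def wrap(angle_deg):
    """Wrap an angle into the range [-180, 180) degrees."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def heard_content(position_deg, distance_m,
                  broadcast_range_m=2.0, static_range_m=4.0,
                  sector_half_width_deg=20.0):
    """Position and proximity decide WHAT is heard (all ranges are assumed)."""
    if distance_m > static_range_m:
        return None                      # out of range of everything
    if distance_m > broadcast_range_m:
        return "static"                  # in range of the static only
    for name, centre in CENTRES.items():
        if abs(wrap(position_deg - centre)) <= sector_half_width_deg:
            return name                  # tuned in to this broadcast
    return "static"                      # between broadcasts


def perceived_side(radio_bearing_deg):
    """Focus decides WHERE it is heard from, relative to the handheld device.

    radio_bearing_deg is the direction of the radio relative to the way the
    device is pointing (0 = straight ahead, positive = to the right).
    """
    b = wrap(radio_bearing_deg)
    if abs(b) <= 45.0:
        return "front"
    if abs(b) >= 135.0:
        return "behind"
    return "right" if b > 0 else "left"


if __name__ == "__main__":
    print(heard_content(-90.0, 1.5), perceived_side(90.0))  # radio on the right
    print(heard_content(90.0, 1.0), perceived_side(0.0))    # facing the radio
    print(heard_content(0.0, 10.0))                         # beyond all ranges
```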

Furthermore, the delivery of audio associated with the radio object to Listener 2 in figure 3 has the potential to encourage engagement with the object by tempting their focus, but also leaves them open to impressions of other potential virtual sound sources within the context of an experience with multiple audio augmented objects.

Both the real world distance and angle between these virtual sound sources and the user can be accessed as parameters within FMOD in order to author adaptive transitions in the delivery of the audio content. This is achieved in the same way a player’s character may experience virtual sound sources when exploring the virtual domain of a video game, or the way in which instrumentation within an adaptive soundtrack may be manipulated in relation to the player’s health, or the proximity of enemy characters.

  • The Study

The Gallery Listening Sessions were a set of workshops exploring the question of what ‘sonic engagement’ should mean, and how it should be achieved in the context of museums of science and technology. They invited interested parties to take a guided tour of the museum’s collection stores and take part in a small number of workshops around the question of ‘sonic engagement’. After the museum tour, attendees were also invited to take part in our AAR study.

A total of 10 attendees of NSMM’s Gallery Listening Session participated in the AAR study, and these participants were reflective of the Gallery Listening Session attendees in general: researchers, museum professionals, museum visitors and members of the public, of mixed age and gender.

Participants were handed the iPhone and instructed to wear the headphones, ensuring they were on the correct way around, and to explore the radio object. No additional information regarding what would happen, how the technology worked, or what they could expect was provided.

Due to the developmental nature of the application, participants were provided with iPhones with the mobile application already installed, and the appropriate application was either started before handing over the iPhone to the participant, or pointed out to the participant amongst the other app icons on the iPhone’s home screen.

So that social interactions between two users could be studied, participants either organised themselves into pairs, or were paired according to their availability to participate, having either completed other workshop activities or completed the required ethics paperwork and consent documentation.

Both video and audio recordings were captured of the participants just prior to, during and after their engagement with the study. Participants were also given the opportunity to provide both verbal and written feedback relating to their experience of the study subsequent to their participation. Verbal feedback was captured on the video camera and took the form of an open ended discussion. Written feedback was collected on feedback forms completed anonymously by participants as free text.

  • Findings

The written feedback was collected on feedback forms completed anonymously by all participants as free text, prompted by the question ‘How would you describe your experience with the augmented radio?’ Verbal feedback was captured on the video camera’s microphone, with participants being asked, if they were not initially forthcoming of their own accord, what they thought about the experience they had just undertaken. The bodily interactions between all pairs of participants and the radio installation were recorded on a single, wide-angle video camera that covered the interactional setting of the installation. From this view participants were recorded entering, interacting with, and leaving the setting of the installation.

In the written feedback, all but one of our ten participants described their experience as being either ‘interesting’ or ‘fascinating’. Two participants commented on the authentic ‘valve warm sound’ and the ‘period appropriate programming’, one also commenting that ‘It was interesting to have new technology used to interpret a story about an older object’ and that they would like to see this technology used throughout the museum.

Two participants made direct references to how their bodily movements were tuning the radio into the different broadcasts, likening this to their practical experiences and memories of tuning a traditional radio receiver.

There were comments made about being able to listen to individual broadcast material, as well as being able to construct or compose an individual soundscape experience from the different elements available: ‘picking up and losing the sounds’.

Additional positive references were made to the exploratory nature of the experience and its potential for being adapted as a maze, puzzle or mystery solving experience. One participant mentioned that they would have liked additional visual or textual information displayed on the phone’s screen to complement and provide information about the audio they were currently listening to. This feature was also suggested as an additional means of navigation within the experience, to visually indicate the whereabouts of specific sounds or, if the listener misses something, to provide a means by which it could be easily found again.

In relation to the verbal feedback, participants identified with the experience of using their proximity and their position in relation to the radio to find the broadcast material amongst the sound of static as being a metaphor for what it may have been like to originally tune this type of analogue radio receiver. Several participants made reference to the tuning metaphor and how it reminded them of their direct and personal experience of tuning in an analogue radio receiver. Also mentioned, again, was the ‘Faithful reproduction of the warm valve sound’.

Participants also expressed an interest in further levels of sonic engagement with the object. For example, one participant mentioned that they almost expected to hear ‘more stations when pointing the phone at the tuning dial on the radio’.

Two participants made reference to the ‘abstract’ nature of the experience, and expressed interest in having a more literal and faithful relationship between the object and the delivery of the audio content.

An initial analysis of the findings highlights a common interactional sequence, as illustrated in Fig. 5. Generally, we observe eight distinct phases of interaction with the experience: preparation, familiarisation, exploration, investigation, focussed listening, second-level focussed listening, interruption and finishing. We see how, through a process of familiarisation, our participants quickly associate their bodily movements with the receipt of the spatialised audio sources, and then begin to explore the interactional setting to see what they can find. Subsequent to this we witness our participants returning to investigate the location of some of these sources and engage in listening to them. This phase of focussed listening can sometimes result in a more attentive and engaged listening activity, observable by participants attempting to achieve a very close proximity to the location of the virtual sound source. We also see how personal space and acceptable social proximities affect the process of virtual sound exploration and investigation, and how these predefined and mutually agreed proximities become more flexible during phases of engaged listening. We will now look at each identified interactional phase in more detail.

Fig. 5 The identified different phases of interaction within the Listening Session study and their relationships to each other.

9.1 Preparation

It is envisaged that the application will eventually be made available for listeners to download on to their own devices, enabling institutions to economically deploy such experiences. As such, familiarity with, and access to, an appropriate device would be assumed, with the exception of the listening station approach discussed earlier. Although all participants automatically put on their headphones when they were ready to start, four participants needed to be reminded to put their headphones on the correct way around (essential for the correct orientation of the interactive surround sound), and two of our ten participants required instructional prompts to engage in an exploration of the space. It seems, however, that participants who had already observed the interactions of other participants required no such instructions.

9.2 Familiarisation

This phase of familiarisation is largely distinguishable by the various lateral movements our participants made. This seems to be a process of familiarisation with the association between bodily movement and the interactive positioning of the surround sound. These movements are often terminated by an acknowledging sign of appreciation, perhaps confirmation that the association has been recognised and understood.

These lateral movements were observed being performed in a variety of different ways. Some participants swayed from side-to-side with their device held in alignment with their body and head. One participant waved their device in a lateral motion within a few moments of starting the experience, and kept their body stationary whilst doing so. Another participant rotated their upper body in a lateral motion, and therefore also the device they were holding.

During this phase of familiarisation a detachment of the focal gaze from the screen of the device was also observed. In other words, the participant, through their particular process of positional familiarisation, was observing the physical object directly, rather than secondarily through the screen of the device.

This initial process of familiarisation of embodied interactions with spatialised audio via repeated lateral movement is consistent with . Also consistent with the Audio Torch project [x], is the way in which it is capable of achieving a very quick link between hand and ear, a link that, in most part, remains unbroken and which can be observed by participants keeping their head aligned with the orientation of the device in their hand for the duration of the experience.

9.3 Exploration

After the brief familiarisation stage described above, participants were observed walking around the radio a full 360˚, often pausing briefly at the locations of the audio signals, as indicated in figure 4. The direction of exploration, clockwise or counter-clockwise, was usually determined by the first participant to start moving around the object. Equally, the length of the participants’ pauses at the locations of the audio signals was often determined by one participant resuming their exploration around the radio and prompting the other to resume theirs. This behaviour leads to the two members of a pair exploring adjacent sound source locations: as one member begins to travel to the location of the next broadcast, so does the other.

This type of exploratory behaviour is observed amongst all our participant pairs, though there are some occasional exceptions. These exceptions appear to take place either when one of the participants has become engaged in the next phase of investigatory interaction, or if the participants appear to have a greater degree of social familiarity with each other, which can be indicated by an observed acknowledgment of each other and a sharing of an appreciation of the experience.

9.4 Investigation

Within this phase we see members returning to the locations of the audio broadcasts that they identified during their exploratory phase to investigate them further. We also begin to see exploratory interpretations of the smartphone device as an interface to the audio content. These interpretations take on a variety of styles. One participant holds their device aloft in an antenna-type fashion, directly reflecting the subject of both the virtual and the physical; another uses their device as a virtual microphone, moving it towards points of interest around the artefact. Others listen through the window of the screen, or rather, observe the radio through the screen of the device whilst listening through their headphones. During this phase of interactional activity we also observe participants sharing the same audio sources and interacting with the installation in much closer proximity to each other.

9.5 Focussed Listening

The investigation phase, where our members revisit the virtual audio broadcasts they identified within their exploratory phase, quickly develops into focussed listening, discernible by the participant remaining stationary for a prolonged period for the first time since beginning to interact with the experience. Also evident within this interactional phase is a disassociation from the physical object itself, with participants being observed closing their eyes or seemingly focussing on other more distant objects whilst they concentrate on the audio content. Despite this apparent disassociation from the object whilst engaged in these periods of focussed listening, these events initially take place at either the front or the back of the object. These are areas of distinct visual interest compared to the two rather plain wooden sides, with the exposed electronic and mechanical insides at the rear, and the TV screen and radio dials at the front. This behaviour is observed despite the location of the two audio broadcasts at the sides of the object, as shown in figure 3.

9.6 Second Level Focussed Listening

We witness, during the focussed listening phase, moments when our participants engage in listening in much closer proximity to the object, often crouching down in order to obtain a physical position very close to the centre of the virtual sound source. This happens exclusively at the front or to the rear of the object where the object’s mechanical and electrical interfaces and inner workings can be seen respectively.

9.7 Interruption and finishing

The interruption of a participant’s activities, which often led to them finishing their interaction, resulted from the other member of the pair deciding that they had finished.

Evident throughout the study, in all but one of our five pairs of participants, the end of participation is initiated by one participant removing their headphones, which prompts the other to do the same, even though the participants never started at exactly the same time.

In the one event in which this did not happen, the other participant was engaged in second level focussed listening.

Our participants, on average, spent 3’17” exploring the installation. The total length of non-looping audio available to listen to was 6’20” (excluding the looped background static recording). Therefore, if we assume that none of our participants listened to the same piece of audio more than once, we can say that on average our participants listened to 51% of the available audio broadcast material. Only one of our participants reported a potential fault with the system.
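The estimate above can be checked directly from the two reported durations. The following is a minimal sketch of that arithmetic only; the function and variable names are illustrative and are not part of the study's tooling, and it relies on the same assumption that no participant heard the same audio twice.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds duration into whole seconds."""
    return minutes * 60 + seconds


# Durations as reported in the text (rounded to the nearest second).
mean_exploration = to_seconds(3, 17)   # average time spent exploring
available_audio = to_seconds(6, 20)    # total non-looping broadcast audio

# Upper bound on the share of material heard, assuming no repeats.
share_heard = mean_exploration / available_audio
print(f"{share_heard:.1%}")
```

With the rounded durations as reported, the ratio comes out at a little over one half, in line with the figure quoted above; any small difference is presumably down to rounding of the averaged times.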

  10. Discussion

10.1 Serendipity Versus Declarative Knowledge?

Throughout their description and analysis of their deployment of the Exaudimus system, Mortensen & Vestergaard [10] reiterate that their interest lies in the creation of serendipitous moments of engagement, rather than assisting the listener in the collection of declarative knowledge on the subject matter. As maintained by Truax [22] and Mortensen & Vestergaard [10], such serendipitous encounters have the ability to realise engaging cultural experiences and also have the potential to extend interest in the exhibition subject matter beyond the duration of the exhibition.

Such serendipitous and explorative expeditions could be likened to Debord’s theory of the dérive [23], a détournement where one is concerned with the potential points of departure, rather than a specific destination. Mortensen & Vestergaard [10] make reference to this type of take-away chance encounter, or recontextualisation of the familiar or seemingly mundane, that acts as a catalyst for extended engagement.

Evidenced within the findings of the study presented here, as within the study conducted by Mortensen & Vestergaard [10], it is the role of personal memories, and the triggering of them, that plays a crucial role in realising these moments of serendipity which, in turn, result in moments of engaged exploration. We witness here, as we do in [10], that virtual sound sources, when combined with physical artefacts to realise an AAR experience, have a clear and powerful ability to produce such engaging experiences. Truax [22] explicitly attributes this phenomenon to sound’s ability to create relationships between listeners and their environment, combined with a relationship between embodied interaction and embodied cognition, the idea that bodily movement influences our process of acquiring knowledge and understanding.

Truax’s suggestion [22] also invokes Bull’s [24] observations on the use of personal portable audio systems, through which users have been augmenting their environments for decades, and perhaps points towards the importance of nomadic agency within the system, where listeners remain free to explore their own relationships between virtual sound, the physical environment and its contents.

Though, evidently, we should not dismiss the ability of serendipitous experiences to increase engagement, awareness and understanding of subject matter on their own, our identified phases of focussed listening, also observed by Montan [25], offer opportunities to create moments within the experience when declarative knowledge could be imparted.

With the Exaudimus system, Mortensen & Vestergaard allude somewhat to an either/or experience, but we can, perhaps, have our cake and eat it. By initially engaging listeners with chance serendipitous encounters we could draw them into phases of focussed listening during which declarative knowledge can be imparted. The key question is how we can be sure of, or at least maximise the chances of, these initial serendipitous encounters taking place.

10.2 The Functional and Contextual Auras

We discussed previously the concept of the aura in terms of its functional role within the model of spatial interaction, namely its role in determining how individual sound sources within the soundscape communicate their presence to the listener at a systematic level [21]. But we can also think of an aura, perhaps in the more traditional sense of something having an aura, as the perceived meaning of a specific object or location.
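The functional reading of the aura can be illustrated with a short sketch. It is only one plausible interpretation, assuming a circular aura around each sound source on a flat floor plan; the class, function, source names, positions and radii below are all hypothetical and do not describe the prototype's actual implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class SoundSource:
    name: str
    x: float            # position on the floor plan, in metres (hypothetical)
    y: float
    aura_radius: float  # extent within which the source makes itself heard


def audible_sources(sources, listener_x, listener_y):
    """Return the names of sources whose aura contains the listener."""
    heard = []
    for source in sources:
        distance = math.hypot(source.x - listener_x, source.y - listener_y)
        if distance <= source.aura_radius:
            heard.append(source.name)
    return heard


# Two illustrative broadcasts placed either side of an object at the origin.
sources = [
    SoundSource("broadcast_left", -1.0, 0.0, 1.5),
    SoundSource("broadcast_right", 1.0, 0.0, 1.5),
]

print(audible_sources(sources, -1.5, 0.0))  # listener standing to the left
print(audible_sources(sources, 0.0, 0.0))   # listener at the object itself
```

In this simplified form the aura acts purely as a gate on presence; a fuller model would presumably also shape level and spatialisation once the listener is inside it.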

For MacIntyre, Bolter & Gandy [26], the aura of an object or location is a combination of its cultural and personal significance. But in order for a beholder to understand an object’s cultural significance, they need to have somehow acquired that knowledge about the object or place; a field is perhaps just a field, until you know that it is, in fact, a battlefield.

With this view, giving an object the ability to communicate information about itself to the listener gives it the ability to extend its aura, or perhaps its ability to have a perceived aura, by communicating its cultural significance.

This is perhaps of interest and importance when we start thinking about how to capitalise on serendipitous encounters within the system, and imparting declarative knowledge through them. Within the structure of a dual or multi-focal model, we could think of serendipitous encounters with an object’s aura as being those of personal significance, and the subsequent encounter being that of obtaining declarative knowledge of an object’s cultural significance.

Participants also expressed an interest in further levels of sonic engagement with the object, which points towards a possible macro and micro focus approach within the design of the spatial audio interactional model. For example, one participant mentioned that they almost expected to hear ‘more stations when pointing the phone at the tuning dial on the radio’. These reports seem reflective of the findings of Montan [25] in relation to the design of different ‘acoustical zones’ within the context of a single AAR subject for increasing immersion and engagement, where there was a reported impression of entering into the subject when moving from one zone to another. Such an approach is also consistent with the work of artist Vicky Browne, where the elements of Browne’s sound installation Cosmic Noise are described by Kelly [27] as having ‘micro-ecologies’, where the work can be listened to as a whole, or attention can be focused on certain elements to reveal ‘specific and often minute sounds’.

The use of embodied interaction as a metaphor for tuning into the radio, along with historical audio realism may provide another approach to answering the question of how to evoke personal and emotional relationships with objects. Participants mentioned how it reminded them of their direct and personal experience of tuning-in an analogue radio receiver. Also mentioned was the ‘Faithful reproduction of the warm valve sound’, which may also have helped to place, and attach, the virtual to the physical and constitute an increase in engagement with the artefact that is a direct result of the audio augmented reality experience.

10.3 Artefact as interface

Furthermore, and more specifically related to the methods and rationale involved within the design decisions of Mortensen & Vestergaard’s practice-based approach [10], we see how an experimental study approach is deployed as a means of exploring the potential ways in which archival sound content could be accessed in an accessible and engaging manner. This is advocated for through the analogy of ‘an informational amusement park of the future’, within which there is an emphasis on maximising the possibility of the occurrences of serendipity (defined as unexpected discoveries) as a means of promoting awareness, engagement, reflection and inspiration through the experience and exploration of embodied interaction with sound, rather than the explicit gathering of declarative knowledge.

As with the radio-based installations presented here, this is achieved by using the body like a tuning dial on an analogue radio set, allowing the visitor to find clear signals of archival content amongst the sound of static. Bijsterveld [9] describes this as a ‘highly original framing of the exhibition sounds’ and one where ‘the exhibition space itself mimicked the technology behind the sounds that were the topic of the exhibition’. One could also argue that this act of embodied, interactive tuning constitutes a physical contextualisation of the virtual digital archive content.
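One way to picture this tuning behaviour is as a crossfade between static and broadcast driven by the listener's distance from the broadcast's location. The sketch below is an assumption made purely for illustration: it uses a standard equal-power crossfade, and the function name and both radii are hypothetical rather than taken from either installation.

```python
import math


def tuning_gains(distance, tuned_radius=0.3, static_radius=1.5):
    """Return (broadcast_gain, static_gain) for a listener distance in metres.

    Inside tuned_radius the broadcast is fully 'tuned in'; beyond
    static_radius only static is heard; in between the two are blended.
    """
    if distance <= tuned_radius:
        tuned = 1.0
    elif distance >= static_radius:
        tuned = 0.0
    else:
        tuned = (static_radius - distance) / (static_radius - tuned_radius)
    # Equal-power crossfade keeps the perceived loudness roughly constant.
    broadcast_gain = math.sin(tuned * math.pi / 2)
    static_gain = math.cos(tuned * math.pi / 2)
    return broadcast_gain, static_gain


for d in (2.0, 0.9, 0.1):
    broadcast, static = tuning_gains(d)
    print(f"{d:.1f} m  broadcast={broadcast:.2f}  static={static:.2f}")
```

Walking towards the source in this sketch gradually lifts the broadcast out of the static, which is the kind of behaviour participants described as ‘picking up and losing the sounds’.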

This physical contextualisation is extended through the construction of listening situations, where the settings of the original physical listening environments associated with the different pieces of audio content (an armchair for content programmed in the evening, a car seat for drivetime content, a bedroom for teenage content, etc.) are reconstructed within the gallery space. Again, we see an exploration into how the material can be used to promote and focus engagement with the immaterial, a mixed reality exercise in the contextualisation of the virtual with the physical.

This approach is largely justified by an understanding that learning associated with immersion is experience driven [10] and, as such, the authors anticipated the potential visitor learning outcomes to include experiencing situations, feelings and memories as opposed to hard facts. This approach seemingly bears fruit in the form of positive participant feedback in relation to awareness, interest, engagement and the evocation of memories associated with the audio archive content used within the exhibition. 

The authors admit that there is no evidence that the experience would inspire further engagement with the archive beyond the scope of the exhibition, and that there were significant problems in getting visitors to physically interact with the assembled listening situations, for example actually sitting down in the armchair. 

It was the embodied and intimate interactions with these assembled physical situations within the gallery space that were required in order to effectively trigger the playback of the associated archival audio content, and as such the issue of exhibition competence in relation to the hands-on engagement that this type of approach relied upon remains.

10.4 Hands-free, Heads-up

It is the commandeering of the smartphone as a delivery mechanism that, to a great degree, enables the potential permeable nature of the experience we discussed in the previous section. One can easily imagine the impracticality of loaning multiple VR headsets, or other less ubiquitous or more expensive pieces of equipment, to visitors as they wander around either inside or outside the confines of the gallery or museum space. As such, it is the smartphone that also allows for an accessible experience, one which is facilitated by both ease of deployment from the point of view of the institution, and ease of use from the point of view of the user.

The current prototype AAR mobile application deployed in the studies presented here can be successfully installed on most Android and iOS smartphones of up to six years in age, with a set of stereo headphones being the only additional piece of equipment required.

Although we should not assume that every visitor would be carrying a compatible smartphone, the largely prevalent nature of this enabling technology, along with the lack of reliance on the type of background infrastructures evident in some of the previous gallery based AAR experiences that we discussed earlier [1, 2, 3], affords a greater amount of inclusivity and accessibility for both the visiting public and the institution within which it is deployed. 

The need for only relatively dated technology, in smartphone terms, permits a somewhat inexpensive deployment at an institutional level should an even greater level of accessibility be desired through the use of listening stations, similar to those in The Damm Project [9]. It is envisaged that, although forgoing the true nomadic nature of the unfettered experience that has been described, the installation of listening stations in a gallery space could provide 360 degree scenes of the soundscape from a stationary position from which individual audible components of the virtual soundscape could be discerned along with their physical counterparts.

By primarily concerning ourselves with an audible experience over a visual one, we not only reduce the need for more technologically advanced and potentially expensive resources, we also make such an experience accessible to more people and more deployable, in economic terms, for the institution. The deployment of such AAR experiences within cultural institutions has the potential to bypass situations such as the queue of visitors waiting to have a go on a limited number of VR headsets, or other experiences that render themselves prohibitively exclusive through the use of technology that is not as ubiquitous within the public domain or, in short, where visitors’ own technology is not capable of supporting, or not considered within, the deployment of interactive and immersive experiences within cultural institutions.

  11. Conclusions

By way of a conclusion we witness a positive reaction to users having agency over the composition of their own virtual soundscape to accompany their gallery or museum visit, and to the application of new and familiar technology to explore old technology and its related subject matter. 

We see evidence of how contextualised and embodied interaction, along with authentic audio reproduction, can evoke personal memories associated with a museum artefact, and we see participants express interest in the acquisition of declarative knowledge, based on these initial engagements with the subject matter. Additionally, there appear practical ways in which this can be achieved through a dual, or multi-layered focal structure, through the adoption of an aura-based functional and theoretical model, and our observations suggest that users would engage with such an approach. 

Overall, we demonstrate the potential of the physical object’s role as an interface for engaging users with associated virtual audio content. We also demonstrate an initial prototype system that has the capability to impart declarative knowledge to users by exploiting initial serendipitous encounters, and present ways in which this specific capability could be extended and refined.

Also of interest is how users become less aware of each other’s presence as they become more engaged with the audio content and, as a result, initial social constraints become more flexible as participants become more engaged.

Finally, by assigning real world objects specific virtual spatialised audio sources we demonstrate how these objects can communicate and engage beyond their traditional confines of line-of-sight, and how visitors can be drawn to engage further, beyond the realm of their original encounter.

  12. Further Work

Though, arguably, this approach performed well in teasing out initial findings to inform prototype development, and also as a means through which engagement could be initialised, it stands mainly as a catalyst through which a secondary level of exploratory engagement could be initiated. Further work based on these findings could include ways in which the preparation and familiarisation phases of interaction could be combined through interactive spatialised audio instructions, potentially shortening the user’s route to engagement. It could also explore how a user’s personal preferences and data could be combined with audio metadata in order to promote moments of personal attachment and memory by generating personalised auras within the experience, thus helping the dissemination of declarative knowledge.


[1] Bederson, Benjamin B. Audio Augmented Reality: A Prototype Automated Tour Guide. Proceedings of CHI ‘95, May 1995, pp. 210-211. 

[2] Zimmermann, A., & Lorenz, A. (2008). LISTEN: a user-adaptive audio-augmented museum guide. User Modeling and User-Adapted Interaction, 18(5), 389–416.

[3] Sikora, M., Russo, M., Derek, J., & Jurčević, A. (2018). Soundscape of an Archaeological Site Recreated with Audio Augmented Reality. ACM Transactions on Multimedia Computing, Communications, and Applications, 14(3), 1–22.

[4] Bubaris, N. (2014). Sound in museums – museums in sound. Museum Management and Curatorship, 29(4), 391–402.

[5] Boon, T (2014). ‘Music for Spaces: Music for Space – An Argument for Sound as a Component of Museum Experience’ Journal of Sonic Studies, 8.

[6] Hutchison, M and Collins, L (2009). ‘Translations: Experiments in Dialogic Representation of Cultural Diversity in Three Museum Sound Installations’, Museum and Society, 7/2, pp 92–109.

[7] Hazzard, A., Spence, J., Greenhalgh, C., & McGrath, S. (2017). The Rough Mile (pp. 1–8). Presented at the 12th International Audio Mostly Conference, New York, New York, USA: ACM Press.

[8] Seidenari, L., Baecchi, C., Uricchio, T., Ferracani, A., Bertini, M., & Bimbo, A. D. (2017). Deep Artwork Detection and Retrieval for Automatic Context-Aware Audio Guides. ACM Transactions on Multimedia Computing, Communications, and Applications, 13(3s), 1–21.

[9] Bijsterveld, K. (2015). Ears-on Exhibitions: Sound in the History Museum. The Public Historian, 37(4), 73–90.

[10] Mortensen, C. H., & Vestergaard, V. (2014). Embodied Tuning: Interfacing Danish Radio Heritage. Journal of Interactive Humanities, 1(1), 23–36.

[11] Crabtree, A., Rouncefield, M. and Tolmie, P. (2012) Doing Design Ethnography, Springer.

[12] Garfinkel, H. (1967). Studies in Ethnomethodology. Prentice-Hall.

[13] Benford, S., Adams, M., Tandavanitj, N., Row Farr, J., Greenhalgh, C., Crabtree, A., et al. (2013). Performance-Led Research in the Wild. ACM Transactions on Computer-Human Interaction, 20(3), 1–22.

[14] Unity Technologies (2019). Unity for all. Retrieved August 1, 2019 from:

[15] Firelight Technologies Pty Ltd. (2019). FMOD: Imagine, create, be heard. Retrieved August 1, 2019 from:

[16] PTC (2019). Vuforia Engine 8.1. Retrieved August 1, 2019 from:

[17] Apple Inc. (2019). ARKit. Retrieved August 1, 2019 from:

[18] Google. (2019). ARCore: Build the future. Retrieved August 1, 2019 from:

[19] Naphtali, D. & Rodkin, R. (2020). Audio Augmented Reality for Interactive Soundwalks, Sound Art and Music Delivery. In: Filimowicz, M. (ed.). Foundation in Sound Design for Interactive Media: A Multidisciplinary Approach. Routledge. New York.

[20] Thielen, E., Letellier, J., Sieck, J., & Thoma, A. (2018). Bringing a virtual string quartet to life (pp. 1–4). Presented at the Second African Conference for Human Computer Interaction, New York, New York, USA: ACM Press.

[21] Greenhalgh, C. & Benford, S. (1995). MASSIVE: A Collaborative Virtual Environment for Teleconferencing. ACM Transactions on Computer-Human Interaction, Vol 2, No 3, September 1995, Pages 239-261.

[22] Truax, B. (2012). Sound, Listening and Place: The aesthetic dilemma. Organised Sound, 17(3), 193–201.

[23] Debord, Guy-Ernest. (1958) Theory of the Dérive. Internationale Situationniste #2. Oakland: Bureau of Public Secrets.

[24] Bull, M. 2000. Sounding Out the City: Personal Stereos and the Management of Everyday Life. Oxford and New York: Berg. 

[25] Montan, N. (2002). AAR: An Audio Augmented Reality System. MA Thesis. KTH, Royal Institute of Technology, Stockholm.

[26] MacIntyre, B., Bolter, J.D., & Gandy, M. (2004). Presence and the Aura of Meaningful Places. 7th Annual International Workshop on Presence (PRESENCE 2004), Polytechnic University of Valencia, Valencia, Spain, 13-15 October 2004. 

[27] Kelly, C. (2019). Material Sound. Murray Art Museum, Albury.


The author is supported by the Horizon Centre for Doctoral Training at the University of Nottingham (RCUK Grant No. EP/L015463/1). The author would also like to thank Annie Jamieson and the rest of the staff at the National Science and Media Museum, Bradford UK.