An immersive virtual multimodal environment for cognitive studies

Amandine Afonso, Alan Blum, Christian Jacquemin, Brian FG Katz

Objective

This study, initiated in 2004, presents the combined efforts of three research groups toward the investigation of a cognitive issue through the development and implementation of a general purpose VR environment that incorporates a high quality virtual 3D audio interface.

This document presents the development of the experimental platform used. The platform is based around a generic toolkit and companion architecture that has been developed and used for modeling the environment and interface in a cohesive manner. Details for generating an immersive multimodal experimental environment for this platform are also included.

The psychological aspects of the study concern the mechanisms involved in spatial cognition, in particular how a verbal description of an environment or the active exploration of that environment affects the building of a mental spatial representation. A further aim is to investigate the role of vision by observing whether participants without vision (blind from birth, late blind, or blindfolded sighted individuals) can benefit from these two learning modalities.

General Description

In recent years, virtual reality (VR) techniques have developed considerably in the domain of computer science. These techniques allow users to interact as naturally as possible with data made available to their sensory experience, visual in the majority of applications, but also auditory and kinesthetic in emerging research contexts [1][2]. Including this resource in current paradigms offers great benefits for studies of human behavior. The special value of VR environments is that they allow the investigation of human behavior with people immersed in realistic, controlled, interactive contexts, without the physical constraints and costs of building such real contexts. VR is an invaluable tool for creating situations that make the study of human behavior easier by expanding the scope of experimental research [3][4].

The work reported here aims at illustrating the capacity of VR as a research tool for the analysis of human cognition and behavior in complex environments. The use of an audio VR platform was of special relevance with respect to the study's purpose of exposing human participants to auditory scenes containing complex sets of spatially organized data, and to allow interaction with elements of the scenes. The immersive character of the VR experience gives participants the sense that the auditory objects they perceive are present in the room and that despite their movements they are within a stable and consistent spatial environment. Through this perceptive stability, and the flexibility of dynamic scene interactions, the VR environment can greatly aid studies in the domain of cognitive/behavioral sciences, in particular in the study of the loop connecting perception, cognition, and action [5].

Previous experiments using 3D audio for sound localization or for building audio interfaces for blind people rely on platforms dedicated to spatialized audio rendering that have little or no graphic output. Whether with real sound sources [6] or within a virtual environment [7], user tracking is necessary and a minimal geometrical model of the scene must be updated in real time. Even though the same base components exist here (tracking and scene representation), our approach to virtual audio modeling is different because it relies on a tool with full capabilities for multimedia 3D effects, behavioral modeling, and interaction: Virtual Choreographer (VirChor) [8]. This choice was motivated by the complexity of our experimental setup, which requires the experimenter to accurately monitor the location of the participant and of the active audio sources. In addition, the complexity of the protocol and its progressive definition through real experiments called for an open scripting language that could be easily modified on-site. In this section, we present both the architectural and software designs for rendering spatialized sound and graphics as well as controlling the interface behavior in response to participant and experimenter inputs.

System Overview

The fundamental requirement of the experiment is an accurate representation of a sonic scene within which the participant can navigate and interact. The majority of audio components in virtual/augmented reality environments are relatively limited in their quality and resolution. Many systems still implement only stereo panning of sound sources. Most current implementations of 3D graphical rendering used in games and virtual or augmented environments for collaborative work would benefit from a richer sonic rendering. We propose here a distributed system in which the scene graph and the audio rendering are handled separately by well-suited software.

While the study is based on a purely auditory environment, that environment has an inherent geometry. The various sound sources can only be correctly spatialized within a geometrical framework, so the physical space must be represented. The positions of the participant and sound sources must be constantly updated within the 3D geometry in order to maintain the correct relative locations of the sound sources with respect to the participant within the environment.

The multimedia scene in which the experiment takes place consists of a room (both physical and virtual) in which virtual sound objects are located. The participant is equipped with a head-tracker device, mounted on a pair of stereophonic headphones, as well as a handheld tracked pointing device. For real-time experimental control, the experimenter is aided by visual feedback of the entire scene that mirrors the current status of the internal representation: the active sound sources and their locations, and the location of the participant in the virtual scene. The experimenter controls the course of the experiment and can constantly verify the status of the system on a computer display, an example of which is shown in Figure 1, with a schematic overview of the experiment test room. The left panel shows the current subjective view of the participant and the right panel presents an overview of the room. The scene consists primarily of the six sound sources (represented by numbered spheres, where red spheres indicate that the sound source is active), the participant (head), and the pointing device (arrow). The reference scene consists of the spheres located on a circle, also visible in the display. Even though the participant only experiences the auditory component of the model, the actual experimental room has been modeled (and photo texture mapped). This allows the experimenter to better interpret the participant's position and orientation in the scene. In addition, collision detection is used to warn participants (through an auditory alert) if they approach the boundaries of the physical room or the limits of the tracking system.

Figure 1. Screenshot of the combined subjective and overview displays. Schematic view of the actual experimental room (shown on a meter scale).

The general flow of information within the system is shown in Figure 2. The six degree-of-freedom (6 DOF) tracking system is polled by Max/MSP [9] for the current position of the participant. The positional information is then passed to the modeler, VirChor. After integrating the external positional updates, experimenter controls, and internal interactions, VirChor sends updated relative source positions (spherical coordinates in the participant's reference frame according to the subjective view in Figure 1) and audio controls to Max/MSP. These parameters are then used to control the audio rendering. The spatialized audio is finally delivered to the participant via headphones.
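
The conversion from tracked room coordinates to source positions in the participant's reference frame can be sketched as follows. This is a minimal illustration rather than VirChor's actual computation: the function name, the axis conventions, and the restriction to head yaw only (pitch and roll are ignored) are assumptions made for this example.

```python
import math

def relative_spherical(head_pos, head_yaw_deg, source_pos):
    """Return (azimuth_deg, elevation_deg, distance_m) of a source relative to the head.

    Assumed conventions (hypothetical, chosen for this sketch):
    room axes x and y are horizontal and z points up; yaw is measured
    counter-clockwise from the +x axis when seen from above; a positive
    azimuth means the source lies to the listener's left.
    """
    dx = source_pos[0] - head_pos[0]
    dy = source_pos[1] - head_pos[1]
    dz = source_pos[2] - head_pos[2]
    horizontal = math.hypot(dx, dy)
    distance = math.sqrt(dx * dx + dy * dy + dz * dz)
    # Bearing of the source in room coordinates, then made relative to the head yaw.
    bearing = math.degrees(math.atan2(dy, dx))
    azimuth = (bearing - head_yaw_deg + 180.0) % 360.0 - 180.0  # wrapped to [-180, 180)
    elevation = math.degrees(math.atan2(dz, horizontal))
    return azimuth, elevation, distance
```

When the participant turns toward a source, its azimuth goes to zero while its room coordinates stay fixed, which is what keeps the virtual scene stable under head movement.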

Figure 2. Overview of the architecture.

Scene Model

The key issue in designing the experiment was to provide the participant and the experimenter with a reactive interface that would implement a multi-stage scenario with orientation and localization tasks. In addition, the control of the experiment had to be restricted to a minimal set of operations in order to avoid burdening the experimenter with complex control procedures.

The experimental setup was installed in an existing room, as shown in Figure 1. The experimental room was approximately 4x6 m of which the majority was accessible by the participants. A MIDI interface, used by the experimenter to alter the current state of the experiment (via changes to the internal states of the spheres) is indicated on the figure, as well as the location of the visual feedback screen and machine room outside the experimental room. The configuration of the reference virtual sound scene, central reference point, and physical reference point (chair) used during the experiment are also shown.

VirChor uses an XML syntax for scene modeling. The scene graph structure within VirChor is based on the concept of a unique and cohesive hierarchy of scene nodes [10]. Nodes can comprise properties directed toward rendering (graphical, auditory, etc.) and behavioral scripts. Behavior within VirChor is modeled through internal message exchange between scene nodes or external communication between these elements and networked applications via UDP. For example, a distributed architecture employing UDP communication allows the graphical rendering and audio rendering to be performed on separate machines. Message reception by scene nodes can be controlled by internal node states, triggers, cascaded message transmissions, or scene node modification. These messages can be real-time (interactive with the experimenter or participant), scheduled, or mixed, as with a launched series of scheduled events. Scene control is performed through partial XML elements that define the parameter updates. The syntax for internal and external communications is identical and is derived directly from XML element syntax.
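
The idea of sending a partial XML element over UDP to update a scene node can be illustrated with the sketch below. It does not reproduce VirChor's real message vocabulary: the element name "update", the attributes "node", "name", and "value", and the default host and port are all hypothetical and chosen only for this example.

```python
import socket
import xml.etree.ElementTree as ET

def build_update(node_id, **params):
    """Build a partial XML element carrying parameter updates for one scene node.

    The tag and attribute names are hypothetical, not VirChor's actual syntax.
    """
    element = ET.Element("update", {"node": node_id})
    for name, value in params.items():
        ET.SubElement(element, "param", {"name": name, "value": str(value)})
    return ET.tostring(element, encoding="unicode")

def send_update(message, host="127.0.0.1", port=9000):
    """Send the message as a single UDP datagram (host and port are assumed values)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), (host, port))

if __name__ == "__main__":
    # Example: mark a hypothetical sphere node as active and move it.
    send_update(build_update("sphere_3", state="active", x=1.5))
```

Because UDP is connectionless, the sender does not need the receiving application to be running, which suits a setup where rendering processes on separate machines are started independently.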

The scene was made of three types of objects that all belonged to the class of Geometrical objects in VirChor: physical components (walls and floor), collision detection devices (used to create alerts to avoid wall contact), and sound components (sonic spheres). The role of collision detection devices is to respond to a sensor entering their bounding volume and to emit a trigger, here initiating an audio alert. Most of the behavioral capabilities of the scene are located on the sonic spheres. The behavior of the spheres is regulated via their internal states (controlled by the experimenter) and consists of scripts that trigger sound outputs, cascaded sphere activations, random sphere displacements, or user-controlled sphere positioning. Figure 3 shows the basic VirChor messaging architecture and an example of the definition of a sonic sphere: a geometrical textured sphere that carries sound properties. Part of the associated scripting is also provided, showing the cascading of messages based on internal states.
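
The behavior of a collision detection device, which emits a trigger when a sensor enters its bounding volume, can be sketched as follows. The class names, the rectangular warning zones along each wall, and the 0.5 m margin are assumptions for illustration; only the approximate 4 x 6 m room size comes from the description of the experimental room.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Zone:
    """Axis-aligned rectangular warning zone on the floor plan (hypothetical)."""
    name: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

def wall_zones(width=4.0, depth=6.0, margin=0.5):
    """One warning strip per wall; the margin is an assumed value."""
    return [
        Zone("west", 0.0, margin, 0.0, depth),
        Zone("east", width - margin, width, 0.0, depth),
        Zone("south", 0.0, width, 0.0, margin),
        Zone("north", 0.0, width, depth - margin, depth),
    ]

class CollisionSensor:
    """Reports a zone only at the moment the tracked position enters it."""

    def __init__(self, zones):
        self.zones = zones
        self.occupied = set()

    def update(self, x, y):
        now = {zone.name for zone in self.zones if zone.contains(x, y)}
        entered = sorted(now - self.occupied)  # newly entered zones only
        self.occupied = now
        return entered
```

Triggering on entry rather than on presence means that an alert would fire once per approach instead of on every tracker update while the participant stays near a wall.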

Figure 3. Architecture for graphical and sonic rendering and an example VirChor scene script for an audio/graphic object node.

All control messages (internal, or external from other applications and from peripherals) are logged with time stamps in order to allow for post-processing analyses, motion graph plotting, and scene replay. During replay, the entire sound scene is automatically reproduced, as all external control events, including those from the tracking system, are replayed into VirChor, which then reacts as it did during the actual experiment. Time dilation is also possible, allowing variations in replay speed.
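
Replay with time dilation amounts to re-issuing the logged messages with their inter-event delays scaled by a speed factor. The sketch below assumes a log made of (timestamp in seconds, message) pairs; this log layout and the function name are hypothetical and are not taken from VirChor.

```python
def replay_schedule(log, speed=1.0):
    """Return a list of (delay_seconds, message) pairs for replaying a log.

    Each delay is the wait before sending that message, measured from the
    previous one. A speed of 2.0 replays twice as fast, 0.5 at half speed.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    events = sorted(log, key=lambda event: event[0])  # ensure chronological order
    schedule = []
    previous = None
    for timestamp, message in events:
        delay = 0.0 if previous is None else (timestamp - previous) / speed
        schedule.append((delay, message))
        previous = timestamp
    return schedule
```

A replay loop would then sleep for each delay and send the corresponding message to the scene modeler, so that tracker updates and experimenter commands arrive in their original order.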

Sound Processing Architecture

Sound spatialization was performed using the Max/MSP environment and IRCAM's reverberation and spatialization library, Spat. A passive interface was developed that allows all audio rendering to be controlled by external communications with VirChor (see Figure 4). It is important to note that the rendering method used, binaural synthesis, is computationally intensive, with a cost that increases with the number of sources. To reduce the computational load while maintaining scene flexibility, a hierarchical audio scene structure was created that includes provisions for multi-user cooperative environments. The audio scene tree comprises three levels: room, user, and source. This concept makes use of Spat's "shared reverberation" calculation, which renders the direct sound and early reflections individually but creates a single reverberant tail. The late reverberation, which is considered homogeneous, therefore needs to be calculated only once, for a monaural mix of all active source signals. Using this scheme, all sources within a given "room" acoustic use a shared reverb. If multiple room acoustics are desired for other users, additional "rooms" must be defined.
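
The saving offered by a shared reverberant tail can be illustrated with a toy sketch. It is not Spat's implementation: the short impulse responses, the function names, and the plain time-domain convolution are assumptions chosen to keep the example self-contained. Each source gets its own direct-path filter, while the late reverberation is computed once on the monaural mix.

```python
def convolve(signal, impulse_response):
    """Plain time-domain convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def mix(signals):
    """Sample-wise sum of signals, zero-padding the shorter ones."""
    length = max(len(s) for s in signals)
    return [sum(s[n] for s in signals if n < len(s)) for n in range(length)]

def render_shared(sources, direct_irs, late_ir):
    """Render N direct paths individually and one late reverb on the mono mix."""
    direct = mix([convolve(s, ir) for s, ir in zip(sources, direct_irs)])
    late = convolve(mix(sources), late_ir)  # a single reverb for all sources
    return direct, late
```

Because convolution is linear, reverberating the mix gives the same tail as reverberating every source separately and summing the results, but with one reverberation process instead of one per source.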

Figure 4. Max/MSP UDP interface for external scene control.

The balance between direct and reverberant sound energy is a useful cue in the perception of source distance [11]. It has also been observed that reverberant energy, and especially a diffuse reverberant field, can negatively affect source localization. As this study was primarily concerned with spatially precise rendering, rather than a realistic room acoustic experience, the reverberant energy was kept limited. Omitting the room effect altogether, however, would create an "anechoic" environment, which is unfamiliar to most people. It was therefore decided to include a room effect in order to create a more realistic environment. A room effect characterized by a reverberation time of 2 s was employed. To counteract the negative effect on source localization, the direct-to-reverberant ratio was set to 10 dB at 1 m.
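
To see how a ratio fixed at 1 m relates to source distance, the sketch below assumes that the direct sound follows the inverse-distance law (about 6 dB less per doubling of distance) and that the reverberant level does not depend on distance. Both are common textbook simplifications and are not stated to be the model used by Spat in this study; the function names are hypothetical.

```python
import math

def direct_to_reverb_db(distance_m, ref_db=10.0, ref_distance_m=1.0):
    """Direct-to-reverberant ratio in dB at a given distance.

    Assumes 1/r attenuation of the direct sound and a constant reverberant level.
    """
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return ref_db - 20.0 * math.log10(distance_m / ref_distance_m)

def critical_distance_m(ref_db=10.0, ref_distance_m=1.0):
    """Distance at which direct and reverberant energy are equal (ratio of 0 dB)."""
    return ref_distance_m * 10.0 ** (ref_db / 20.0)
```

Under these assumptions, a setting of 10 dB at 1 m would keep the direct sound dominant out to roughly 3.2 m, which is of the same order as the dimensions of the experimental room.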

Binaural Synthesis

Binaural synthesis is an audio presentation technique that attempts to present spatially encoded audio directly at the ear canal of the user. Natural spatial encoding results from the filtering of sound arriving at the ears through diffraction around the torso, the head, and the complex form of the pinnae. This diffraction can be characterized by the Head Related Impulse Response (HRIR) or, equivalently, by its Fourier transform, the Head Related Transfer Function (HRTF). HRTFs contain the acoustic information, such as inter-aural time differences and complex spectral cues, used by the human auditory system to interpret the location of sound events. The principle is that sound arriving from any direction in space is coded by a specific pair of transfer functions (left and right ear). For a review of spatial hearing, one can refer to [12]. Binaural synthesis consists of filtering an audio signal with the HRTF for a given position, thus creating virtual sound sources under headphones. HRTF measurements result in a stereo filter database following a discrete spatial map. Interpolation is normally required for intermediate positions that are not in the database. More details about these techniques can be found in [13].
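
The two steps described above, filtering a signal with a left and right impulse response and interpolating between measured positions, can be sketched as follows. The tiny HRIR values used in any call are made up, the function names are hypothetical, and linear interpolation of time-domain taps is only the simplest possible scheme; practical systems typically use more careful methods.

```python
def convolve(signal, impulse_response):
    """Plain time-domain convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def interpolate_hrir(database, azimuth_deg):
    """Linearly blend the HRIR pairs of the two nearest measured azimuths.

    database maps an azimuth in degrees to a (left_ir, right_ir) pair,
    with all impulse responses of equal length.
    """
    az = azimuth_deg % 360.0
    angles = sorted(database)
    lower = max((a for a in angles if a <= az), default=angles[-1])
    upper = min((a for a in angles if a > az), default=angles[0])
    span = (upper - lower) % 360.0 or 360.0
    weight = ((az - lower) % 360.0) / span
    blended = []
    for ear in (0, 1):
        low_ir, high_ir = database[lower][ear], database[upper][ear]
        blended.append([(1.0 - weight) * a + weight * b for a, b in zip(low_ir, high_ir)])
    return blended[0], blended[1]

def binaural_render(signal, database, azimuth_deg):
    """Return (left, right) headphone signals for a mono source at one azimuth."""
    left_ir, right_ir = interpolate_hrir(database, azimuth_deg)
    return convolve(signal, left_ir), convolve(signal, right_ir)
```

In a dynamic renderer this filtering is repeated as the head-relative source direction changes, so the filters in use follow the participant's movements.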

Localization under static binaural rendering (no head-tracking) results in several artifacts, the most important being front-back confusions (a source spatialized in front of the listener is perceived as behind, and vice versa) due to ambiguity in inter-aural differences, which are symmetric relative to the inter-aural axis. The auditory system can resolve those ambiguities using head movement [6][14]. Dynamic binaural rendering, as implemented in this study, allows the exploitation of head movements (via a head-mounted position tracking system) to constantly update the sound scene. Another important point is that HRTFs depend on human morphology, and therefore an optimal binaural synthesis should be individualized to the user for better localization performance. Adaptation to non-individual acoustic cues seems to be possible [15] but requires an additional learning phase. The solution adopted in the current study was for each participant to select an "optimal" HRTF from an existing database. This procedure consisted of presenting the synthesis of a series of short sound bursts rotating about the head at a fixed distance (first in the horizontal plane, then in the median plane) using a small set of HRTFs. These HRTFs were selected from the LISTEN HRTF measurement database [16] following a perceptually significant statistical reduction procedure. The HRTF chosen by the participant as providing the most realistic source positioning, according to the known path of the sound source, was used. A modified version of Spat was used that allowed for the individualization of the inter-aural time delay, based on head circumference, independently of the selected HRTF.
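
The front-back symmetry of the inter-aural time difference, and its dependence on head size, can be illustrated with the classical Woodworth spherical-head approximation. This is a textbook model used here as an assumption; the formula actually used in the modified Spat is not given in this document, and the function names and the speed of sound value are chosen for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def head_radius_m(circumference_m):
    """Radius of a sphere with the given circumference."""
    return circumference_m / (2.0 * math.pi)

def itd_seconds(azimuth_deg, circumference_m):
    """Woodworth approximation of the ITD for a source in the horizontal plane.

    Azimuth is 0 degrees straight ahead and 180 degrees straight behind.
    The lateral angle folds rear positions onto their front mirror image,
    which is exactly why the ITD alone cannot separate front from back.
    """
    lateral = math.asin(math.sin(math.radians(azimuth_deg)))
    radius = head_radius_m(circumference_m)
    return (radius / SPEED_OF_SOUND) * (lateral + math.sin(lateral))
```

With this model a source at 30 degrees and its mirror image at 150 degrees yield the same delay, while a larger head circumference scales all delays up, which is the kind of adjustment an individualized time delay provides.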

Video Examples

Two video examples are provided to better illustrate the system in operation. Video clip (1) shows the actual physical environment in which the virtual reality system was installed, with a participant during the exploratory phase of the experiment. Video clip (2) shows the virtual environment during several phases of the experimental procedure. The virtual scene is shown from both the participant's subjective view and the overhead view. The virtual audio scene is also included and is rendered for the participant's position in the scene.

ExcerptExpeCam.mov
(2.1MB)

ExcerptExpeVR.mov
(49MB)

Prospects

This work has put in place an immersive, interactive, open-architecture suite of tools that can be easily combined and configured for complex scenarios. The system is expected to be used for further experiments, both in psychological studies and in other VR&A application developments.

This project falls under the internal LIMSI-CNRS initiative designed to support multi-disciplinary research and more specifically within the transversal action: Virtual ENvironment for Immersive Simulation and Experiments (VENISE).

Additional details concerning the results of the psychological study and the architecture can be found at the following LIMSI links:

References

[1] S. Lambrey and A. Berthoz, "Combination of conflicting visual and non-visual information for estimating actively performed body turns in virtual reality", Intl. J. Psychophysiology, vol. 50, pp. 101-115, 2003.
[2] J.M. Loomis, R.L. Klatzky, and R.G. Golledge, "Auditory distance perception in real, virtual, and mixed environments", Mixed reality: Merging real and virtual worlds, Y. Ohta and H. Tamura (Eds.), Tokyo, Ohmsha, pp. 201-214, 1999.
[3] I. Viaud-Delmon, A. Seguelas, E. Rio, R. Jouvent and O. Warusfel, "3-D Sound and Virtual Reality: Applications in Clinical Psychopathology", Cybertherapy, San Diego, 2004.
[4] I. Viaud-Delmon, L. Sarlat and O. Warusfel, "Virtual Ventriloquism: Localization of Auditory Sources in Virtual Reality", Proc CFA/DAGA, Strasbourg, May 2004.
[5] M. von der Heyde and H.H. Buelthoff, Perception and action in virtual environments, Cognitive and Computational Psychophysics Department, Max Planck Institute for Biological Cybernetics, Tuebingen, Germany, 2000.
[6] P. Minnaar, S.K. Olesen, F. Christensen, and H. Møller, "The importance of head movements for binaural room synthesis", Proc ICAD, Espoo, Finland, July 29-August 1, 2001.
[7] C. Frauenberger and M. Noisternig, "3D Audio Interface for the Blind", Proc ICAD, Boston University, Boston, MA, July 7-9, 2003.
[8] C. Jacquemin, VirChor. Virtual Choreographer. LIMSI-CNRS. http://virchor.sourceforge.net/.
[9] MAX/MSP. Cycling'74 http://www.cycling74.com/.
[11] E. Kahle, Validation d'un modèle objectif de la perception de la qualité acoustique dans un ensemble de salles de concerts et d'opéras, PhD thesis, Université du Maine, Le Mans, 1995.
[12] J. Blauert, Spatial Hearing, the psychophysics of human sound localization, MIT Press, Cambridge, 1996.
[13] D.R. Begault, 3-D Sound for Virtual Reality and Multimedia, Academic Press, Cambridge, MA, 1994.
[14] F.L. Wightman and D.J. Kistler, "Resolution of front-back ambiguity in spatial hearing by listener and source movement", J. Acoust. Soc. Am., vol. 105, no. 5, pp. 2841-2853, 1999.
[15] A. Blum, B.F.G. Katz, and O. Warusfel, "Eliciting adaptation to non-individual HRTF spectral cues with multi-modal training", Proc CFA/DAGA, Strasbourg, May 2004.
[16] Listen Project. Information Society Technologies Program - IST-1999-20646: http://listen.gmd.de/ LISTEN HRTF Database: http://www.ircam.fr/equipes/salles/listen/.

Relevant publications resulting from this work