The design of affective interfaces, such as credible expressive characters in storytelling applications, requires understanding and modeling the relations between realistic emotions and behaviors in different modalities such as facial expressions, speech, body movements and hand gestures. Until now, most experimental studies of multimodal behaviors related to emotions have considered only basic, acted emotions and their relation to mono-modal behaviors such as facial expressions [1] or body movements [2]. Recent tools [3] facilitate the annotation and collection of multimodal corpora but raise several issues regarding the study of emotions: How should we annotate multimodal behaviors occurring during emotions? What are the relevant behavioral dimensions to annotate? How do basic acted emotions and non-acted emotions differ in their multimodal behaviors?
EmoTV is an audiovisual corpus of 51 video clips of emotionally rich monologues from TV interviews (on various topics such as politics, law and sports) that we collected for studying non-acted emotions. We designed a coding scheme for annotating the context and several dimensions of emotions (categories, activation, valence), both at the level of the whole video clip and at the level of the emotional segments that each clip contains [4]. The main conclusions of a first annotation phase on the 51 clips were that emotional segments cannot be labeled with a single emotion label but rather with a combination of two labels. Furthermore, classical schemes used for detailed annotation of communicative multimodal behaviors proved to be partly inappropriate for non-acted emotions [5]. The corresponding parts of the coding scheme were either removed or modified in order to improve the annotation process.
On this page, we describe the new coding scheme that we designed for annotating multimodal behaviors during real-life mixed emotions. This scheme focuses on the annotation of emotion-specific behaviors in speech, head and torso movements, facial expressions, gaze, and hand gestures. We do not aim at collecting detailed data on each individual modality or statistically representative models of the relations between emotions and multimodal behaviors. Instead, our goals are to use the annotations produced with this scheme to identify the required levels of representation for realistic emotional behaviors and to explore the coordination between modalities during non-acted behaviors observed in individual videos.
We grounded our coding scheme on requirements drawn from two sources: the parameters described as perceptually relevant for the study of emotional behavior, and the features of the emotionally rich TV interviews that we selected.
The following measures are thus required for the study of emotional behaviors: the expressivity of movements (the number of repetitions, the fluidity, the strength, the speed, and the spatial expansion), the number of annotations in each modality, their temporal features (duration, alternation, repetition, and structural descriptions of gestures), the directions of movements, and the functional description of relevant gestures.
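As an illustration of how the measures listed above could be represented, the following is a minimal sketch of one annotation record together with two of the simpler measures (annotation count per modality and mean duration). The class name `Annotation`, its fields, and the helper functions are assumptions made for this example; they are not taken from the coding scheme itself.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One annotated behavior (hypothetical record layout)."""
    modality: str   # e.g. "head", "torso", "hand"
    start: float    # start time in seconds
    end: float      # end time in seconds
    # Expressivity attributes, e.g. {"speed": "fast", "fluidity": "jerky"}
    expressivity: dict = field(default_factory=dict)

    @property
    def duration(self) -> float:
        return self.end - self.start


def count_per_modality(annotations):
    """Number of annotations in each modality."""
    return Counter(a.modality for a in annotations)


def mean_duration(annotations, modality):
    """Mean duration (in seconds) of the annotations of one modality."""
    durations = [a.duration for a in annotations if a.modality == modality]
    return sum(durations) / len(durations) if durations else 0.0
```

Keeping the expressivity attributes in a free-form dictionary is one possible design choice: it lets dimensions such as direction or gesture function be added without changing the record type.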
We defined the coding scheme at an abstract level and then implemented it as an XML file for use with the Anvil tool [3]. Tracks are annotated one after the other (e.g. the annotator first annotates the first track for the whole video and then proceeds to the next track). An example of the annotation of multimodal behaviors is provided in Figure 1.
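To make the track-based organization concrete, the sketch below reads a small track-structured XML document with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`track`, `el`, `attribute`, `start`, `end`) and the sample content are assumptions for this example, loosely modeled on a track/element layout; they should not be read as the actual Anvil file format or as the specification file used for EmoTV.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified annotation file: one <track> per modality,
# one <el> per annotated behavior, with optional expressivity attributes.
SAMPLE = """
<annotation>
  <body>
    <track name="head">
      <el index="0" start="0.0" end="1.2">
        <attribute name="speed">fast</attribute>
      </el>
      <el index="1" start="1.5" end="2.0">
        <attribute name="speed">slow</attribute>
      </el>
    </track>
    <track name="torso">
      <el index="0" start="0.4" end="0.9"/>
    </track>
  </body>
</annotation>
"""


def parse_tracks(xml_text):
    """Map each track name to its list of annotated elements."""
    root = ET.fromstring(xml_text.strip())
    tracks = {}
    for track in root.iter("track"):
        elements = []
        for el in track.findall("el"):
            attributes = {
                a.get("name"): (a.text or "").strip()
                for a in el.findall("attribute")
            }
            elements.append({
                "start": float(el.get("start")),
                "end": float(el.get("end")),
                "attributes": attributes,
            })
        tracks[track.get("name")] = elements
    return tracks
```

Calling `parse_tracks(SAMPLE)` yields a dictionary keyed by track name, which mirrors the track-by-track order in which the annotator works.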
With this new coding scheme, one coder produced 455 multimodal annotations of behaviors in the different modalities for the 19 emotional segments of 4 videos selected for their multimodally rich content (e.g. expressive gestures), for a total duration of 77 seconds. These annotations were validated and corrected by a second coder. We developed software for parsing the resulting annotation files and for computing measures. It makes it possible to compare the "expressivity profile" of different videos that feature blended emotions (Table 1), similarly to the work done by [6] on expressive embodied agents. For example, videos #3 and #36 are quite similar regarding their emotion labels, average intensity and valence (although their durations are quite different).
| Video | #3 | #36 | #30 |
|---|---|---|---|
| Duration | 37s | 7s | 10s |
| Emotion labels | Anger (66%) | Anger (55%) | Exaltation (50%) |
| Intensity (1: min - 5: max) | 5 | 4.6 | 4 |
| Valence | 1 | 1.6 | 4.3 |
| % head movement | 1st (56%) | 1st (60%) | 1st (72%) |
| % torso movement | 2nd (28%) | 2nd (20%) | 2nd (27%) |
| % hand movement | 3rd (16%) | 3rd (20%) | 3rd (0%) |
| % fast vs. % slow | Fast | Fast | Fast |
| % hard vs. % soft | 17 vs. 17 | Hard | Soft |
| % jerky vs. % smooth | Jerky | Jerky | Smooth |
| % expanded vs. % contracted | Contracted | Contracted | Contracted |
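The kind of profile shown in the table can be approximated with a few lines of code. The sketch below is not the software mentioned above; it is a minimal illustration that assumes each annotation is a dictionary with a `modality` key and optional expressivity keys, and that percentages are computed over all annotations of a video. The function names and the tie convention (printing both percentages) are assumptions.

```python
from collections import Counter


def modality_shares(annotations):
    """Return (modality, percent) pairs, most frequent modality first."""
    counts = Counter(a["modality"] for a in annotations)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(m, round(100 * n / total)) for m, n in counts.most_common()]


def dominant_pole(annotations, dimension, poles):
    """Name the more frequent of two opposite poles, or 'x vs. y' on a tie.

    Percentages are taken over all annotations, including those that
    carry no value for the given dimension (an assumption of this sketch).
    """
    total = len(annotations)
    if total == 0:
        return None
    counts = Counter(a.get(dimension) for a in annotations)
    first, second = poles
    pct_first = round(100 * counts[first] / total)
    pct_second = round(100 * counts[second] / total)
    if pct_first == pct_second:
        return f"{pct_first} vs. {pct_second}"
    return first if pct_first > pct_second else second
```

For instance, `dominant_pole(annotations, "speed", ("fast", "slow"))` summarizes the speed dimension of one video, and `modality_shares(annotations)` gives the ranking of head, torso and hand movements by share of annotations.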
We are currently investigating the use of such annotations in a copy-synthesis approach for the specification of expressive embodied agents, which can be useful for perceptual validation [7].
Future directions include the annotation of other videos of EmoTV, the validation of the annotations by computing inter-coder agreement across several coders, and the computation of other relations between 1) the multimodal annotations and 2) the annotations of emotions (labels, intensity and valence) and the global annotations, such as the modalities in which activity was perceived as relevant to emotion.
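The agreement measure to be used is not specified here. As one possible choice, the sketch below computes Cohen's kappa for two coders who each assigned one categorical label (for example a speed value) to the same sequence of items. Extending it to more than two coders would require a different statistic, and the function name is an assumption of this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: proportion of items given identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement, from each coder's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance.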