Automatic annotation of dialog acts in human-human dialog corpora

Sophie Rosset, Delphine Tribout

Objective

Recently there has been growing interest in using dialog acts to characterize human-human and human-machine dialogs. To capture the richness of human-human call center dialogs, it is useful to explore and correlate dialog features at multiple levels: lexical, semantic and functional. We are also interested in automatically modeling discourse structure in order to develop more sophisticated spoken dialog systems. We have been working on automatic detection of dialog acts in the Amities corpus [1]. A Memory Based Learning methodology was adopted since it works well with small amounts of data and has been shown to be well adapted to natural language processing. In this approach, the feature vectors of the test data are compared to those in the training data. The features include the speaker, the number of utterance units in the turn, the previous (hypothesized) dialog acts and N tag words per utterance unit. Our automatic tagger is used to produce these tag words.

Description

Corpus Description

The main corpus (GE_fr) used in this study consists of 134 agent-client dialogs in French recorded at a bank call center service. The dialogs cover a range of investment related topics such as information requests (credit limit, account balance), orders (change the credit limit) and account management (open, close, modify personal details). The application domain is structured into 6 major topics, hierarchically organized into 45 sub-topics. These dialogs were orthographically transcribed with Transcriber, a tool for segmenting, labeling and transcribing speech~\cite{trans01}. This corpus was divided into 2 sets for training (containing 94 dialogs, 2923 turns, 3912 utterance units) and testing purposes (containing 40 dialogs, 1350 turns, 1711 utterance units).

For the second part of our experiments, we used two other corpora. The first one, CAP_fr (24 dialogs, 1025 turns, 1203 utterance units), consists of agent-client recordings in French from a Web-based Stock Exchange Customer Service center. While many of the calls concern problems in using the Web to carry out transactions (general information, complicated requests, transactions, confirmations, connection failures), some of the callers simply seem to prefer interacting with a human agent. The dialogs cover a range of investment related topics such as information requests (services, commission fees, stock quotations), orders (buy, sell, status), account management (open, close, transfer, credit, debit) and Web questions/problems.

The second one, GE_eng (31 dialogs, 1147 turns, 1357 utterance units), consists of agent-client dialogs in English recorded at a bank call center service. The dialogs cover essentially the same investment related topics as the GE_fr corpus.

Dialog segmentation and annotation

A dialog can be divided into units called turns, in which a single speaker has temporary control of the dialog and speaks for some period of time. Within a turn, the speaker may produce several utterance units, where the definition of an utterance unit is based on an analysis of the speaker's intention (the dialog acts). Once a turn is segmented into units, these have to be annotated with dialog acts (see fig. 1).


Fig. 1: Utterance unit segmentation and dialog act annotation


The taxonomy is the one adopted in the Amities project. In this study, the dialogic tags are classified into eight dimensions to allow multiple tags to be specified for each utterance unit (if no tag is relevant it is represented by NA (not applicable)):


Although the number of possible tag combinations is huge (1,016,064), only 197 are observed in the 3912 training utterance units. Six of them represent 51% of the corpus. For example, if the Class1 tag is Task (52%), then the Class2 tag is either NA (26%) or Assert (26%), and Class3 is NA (see fig. 3). The succession of class tags within an utterance unit is therefore strongly predictive.

Combination of dialog acts succession
Then, this work is based on three hypotheses:

Methodology for automatic annotation

All the data have been automatically tagged with specific entities. This tagging is done in two steps: the first is language dependent but task independent and consists of automatic tagging of named entities; the second is language independent but task dependent and consists of task entity detection. These taggers use rewrite rules, which work like local grammars, together with specific dictionaries. They replace the specific entities by tag words expressing their types. First, the turn is tagged. Each speaker turn is input independently to the system. The first N words of each utterance are used as lexical features. The number of utterance units in the turn is used as additional information. All these features are put in a vector and the dialog act for the first dimension is predicted using memory based learning (MBL), more specifically the TiMBL implementation [2], since it works well with small amounts of data and has been shown to be well adapted to natural language processing. We use the Manhattan distance, where the distance between two patterns is simply the sum of the differences between their features. MBL works by finding the vector in the training database closest to the test one. Two different vector databases are built, the first one for the Agent and the second one for the Client.
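The nearest-neighbour lookup described above can be illustrated with a minimal sketch. It is not the TiMBL implementation and does not use its API: it assumes that all features are symbolic, that a feature contributes 0 to the distance when the two values are equal and 1 otherwise, and that ties between equally close vectors are broken by majority vote. The function names, the padding symbol "_" and the toy memory entries (including the label "Other") are hypothetical.

```python
from collections import Counter

def distance(a, b):
    """Sum of per-feature differences between two equal-length vectors.

    Assumption: each symbolic feature counts 0 if the values match, 1 otherwise.
    """
    return sum(0 if x == y else 1 for x, y in zip(a, b))

def predict(memory, query):
    """Return the label of the stored vector(s) closest to the query."""
    best = min(distance(vec, query) for vec, _ in memory)
    # Majority vote among all stored vectors at the minimal distance.
    votes = Counter(label for vec, label in memory
                    if distance(vec, query) == best)
    return votes.most_common(1)[0][0]

# One memory per speaker role, as in the text (toy entries, not corpus data).
agent_memory = [
    (("1", "donnez", "-moi", "votre", "numéro"), "Task"),
    (("1", "bonjour", "madame", "_", "_"), "Other"),
]

query = ("1", "donnez", "-moi", "votre", "code")
label = predict(agent_memory, query)
```

Here the query differs from the first stored vector in one feature and from the second in four, so the label of the first vector is returned.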

Fig. 4 schematically represents the dialog act classification method.



Fig. 4: Dialog Acts Classification

The result of this first prediction is added as an element of the vector used to predict the next dialog act. After the utterance has been classified for all 8 dialog act dimensions, if there is more than one utterance unit in the turn, the next N words of the utterance are added to the vector containing the hypotheses for the previous utterance unit.

For example, the training turn

Agent: donnez -moi votre numéro de compte (give me your account number)

having the following dialog act tags:

DAs: information-level=Task; influence-on-listener=Action-directive

is represented for the first dialog act prediction by the following vector in the Agent Vector Database:

[1 donnez -moi votre numéro]

If the prediction for the first dialog act is Task, then the vector for the prediction of the second dialog act is:

[1 donnez -moi votre numéro Task]
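The way the vector grows from one dialog act prediction to the next can be sketched as follows. This is only an illustration under stated assumptions: N = 4 is inferred from the example vector above, the padding symbol "_" is hypothetical, and the two classifiers are stand-in functions rather than trained memory-based models.

```python
def lexical_features(n_units, words, n=4):
    """Number of utterance units in the turn, then the first n words.

    Assumption: short utterances are padded with "_" to a fixed length.
    """
    padded = (list(words) + ["_"] * n)[:n]
    return [str(n_units)] + padded

def cascade(features, classifiers):
    """Predict each dimension in order, feeding every hypothesis forward."""
    vector = list(features)
    hypotheses = []
    for classify in classifiers:
        tag = classify(tuple(vector))
        hypotheses.append(tag)
        vector.append(tag)  # the hypothesis becomes a feature for the next step
    return hypotheses, vector

# Stand-in classifiers (hypothetical), one per dialog act dimension.
def information_level(vector):
    return "Task"

def influence_on_listener(vector):
    return "Action-directive" if vector[-1] == "Task" else "NA"

words = "donnez -moi votre numéro de compte".split()
features = lexical_features(1, words)
hypotheses, final_vector = cascade(
    features, [information_level, influence_on_listener])
```

With these stand-ins, `features` is the vector of the first prediction in the example, and the vector passed to the second classifier ends in `Task`.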

Results and Prospects

To test the hypothesis further, the models trained on the GE_fr corpus were applied to the CAP_fr corpus (a change of task) and to the GE_eng corpus (a change of language). With this basic system, a dialog act detection error rate of about 16% is obtained for the same domain and language condition, and about 25% for the cross-language and cross-domain conditions [3].

Other sources of information, such as the dialog history, are also likely to be useful for predicting dialog acts. The experiments using historical information are based on two hypotheses: first, that there are relations between the different utterance units in one turn and that these relations are organized; second, that since a dialog is a succession of turns, the dialog acts of a turn influence the dialog acts of the next turn. We tried different sizes of dialog history. The best results were obtained with the following combination: for the first two utterance units, the dialogic information of the last utterance unit of the previous turn was used; the third utterance unit of the current turn is treated as a first utterance unit and no previous history information is added to the vector. With this configuration, a dialog act detection error rate of about 12.3% is obtained in the same domain and language condition. The error rate is about 20% for the cross-domain condition and 19.5% for the cross-language condition [4].
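The history configuration that gave the best results can be sketched as a small rule on the feature vector. This is one possible reading of the description, not the authors' code: the function name and argument layout are hypothetical, the utterance unit position is 0-based, and units after the third are assumed to be treated like the third, which the text does not specify.

```python
def with_history(vector, unit_index, previous_turn_tags):
    """Append dialog history to a feature vector.

    unit_index: 0-based position of the utterance unit within the current turn.
    previous_turn_tags: dialog act tags of the last utterance unit of the
    previous turn (empty at the start of a dialog).
    """
    if unit_index < 2 and previous_turn_tags:
        # First two utterance units: add the history of the previous turn.
        return list(vector) + list(previous_turn_tags)
    # Third unit (and, by assumption, later ones): no history is added.
    return list(vector)

history = ["Task", "Action-directive"]
first = with_history(["2", "oui"], 0, history)
third = with_history(["3", "merci"], 2, history)
```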

References

[1] AMITIES Project.
[2] W. Daelemans, J. Zavrel, K. van der Sloot, A. van den Bosch (2003). TiMBL: Tilburg Memory Based Learner, v5.0, Reference Guide. ILK Technical Report ILK-03-10.
[3] S. Rosset, L. Lamel (2004). Automatic Detection of Dialog Acts Based on Multi-level Information, Proc. of ICSLP'04. 540-543.
[4] S. Rosset, D. Tribout (2005). Multi-level Information and Automatic Dialog Acts Detection in Human-Human Spoken Dialogs, Proc. of Interspeech'05. 2789-2792.