Objective
Recently there has been growing interest in using dialog acts to
characterize human-human and human-machine dialogs. In order to capture
the richness of human-human call center dialogs, it is interesting to
explore and correlate dialog features at multiple levels: lexical,
semantic and functional. We are also interested in automatically
modeling discourse structure in order to develop more sophisticated
spoken dialog systems. We have been working on automatic detection of
dialog acts in the Amities corpus [1]. A Memory Based Learning
methodology was adopted since it works well with small amounts of data
and has been shown to be well adapted to natural language processing.
In this approach, the feature vectors of the test data are compared to
those in the training data. The features include the speaker, the
number of utterance units in the turn, the previous (hypothesized)
dialog acts and N tag words per utterance unit. The tag words are
produced by our automatic tagger.
Description
Corpus Description
The main corpus (GE_fr) used in this study consists of 134 agent-client
dialogs in French recorded at a bank call center service. The dialogs
cover a range of investment related topics such as information requests
(credit limit, account balance), orders (change the credit limit) and
account management (open, close, modify personal details). The
application domain is structured into 6 major topics, hierarchically
organized into 45 sub-topics. These dialogs were orthographically
transcribed with Transcriber, a tool for segmenting, labeling and
transcribing speech. The corpus was divided into two sets: a training
set (94 dialogs, 2923 turns, 3912 utterance units) and a test set
(40 dialogs, 1350 turns, 1711 utterance units). For the second part of
our experiments, we used two other corpora. The first one, CAP_fr
(24 dialogs, 1025 turns, 1203
utterance units), consists of agent-client recordings in French from a
Web-based Stock Exchange Customer Service center. While many of the
calls concern problems in using the Web to carry out transactions
(general information, complicated requests, transactions, confirmations,
connection failures), some of the callers simply seem to prefer
interacting with a human agent. The dialogs cover a range of investment
related topics such as information requests (services, commission fees,
stock quotations), orders (buy, sell, status), account management
(open, close, transfer, credit, debit) and Web questions/problems. The
second one, GE_eng (31 dialogs, 1147 turns, 1357 utterance units), consists
of agent-client dialogs in English recorded at a bank call center
service. The dialogs cover essentially the same investment related
topics as the GE_fr corpus.
Dialog segmentation and annotation
A dialog can be divided into units called turns, in which a single
speaker has temporary control of the dialog and speaks for some period
of time.
Within a turn, the speaker may produce several utterance units, where
the definition of an utterance unit is based on an analysis of the
speaker's intention (the dialog acts). Once a turn is segmented into
utterance units, these have to be annotated with dialog acts (see
Fig. 1).
Fig. 1: Utterance unit segmentation and dialog act annotation
The taxonomy is the one adopted in the Amities project. In this study,
the dialogic tags are classified into eight dimensions to allow
multiple tags to be specified for each utterance unit. If no tag is
relevant in a dimension, it is represented by NA (not applicable):
- Class 1 (Information Level): characterizes the semantic content of
the utterance unit. The different tags are Communication-mgt,
Out-of-topic, Task, Task-management-Completion, Task-management-Order,
Task-management-Summary, Task-management-System-Capabilities.
- Class 2 (Statement): makes a claim about the world, and tries to
change the beliefs of the listener. The different tags are Assert,
Commit, Explanation, Expression, ReExplanation, Reassert.
- Class 3 (Conventional): refers to utterance units which initiate or
close the dialog. The different tags are Closing and Opening.
- Class 4 (Influence on Listener): in this group of tags, the speaker
is asking the listener a question, directing him or her to do
something, or suggesting some course of action the listener may take.
The different tags are Action-directive, Explicit-Confirm-request,
Explicit-Info-request, Implicit-Confirm-request, Implicit-Info-request,
Offer, Open-Option, Re-Action-directive, Re-Confirm-request,
Re-Info-request, Re-Offer.
- Class 5 (Agreement): indicates whether the speaker accepts a
proposal, offer or request, or confirms the truth of a statement or
confirmation-request. The different tags are Accept, Accept-part,
Maybe, Reject, Reject-part.
- Class 6 (Answer): is a response to an Information-request or
Confirmation-request. An answer by definition will always be an
assertion, as it provides information or confirms a previous
supposition, and it makes a claim about the world. Therefore only one
tag is used: True.
- Class 7 (Understanding): reveals whether and in what way the speaker
heard and understood what the other speaker was saying. The different
tags are Backchannel, Completion, Correction, Non-understanding,
Repeat-rephrase.
- Class 8 (Communicative Status): refers to the features of the
communication. The different tags are AbandStyle, AbandTrans,
AbandChangeMind, AbandlossIdeas, Interrupted, Self-talk.
Although the number of possible tag combinations is huge (1,016,064),
only 197 are observed in the 3912 training utterance units, and six of
them represent 51% of the corpus. For example, if the Class 1 tag is
Task (52%), then the Class 2 tag is either NA (26%) or Assert (26%), and
the Class 3 tag is NA (see Fig. 3). The succession of class tags within
an utterance unit is thus strongly predictive.
Fig. 3: Combination of dialog acts succession
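The figure of 1,016,064 possible combinations can be checked with a short calculation. The sketch below assumes that each of the eight classes can take any one of its listed tags or NA, and that the classes combine freely; under that assumption the product matches the number quoted above. The dictionary name and layout are illustrative only.

```python
from math import prod

# Number of tags listed for each class in the taxonomy above.
TAGS_PER_CLASS = {
    "Information Level": 7,
    "Statement": 6,
    "Conventional": 2,
    "Influence on Listener": 11,
    "Agreement": 5,
    "Answer": 1,
    "Understanding": 5,
    "Communicative Status": 6,
}

# Assumption: every class may also be NA, hence one extra value per class.
total_combinations = prod(n + 1 for n in TAGS_PER_CLASS.values())
print(total_combinations)  # 1016064

# Share of the theoretical space actually observed in the training data.
observed = 197
print(f"{observed / total_combinations:.4%} of combinations observed")
```

The 197 observed combinations cover only a tiny fraction of this space, which is why the tag succession is such a strong predictor.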
This work is therefore based on three hypotheses:
- The dialog act succession is strongly constrained.
- The initial words are more important than the remaining words in
identifying the dialog act, for example "I'd like...", "can you give
me...".
- The information is encoded in specific entities:
  - Named Entities, which are expressions for people, places and
  organizations;
  - Task Entities, which are named entities that describe task or
  domain specific knowledge, such as account number or account amount;
  - Linguistic Entities, which give structure to the utterances, for
  example "I'd like to...".
Methodology for automatic annotation
All the data have been automatically tagged with specific
entities. This tagging is done in two steps: the first one is language
dependent but task independent and consists of automatic tagging of
named entities. The second one is language independent but task
dependent, consisting of task entity detection. These taggers use
rewrite rules, which work like local grammars, together with specific
dictionaries. They replace the specific entities by tag words
expressing their types. First, the turn is tagged. Each speaker turn
is input independently to the system. The first N words of each
utterance are used as lexical features. The number of utterance units
in the turn is used as additional information. All these features are
put in a vector and the dialog act for the first dimension is
predicted using memory based learning (MBL), more specifically the
Timbl implementation [2], since it works well with small amounts of
data and it has been shown to be well adapted for natural language
processing. We use the Manhattan distance, where the distance between
two patterns is simply the sum of the differences between the
features. MBL works by finding the vector in the training database
closest to the test vector. Two different vector databases are built,
the first one for the Agent and the second one for the Client.
Fig. 4 schematically represents the dialog act classification
method.
Fig. 4: Dialog Acts Classification
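The nearest-neighbour lookup described above can be illustrated with a minimal sketch. This is not the Timbl implementation: it assumes that the difference between two symbolic features is 0 when they are equal and 1 otherwise, so that the distance is the number of mismatching positions. The names `distance`, `predict` and `agent_db`, the "_" padding symbol and the second database entry are all hypothetical toy choices.

```python
def distance(a, b):
    """Sum of per-feature differences between two equal-length vectors.

    Assumption: a symbolic feature contributes 0 if equal, 1 otherwise.
    """
    return sum(x != y for x, y in zip(a, b))


def predict(vector, database):
    """Return the label of the stored vector closest to `vector`."""
    _, label = min(database, key=lambda entry: distance(vector, entry[0]))
    return label


# Toy Agent database of (feature vector, Class 1 tag) pairs.
# The first entry follows the example in the text; the second is invented.
agent_db = [
    (("1", "donnez", "-moi", "votre", "numéro"), "Task"),
    (("1", "bonjour", "madame", "_", "_"), "Communication-mgt"),
]

test_vector = ("1", "donnez", "-moi", "votre", "nom")
print(predict(test_vector, agent_db))  # Task
```

In the setup described here, a separate database of this kind would be kept for the Agent and for the Client.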
The result of this first prediction is then added as an element of the
vector used to predict the next dialog act. After the utterance has been
classified for all 8 dialog act dimensions, if there is more than one
utterance unit in the turn, the next N words of the utterance are added
to the vector containing the hypotheses for the previous utterance unit.
For example, the training turn
Agent: donnez -moi votre numéro de compte (give me your account number)
with the following dialog act tags:
DAs: information-level=Task; influence-on-listener=Action-directive
is represented for the first dialog act prediction by the following
vector in the Agent Vector Database:
[1 donnez -moi votre numéro]
If the prediction for the first dialog act is Task,
then the vector for the prediction of the second dialog act is:
[1 donnez -moi votre numéro Task]
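The construction of these vectors can be sketched as follows. The example assumes N = 4, inferred from the four words shown in the vector above, and uses stand-in classifiers that return fixed tags in place of the memory based learner. The function names `first_vector` and `cascade` are hypothetical.

```python
def first_vector(n_units, words, n=4):
    """Features for the first dimension: unit count, then the first n words.

    Assumption: n = 4, as suggested by the example vector in the text.
    """
    return [str(n_units)] + words[:n]


def cascade(vector, classifiers):
    """Predict each dimension in turn, appending each hypothesis to the vector.

    Returns the predicted tags and the vector used for each prediction.
    """
    tags, vectors_used = [], []
    for classify in classifiers:
        vectors_used.append(list(vector))
        tag = classify(vector)
        tags.append(tag)
        vector = vector + [tag]
    return tags, vectors_used


words = "donnez -moi votre numéro de compte".split()
vec = first_vector(1, words)
print(vec)  # ['1', 'donnez', '-moi', 'votre', 'numéro']

# Stand-in classifiers with fixed outputs; a real system would query MBL here.
stubs = [lambda v: "Task", lambda v: "NA"]
tags, vectors_used = cascade(vec, stubs)
print(vectors_used[1])  # ['1', 'donnez', '-moi', 'votre', 'numéro', 'Task']
```

Feeding each hypothesis back into the vector is what lets the classifier exploit the strong constraints on tag succession noted earlier.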
Results and Prospects
To test the hypothesis further, the models trained on GE_fr corpus
were applied to the CAP_fr corpus (a change of task) and to the GE_eng
corpus (a change of language). With this basic system, the dialog act
detection error rate is about 16% for the same domain and language
condition, and about 25% for the cross-language and cross-domain
conditions [3]. We hypothesize that other sources of information, such
as the dialog history, could also help predict dialog acts. The
experiments using history information are based on two hypotheses:
first, that there are relations between the different utterance units
in one turn and that these relations are organized; second, that since
a dialog is a succession of turns, the dialog acts of one turn
influence the dialog acts of the next turn. We tried different sizes
of dialog history. The best results were obtained with the following
combination: for the first two utterance units, the dialogic
information of the last utterance unit of the previous turn is used;
the third utterance unit of the current turn is treated as a first
utterance unit, and no history information is added to the vector.
With this configuration, the error rate is about 12.3% for the same
domain and language condition, about 20% for the cross-domain
condition and 19.5% for the cross-language condition [4].
References
[1] AMITIES Project.
[2] Daelemans, J. Zavrel, K. van der Sloot, A. van den Bosch (2003).
TiMBL: Tilburg Memory Based Learner, v5.0, Reference Guide. ILK
Technical Report ILK-03-10.
[3] S. Rosset, L. Lamel (2004). Automatic Detection of Dialog Acts
Based on Multi-level Information. Proc. of ICSLP'04, 540-543.
[4] S. Rosset, D. Tribout (2005). Multi-level Information and Automatic
Dialog Acts Detection in Human-Human Spoken Dialogs. Proc. of
Interspeech'05, 2789-2792.