SPEAKER IDENTIFICATION AND VERIFICATION
(from LIMSI 1995 Scientific Report, March 1995)
J.L. Gauvain, L.F. Lamel, B. Prouts
The experiments on the telephone corpus were carried out in collaboration with the Vecsys company in the context of a contract with France-Telecom.
Speaker verification has been the subject of active research for many years and has many potential applications where the security or privacy of information is a concern. Our studies assess performance levels for both high-quality speech and telephone speech, and for two operational modes: text-dependent and text-independent speaker verification.
A statistical modeling approach is taken, where the talker is viewed as a source of phones, modeled by a fully connected Markov chain. The lexical and syntactic structures of the language are approximated by local phonotactic constraints, and each phone is in turn modeled by a 3-state left-to-right HMM. For text-independent identification, this provides a better model of the talker than simpler techniques such as long-term spectra, VQ codebooks, or a single Gaussian mixture. For speaker identification, a set of phone models is trained for each speaker; identification is performed by computing the phone-based likelihood of the signal under each speaker's models, and the identity corresponding to the model set with the highest likelihood is hypothesized. This approach has been shown to be successful not only for speaker identification but also for gender and language identification. When the same speaker model is applied to speaker verification, the likelihood ratio is compared to a speaker-independent threshold in order to decide acceptance or rejection.
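As a simplified illustration of the two decision rules above, the sketch below scores an utterance with a per-speaker diagonal Gaussian model standing in for the report's phone HMMs; the function names (`train_gaussian`, `identify`, `verify`), the background model used in the likelihood ratio, and the threshold value are all assumptions for illustration, not the actual LIMSI implementation.

```python
import math

def train_gaussian(frames):
    """Fit a diagonal Gaussian per feature dimension.
    A stand-in for training a set of speaker-specific phone HMMs."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(d)]
    var = [max(sum((f[i] - mean[i]) ** 2 for f in frames) / n, 1e-6)
           for i in range(d)]
    return mean, var

def log_likelihood(model, frames):
    """Log likelihood of the frame sequence under one speaker's model."""
    mean, var = model
    ll = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return ll

def identify(models, frames):
    """Identification: hypothesize the speaker whose model yields
    the highest likelihood for the signal."""
    return max(models, key=lambda spk: log_likelihood(models[spk], frames))

def verify(models, background, claimed, frames, threshold=0.0):
    """Verification: accept the claimed identity if the log likelihood
    ratio against a background model exceeds a speaker-independent
    threshold (threshold=0.0 is an arbitrary illustrative choice)."""
    llr = (log_likelihood(models[claimed], frames)
           - log_likelihood(background, frames))
    return llr > threshold
```

In practice the denominator of the likelihood ratio and the threshold setting dominate verification performance; the single-Gaussian scoring here merely mirrors the decision structure.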
The Viterbi algorithm is used to compute the joint likelihood of the incoming signal and the most likely state sequence, rather than the likelihood summed over all possible state sequences. This implementation is thus a modified phone recognizer where the output phone string is ignored and only the acoustic likelihood is taken into account. Maximum a posteriori (MAP) estimators are used to build speaker-specific models from a set of speaker-independent models. The speaker-independent seed models provide estimates of the parameters of the prior densities and also serve as an initial estimate for the segmental MAP algorithm, allowing a large number of parameters to be estimated from a small amount of adaptation data.
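The two mechanisms in this paragraph can be sketched in isolation. The first function is a log-domain Viterbi scorer returning the joint likelihood of the observations and the single best state path (what the modified phone recognizer uses), as opposed to the forward algorithm's sum over all paths. The second is a toy MAP update for a Gaussian mean, where `tau` is an assumed prior weight; the actual segmental MAP algorithm adapts full phone-HMM parameter sets, not a single mean.

```python
import math

NEG_INF = float("-inf")

def viterbi_log_likelihood(log_init, log_trans, log_emit):
    """Joint log likelihood of the observations and the most likely
    state sequence for an HMM (use NEG_INF for forbidden transitions,
    e.g. the skips disallowed by a left-to-right topology).

    log_init[s]     : log P(first state = s)
    log_trans[s][j] : log P(next state = j | state = s)
    log_emit[t][s]  : log p(observation t | state = s)
    """
    n_states = len(log_init)
    delta = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    for t in range(1, len(log_emit)):
        delta = [max(delta[s] + log_trans[s][j] for s in range(n_states))
                 + log_emit[t][j] for j in range(n_states)]
    return max(delta)

def map_update_mean(prior_mean, tau, data):
    """MAP point estimate of a Gaussian mean: interpolate the
    speaker-independent prior mean with the adaptation data.
    tau (the prior weight) is an illustrative hyperparameter."""
    n = len(data)
    return (tau * prior_mean + sum(data)) / (tau + n)
```

With little adaptation data the MAP estimate stays near the speaker-independent prior; as data accumulates it converges to the sample mean, which is why MAP adaptation tolerates small enrollment sets.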
Two corpora have been used for experiments: the BREF corpus which is used to calibrate the algorithm on high quality speech, but was not designed to perform speaker recognition experiments; and a telephone speech corpus which is presently being recorded over dialed-up telephone lines and has been especially designed to evaluate speaker recognition algorithms. For this second corpus each target speaker is recorded for multiple calls over a period of several months.
Speaker-specific phone models were trained for each target speaker on about 75 sentences (from a single session for the BREF corpus, and from 2 recording sessions for the telephone corpus) for 50 speakers from BREF and 45 speakers from the telephone corpus. On the BREF corpus, the text-independent identification rate is 99.9% using 4s of speech per trial and a maximum of two trials per validation attempt. In verification mode, the a posteriori equal error rate (the operating point at which the false acceptance and false rejection rates are equal) is 0.2% in text-independent mode when two verification attempts are allowed.
The results of the verification experiments on the telephone corpus are shown in Figure 1. In text-dependent mode the equal error rate is 3.5% with 4s of speech per trial and a maximum of two trials per authentication attempt.
Figure 1. ROC (Receiver Operating Characteristic) curves for different model types and operational modes on the telephone data: (a) baseline multi-Gaussian model using a single mixture of 32 Gaussians per speaker; (b) phone-based approach using 35 phone models, text-independent verification mode; (c) phone-based approach using 35 phone models, text-dependent verification mode; (d) identical to (c) with 2 trials (when 2 verification trials are authorized for target speakers and impostors, the average number of attempts is 1.1); (e) identical to (d) with exactly 4s of speech. The dotted line shows the points of equal error (false acceptance = false rejection).
Comparing the equal error rates (with only 1 trial per attempt and an average of 4.1s of speech per trial), we observe that the phone-based approach in text-independent mode performs significantly better than the Gaussian mixture model (7.3% vs. 9.0% EER), and that knowing the text reduces the EER to 5.1%. Allowing 2 trials per attempt reduces the EER to 4.4%, and requiring a fixed minimum of 4s of speech (as in the experiment on the BREF corpus) further reduces it to 3.5%.
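An a posteriori EER such as those quoted above can be computed by sweeping a decision threshold over pooled verification scores. The sketch below does this with hypothetical target and impostor score lists; it is a generic illustration of the metric, not the scoring used in these experiments.

```python
def equal_error_rate(target_scores, impostor_scores):
    """A posteriori equal error rate: sweep the threshold over all
    observed scores and return the error rate at the point where
    false rejection (targets scoring below threshold) and false
    acceptance (impostors scoring at or above it) are most nearly equal."""
    best = None
    for thr in sorted(set(target_scores + impostor_scores)):
        fr = sum(s < thr for s in target_scores) / len(target_scores)
        fa = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        if best is None or abs(fa - fr) < best[0]:
            best = (abs(fa - fr), (fa + fr) / 2)
    return best[1]
```

Because the threshold is chosen after seeing the test scores, an a posteriori EER is a best-case figure; an operational system must fix its speaker-independent threshold in advance.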