Improved speaker diarization using speaker identification

Xuan Zhu, Claude Barras, Sylvain Meignier, Jean-Luc Gauvain

Objective

This work describes recent advances in speaker diarization with a multi-stage segmentation and clustering system that incorporates a speaker identification step. The system builds upon the baseline audio partitioner used in the LIMSI broadcast news transcription system. The baseline partitioner provides a high cluster purity, but tends to split the data of speakers with a large quantity of speech into several segment clusters. Several improvements have been made to the baseline system. First, the iterative Gaussian mixture model (GMM) clustering has been replaced by a Bayesian information criterion (BIC) agglomerative clustering. Second, an additional clustering stage has been added, using a GMM-based speaker identification method. Finally, a post-processing stage refines the segment boundaries using the output of a transcription system. On the RT-04f and ESTER evaluation data, the multi-stage system reduces the speaker error by over 70% relative to the baseline system, and gives a reduction of between 40% and 50% relative to a single-stage BIC clustering system.

Description

Speaker diarization is the process of partitioning an input audio stream into homogeneous segments according to speaker identity. It can improve the readability of an automatic transcription by structuring the audio stream into speaker turns, and is also of interest for indexing multimedia documents. Following the definition proposed in the NIST 2004 Rich Transcription evaluation and the Technolangue ESTER evaluation [1,2], we consider the task where no a priori knowledge of the speakers' voices or of the number of speakers is provided, so that only a relative, show-internal speaker identification is performed.

Speaker partitioning is a useful preprocessing step for an automatic speech transcription system, since it discards non-speech segments for efficiency and provides data for the unsupervised speaker adaptation of the acoustic models. Our baseline audio partitioning system was developed for the LIMSI English broadcast news transcription system [3]. The signal is first segmented by taking the maxima of a local Gaussian divergence measure between two adjacent sliding windows. Each initial segment is used to seed one cluster, and a GMM with diagonal covariance matrices is trained on the segment data. The algorithm then alternates Viterbi resegmentation with GMM reestimation and merging steps, with the goal of maximizing an objective function consisting of a likelihood penalized by the number of segments and the number of clusters. It was shown to provide a high cluster purity (about 96%) and a cluster coverage slightly below 80% on the 1996 and 1997 NIST evaluation data. A final step labels each segment with its bandwidth and gender.

However, automatic transcription and diarization place different constraints on speaker segmentation and clustering. The position of the segment boundaries matters more for transcription, since cutting in the middle of a word causes transcription errors, whereas splitting a speaker into several clusters is less of a concern for transcription than for diarization.

We have thus proposed a new multi-stage architecture optimized for the speaker diarization task [4]. The iterative segmentation and GMM clustering has been replaced by an agglomerative clustering based on the Bayesian information criterion (BIC). Each cluster c_i is modeled by a single Gaussian with a full covariance matrix Σ_i estimated on the n_i acoustic frames of the cluster. The inter-cluster measure is:

$$\Delta\mathrm{BIC} = (n_i+n_j)\log|\Sigma| - n_i\log|\Sigma_i| - n_j\log|\Sigma_j| - \lambda\,\frac{1}{2}\left(d+\frac{d(d+1)}{2}\right)\log(n_i+n_j)$$

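where d is the dimension of the feature space, Σ is the covariance matrix of the merged cluster and λ is the penalty weight. At each step, the two nearest clusters (those with the lowest ΔBIC) are merged, and merging stops when the lowest ΔBIC becomes positive.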
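
The ΔBIC measure and the greedy merging loop can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the LIMSI implementation: the function names (`logdet_full_cov`, `delta_bic`, `bic_cluster`) are hypothetical, the covariances are maximum-likelihood estimates, and the data are synthetic 4-dimensional frames rather than real acoustic features.

```python
import numpy as np

def logdet_full_cov(frames):
    """Log-determinant of the maximum-likelihood full covariance of the frames."""
    # Each cluster needs more frames than feature dimensions,
    # otherwise the covariance is singular and the log-determinant is -inf.
    cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
    _, logdet = np.linalg.slogdet(cov)
    return logdet

def delta_bic(x_i, x_j, lam):
    """Delta-BIC between two clusters; rows are frames, columns are features."""
    n_i, d = x_i.shape
    n_j = x_j.shape[0]
    merged = np.vstack([x_i, x_j])
    penalty = lam * 0.5 * (d + d * (d + 1) / 2) * np.log(n_i + n_j)
    return ((n_i + n_j) * logdet_full_cov(merged)
            - n_i * logdet_full_cov(x_i)
            - n_j * logdet_full_cov(x_j)
            - penalty)

def bic_cluster(segments, lam=5.5):
    """Greedy merging: join the closest pair until every Delta-BIC is positive."""
    clusters = [np.asarray(s, dtype=float) for s in segments]
    while len(clusters) > 1:
        score, a, b = min(
            (delta_bic(clusters[a], clusters[b], lam), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters)))
        if score > 0:
            break
        clusters[a] = np.vstack([clusters[a], clusters[b]])
        del clusters[b]
    return clusters

# Synthetic example: two segments from one "speaker", one from another.
rng = np.random.default_rng(0)
speaker_a1 = rng.normal(0.0, 1.0, size=(500, 4))
speaker_a2 = rng.normal(0.0, 1.0, size=(500, 4))
speaker_b = rng.normal(5.0, 1.0, size=(500, 4))
clusters = bic_cluster([speaker_a1, speaker_a2, speaker_b])
print([len(c) for c in clusters])
```

A larger λ makes the penalty term dominate for longer, so more merges are accepted before the stopping condition is reached. This sketch recomputes every pairwise score at each iteration, which is simple but quadratic in the number of clusters.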

After several iterations of clustering, the amount of data per cluster increases, so more complex models can be used. State-of-the-art speaker recognition methods are thus used to improve the quality of the speaker clustering. Feature warping normalization is performed on each segment [5]. Then, for each gender and bandwidth condition, a matching Universal Background Model (UBM) with 128 diagonal Gaussians is MAP-adapted to the data of each cluster. A second stage of agglomerative clustering is performed using the cross log-likelihood ratio defined as:

$$\mathrm{clr}(c_i,c_j) = \frac{1}{n_i}\log\frac{f(x_i|M_j)}{f(x_i|B)} + \frac{1}{n_j}\log\frac{f(x_j|M_i)}{f(x_j|B)}$$

where f(x_i|M_j) is the likelihood of the data x_i from cluster c_i given the model M_j of cluster c_j, and B is the background model. The clustering stops when the cross log-likelihood ratio of every remaining cluster pair is below a threshold δ estimated on development data. In a final post-processing stage, the output of the transcription system is used to filter out short-duration silence segments that are not detected by the initial speech detection step.
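
The following sketch shows how a cross log-likelihood ratio of this kind can be computed. It assumes the commonly used frame-normalized form, in which each cross term is divided by the number of frames of the cluster being scored. For brevity, a single diagonal Gaussian stands in for each MAP-adapted GMM and for the UBM, and the models are given fixed parameters instead of being trained; `diag_gauss` and `clr` are hypothetical names, and the data are synthetic.

```python
import numpy as np

def diag_gauss(mean, var):
    """Return a function giving per-frame log-likelihoods under a diagonal Gaussian."""
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)

    def loglik(frames):
        return -0.5 * np.sum(
            np.log(2 * np.pi * var) + (frames - mean) ** 2 / var, axis=1)

    return loglik

def clr(x_i, x_j, loglik_i, loglik_j, loglik_ubm):
    """Cross log-likelihood ratio between clusters i and j."""
    # Averaging per-frame log-likelihood differences gives
    # (1/n) * log(f(x|M) / f(x|B)), assuming independent frames.
    cross_i = np.mean(loglik_j(x_i) - loglik_ubm(x_i))
    cross_j = np.mean(loglik_i(x_j) - loglik_ubm(x_j))
    return cross_i + cross_j

# Stand-in models (assumption: fixed parameters, not MAP-adapted GMMs).
ubm = diag_gauss([0.0, 0.0], [5.0, 5.0])
model_a = diag_gauss([2.0, 2.0], [1.0, 1.0])
model_b = diag_gauss([-2.0, -2.0], [1.0, 1.0])

rng = np.random.default_rng(0)
x_a1 = rng.normal(2.0, 1.0, size=(400, 2))   # speaker A, first cluster
x_a2 = rng.normal(2.0, 1.0, size=(400, 2))   # speaker A, second cluster
x_b = rng.normal(-2.0, 1.0, size=(400, 2))   # speaker B

same = clr(x_a1, x_a2, model_a, model_a, ubm)
diff = clr(x_a1, x_b, model_a, model_b, ubm)
print(same > diff)
```

Two clusters from the same speaker score well under each other's model relative to the background model, so their ratio is high; clusters from different speakers score poorly and their ratio is low. An agglomerative loop would merge the highest-scoring pair and stop once no pair exceeds the threshold δ.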

Figure: Architecture of the baseline partitioning system (left) and of the multi-stage diarization system (right).

Results and prospects

Speaker diarization performance is measured via an optimum one-to-one mapping between the reference speaker IDs and the hypothesis speaker IDs. The primary metric is the overall speaker diarization error rate (DER), which is the sum of the missed speech, false alarm speech and speaker error rates. To analyze the performance of the speaker clustering methods more closely, the average frame-level cluster purity and cluster coverage are also used. Cluster purity is defined as the ratio between the number of frames from the dominant speaker in a cluster and the total number of frames in the cluster. Cluster coverage accounts for the dispersion of a given speaker's data across clusters.
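
A small example makes the two clustering measures concrete. Purity follows the definition given above. For coverage, the sketch assumes the symmetric definition: for each speaker, the share of that speaker's frames that fall in the cluster holding most of them. The function name `purity_and_coverage` and the label sequences are hypothetical.

```python
from collections import Counter

def purity_and_coverage(ref, hyp):
    """Average frame-level cluster purity and coverage (assumed symmetric definition).

    ref: reference speaker label for each frame
    hyp: hypothesis cluster label for each frame
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    # Number of frames shared by each (cluster, speaker) pair.
    overlap = Counter(zip(hyp, ref))
    dominant_in_cluster = {}
    largest_for_speaker = {}
    for (cluster, speaker), n in overlap.items():
        dominant_in_cluster[cluster] = max(dominant_in_cluster.get(cluster, 0), n)
        largest_for_speaker[speaker] = max(largest_for_speaker.get(speaker, 0), n)
    total = len(ref)
    purity = sum(dominant_in_cluster.values()) / total
    coverage = sum(largest_for_speaker.values()) / total
    return purity, coverage

# Speaker A is split over two clusters, speaker B is kept in one.
ref = ["A", "A", "A", "A", "B", "B", "B", "B"]
hyp = [1, 1, 2, 2, 3, 3, 3, 3]
purity, coverage = purity_and_coverage(ref, hyp)
print(purity, coverage)
```

In this toy case every cluster contains a single speaker, so purity is perfect, while speaker A is split across two clusters, which lowers coverage. This is the pattern described for the baseline partitioner: high purity with comparatively poor coverage.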

The experiments were conducted on the US English data used in NIST RT-04f [1] and on the French data from the ESTER broadcast news evaluation [2]. The development database (dev1) used in English RT-04f consists of 6 audio files recorded in February 2001. The RT-04f test database consists of 12 audio files recorded in December 2003. All of these audio files last 30 minutes and were extracted from different US television broadcast news shows. The ESTER test database contains 18 audio files from the 'France Inter', 'France Info', RFI, RTM, 'France Culture' and 'Radio Classique' radio stations, with a large variability in audio file duration (from 10 minutes to 1 hour).

Several configurations were tested for the systems: c-std (the standard partitioner), c-bic (BIC clustering), c-sid (BIC clustering followed by the speaker identification clustering stage) and p-asr (c-sid followed by the transcription-based post-processing). Unless otherwise specified, the configuration used is the one that gave the best result on the development data, i.e. λ=5.5 for c-bic and λ=3.5, δ=0.1 for c-sid and p-asr. As expected, the standard partitioner c-std in its default configuration provides a high purity but a relatively poor coverage, resulting in a high overall diarization error of over 30% on the dev1 data (cf. Table 1). Tuning the penalties α and β reduces this error to just below 25%. The c-bic system also provides a high purity, with a much better coverage (97% and 90% respectively), reducing the overall error rate by almost 50%. The c-sid system achieves a further large increase in coverage, resulting in an overall error rate of about 7%, a reduction of almost 50% compared to the c-bic system.

System                    Cluster purity   Cluster coverage   Overall DER
RT-04f dev1 dataset
c-std (α=β=160)           95.0%            71.6%              32.3%
c-std (α=β=230)           90.6%            82.1%              24.8%
c-bic (λ=5.5)             97.1%            90.2%              13.2%
c-sid (λ=3.5, δ=0.1)      97.9%            95.8%               7.1%

Table 1: Performance of the c-std, c-bic and c-sid systems on the RT-04f development data (dev1).

The results on the evaluation data are given in Table 2, with the settings optimized on the development data. On the RT-04f test data, the p-asr system provides an overall diarization error reduction of up to 50% relative to a standard BIC clustering. On the ESTER test data, the overall diarization error was reduced from 13.8% for the c-bic system to 11.5% for the c-sid system. Post-evaluation experiments show that the c-sid system performs even better on the ESTER test data with δ=2.0 (9.1% overall diarization error); this result is comparable with the results obtained on the RT-04f test data.

System              Missed speech   False alarm speech   Speaker error   Overall DER
RT-04f test dataset
c-bic               0.4%            1.8%                 14.8%           17.0%
c-sid (δ=0.1)       0.4%            1.8%                  6.9%            9.1%
p-asr               0.6%            1.1%                  6.8%            8.5%
c-sid (δ=0.4)*      0.4%            1.8%                  6.0%            8.2%
ESTER test dataset
c-bic               0.7%            1.0%                 12.1%           13.8%
c-sid (δ=1.5)       0.7%            1.0%                  9.8%           11.5%
c-sid (δ=2.0)*      0.7%            1.0%                  7.4%            9.1%

Table 2: Performance of the c-bic, c-sid and p-asr systems on the evaluation data of RT-04f and ESTER (* denotes post-evaluation results).
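
As a quick arithmetic check, the overall DER in Table 2 is the sum of the three component rates, and the relative reductions quoted in the text follow directly from it. The sketch below recomputes both from the RT-04f test rows of Table 2; the helper names `overall_der` and `relative_reduction` are hypothetical.

```python
def overall_der(missed, false_alarm, speaker_error):
    """Overall DER as the sum of its three components (all in percent)."""
    return missed + false_alarm + speaker_error

def relative_reduction(baseline, improved):
    """Relative error reduction, in percent of the baseline error."""
    return 100.0 * (baseline - improved) / baseline

# (missed speech, false alarm speech, speaker error) from the RT-04f test rows.
rt04f = {
    "c-bic": (0.4, 1.8, 14.8),
    "c-sid": (0.4, 1.8, 6.9),
    "p-asr": (0.6, 1.1, 6.8),
}

der = {name: overall_der(*parts) for name, parts in rt04f.items()}
for name, value in der.items():
    reduction = relative_reduction(der["c-bic"], value)
    print(f"{name}: {value:.1f}% DER, {reduction:.1f}% below c-bic")
```

For p-asr this reproduces the reduction of about 50% relative to c-bic stated above. Note that p-asr differs from c-sid mainly in the speech detection components: the post-processing lowers false alarm speech at the cost of slightly more missed speech.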

The multi-stage system was shown to perform much better than the baseline audio partitioning system on the speaker diarization task. A relative error reduction of over 70% (from 24.8% for the baseline system to 7.1% for the c-sid system) was achieved on the RT-04f development data. This system obtained the best speaker diarization performance in both the RT-04f and the ESTER evaluations by a significant margin. This dramatic improvement over the baseline system results from several changes: the combination of BIC clustering with clustering based on state-of-the-art speaker recognition methods, each focusing on a different acoustic aspect, with more complex modeling in the second stage; and the use of acoustic channel normalization methods suited to speaker identification.

Future work will focus on improving the robustness and the efficiency of the system. It was observed that the clustering threshold needs to be tuned according to the duration and the type of the audio document, and that the system still shows a large variability across individual shows. Statistically consistent results can only be obtained with a large number of files. Finally, most speaker diarization systems rely on a purely acoustic segmentation and clustering, whereas an essential part of the information in speech is of a linguistic nature, and in TV and radio shows most speakers are introduced and identified by name. Combining the acoustic information with the linguistic layer would improve the robustness of a speaker diarization system and make the diarization output more exploitable by a human reader.

References

[1] NIST (2004). Fall 2004 Rich Transcription (RT-04F) Evaluation Plan.
[2] S. Galliano, E. Geoffrois, D. Mostefa, K. Choukri, J.-F. Bonastre and G. Gravier (2005). The ESTER Phase II Evaluation Campaign for the Rich Transcription of French Broadcast News, Proc. InterSpeech'05, pp. 1149-1152.
[3] J.-L. Gauvain, L. Lamel and G. Adda (1998). Partitioning and Transcription of Broadcast News Data, Proc. ICSLP'98.
[4] X. Zhu, C. Barras, S. Meignier and J.-L. Gauvain (2005). Combining Speaker Identification and BIC for Speaker Diarization, Proc. InterSpeech'05, pp. 2441-2444.
[5] J. Pelecanos and S. Sridharan (2001). Feature warping for robust speaker verification, Proc. ISCA Odyssey Workshop.