Transcribing Lectures and Seminars

Lori Lamel, Gilles Adda, Claude Barras, Eric Bilinski, Jean-Luc Gauvain, Gary Leung, Holger Schwenk, Xuan Zhu

Object

This work aims to provide support for on-line and off-line services for indexing lectures and seminars as part of the EC FP6 Integrated Project Computers in the Human Communication Loop (CHIL), which is exploring new paradigms for human-computer interaction. Automatic methods can generate a wealth of annotations, enabling users to search the audio data to find talks on specific topics or by certain speakers.

Description

In order to develop services that are provided in an unobtrusive manner so as to suit human needs, the CHIL partners are developing robust, multi-modal perceptual user interfaces which can track and identify people, recognize what they are doing and take appropriate actions based on the context. To use the project terms, the goal is to model the "Who, Where, What, Why and How of Human activities and communication." At LIMSI we are developing technologies for audio-based speech activity detection, speaker recognition and tracking (Who and Where), automatic speech recognition and the extraction of linguistic meta-data (What), and topic detection and emotion recognition (How and Why). The general task of transcribing lectures and seminars is a challenging one, combining the difficulties encountered in processing spontaneous speech with those of far-field speech recognition.

There are different technological constraints for on-line and off-line services. For on-line services, the lecture must be transcribed and annotated in close to real time, while the lecture is happening. Such an interactive application would allow latecomers to catch up on what was presented earlier in the talk, by reading either the transcript or an automatically created summary. If someone needs to step out of the lecture for a few minutes, the service would allow the person to scan the missed portion. There are many envisageable off-line applications which could benefit from automatic transcription, annotation, indexing and retrieval. These technologies could be used to archive all public presentations (conferences, workshops, lectures) for future viewing and selective access. Automatic techniques can provide a wealth of annotations, enabling users to search the audio data to find talks on specific topics or by certain speakers. Given the large number of parallel oral sessions at most major conferences, such services could allow attendees to interactively access talks they were unable to attend.

A transcription system for lectures and seminars for off-line applications was developed, using primarily publicly available corpora for acoustic and language model training, including the ICSI, ISL and NIST meeting corpora (distributed by LDC) and the TED recordings of presentations at Eurospeech'93 in Berlin [1]. The acoustic models were estimated on a total of about 97 hours of audio data. The language models were trained on about 35.4M words of proceedings texts and 1M words of manual transcriptions of the audio training data.

Results and prospects

The recognition word list was selected from the audio transcripts and the proceedings texts. There are 20k distinct words in the audio transcripts, which results in an out-of-vocabulary (OOV) rate of about 1.3%. Adding words from the textual sources to form a 35k wordlist reduced the OOV rate to about 0.2%. For comparison, a 65k broadcast news wordlist had an OOV rate of 6-8% on the test data. Bigram, trigram and fourgram language models were estimated on each of the four text sources and interpolated, with interpolation weights of about 0.3 for the texts and 0.1 for CTS. With the 4-gram language model, the resulting perplexities on the Jun04 and Jan05 data are 97.6 and 107.1, respectively. The most important text contribution comes from the CHIL development transcriptions, which give a large drop in perplexity, particularly for the Jun04 data (without these transcripts, the 4-gram perplexity is 127.6 on the Jun04 data and 122.1 on the Jan05 data).
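The three quantities discussed above (OOV rate, linear interpolation of component language models, and perplexity) can be sketched as follows. This is only an illustrative sketch: the function names and the toy inputs are hypothetical, and the code is not the toolkit used to build the models reported here.

```python
import math


def oov_rate(tokens, vocabulary):
    """Percentage of running words in `tokens` that are not in `vocabulary`."""
    if not tokens:
        raise ValueError("empty token sequence")
    missing = sum(1 for word in tokens if word not in vocabulary)
    return 100.0 * missing / len(tokens)


def interpolate(component_probs, weights):
    """Linearly interpolate the probabilities that several component
    language models assign to the same word in the same context."""
    if len(component_probs) != len(weights):
        raise ValueError("one weight per component model is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(w * p for w, p in zip(weights, component_probs))


def perplexity(word_probs):
    """Perplexity of a text given the model probability of each word:
    exp of the negative mean log-probability."""
    if not word_probs:
        raise ValueError("empty probability sequence")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


if __name__ == "__main__":
    # Toy data, not taken from the corpora described in the text.
    print(oov_rate(["the", "far", "field", "beamformer"], {"the", "far", "field"}))
    print(interpolate([0.1, 0.3], [0.5, 0.5]))
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

In practice the interpolation weights are usually tuned to minimize perplexity on held-out development text, which is one plausible reason the development transcriptions have such a visible effect on the figures above.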

Experimental results are reported on two sets of ISL seminars. The seminars were recorded with both near and far-field microphones, including a microphone array. The first set, comprising recordings from 7 seminars (7 different speakers, all with German accents), was used in the June 2004 early evaluation. Each seminar was split into four 5-minute segments, 2 for development and 2 for test. The development and test subsets each contain 1.2 hours of speech. The second set, used for the January 2005 technology benchmark, comprises five seminars (5 different speakers, with German, American, Italian and Indian accents). Two of the five seminars were split into development and test portions, and the remaining three were used only for testing. For this test there are about 0.75 hours of development data and 2.1 hours of test data. Speech recognition tests were carried out on both the close-talking microphone (CTM) data and the far-field microphone data, using manual segmentations. For the far-field task, the data from the individual microphone channels could be used, as well as the result of delay-and-sum beamforming performed at UKA [2].
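For readers unfamiliar with delay-and-sum beamforming, the idea can be sketched as below. This is a generic textbook illustration, not the UKA implementation: it assumes the per-channel delays are already known, non-negative and expressed in whole samples, whereas a real system has to estimate them from the array geometry or from the signals.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its (assumed known) integer sample delay
    and average the aligned samples.

    channels: list of sample sequences, one per microphone
    delays:   non-negative delay in samples of each channel
              relative to the earliest-arriving channel
    """
    if not channels or len(channels) != len(delays):
        raise ValueError("need one delay per channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    # Only keep the span where every shifted channel still has samples.
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    if usable <= 0:
        return []
    return [
        sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(usable)
    ]


if __name__ == "__main__":
    source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
    mic_a = source            # arrives first
    mic_b = [0.0] + source    # same signal, one sample later
    print(delay_and_sum([mic_a, mic_b], [0, 1]))
```

Because the aligned speech adds coherently while uncorrelated noise does not, the averaged signal generally has a better signal-to-noise ratio than any single far-field channel.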

The overall results are summarized in Table 1 for the two data sets. On the Jun04 data, the CHIL primary system obtained a word error rate of 26.2%, compared with the 42.2% obtained with the LIMSI RT04 BN transcription system. The effect of adding just a small amount of speech (1 hour total) from the test speakers to the almost 100 hours of other data can be seen by comparing the CTM primary and no-dev systems. There is a 13% relative gain on the Jun04 data, where all seminars had specified development segments, and 9% on the Jan05 data, where only two of the five seminars had development portions. The last entry gives the word error rate on the beam-formed microphone array data, which is about twice that obtained on the close-talking microphone data.

System          Jun04   Jan05
RT04 BN           -      42.2
CTM, primary    26.2     23.6
CTM, no dev     30.2     26.0
Beam, primary   57.6     51.9

Table 1: Overall word error rates on the Jun04 and Jan05 data.
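The relative gains and the beam-formed versus close-talking ratio quoted in the text can be re-derived from the Table 1 figures. The short check below is a sketch; the dictionary keys and function names are illustrative choices, and only the numbers come from the table.

```python
# Word error rates (%) copied from Table 1.
WER = {
    "Jun04": {"ctm_primary": 26.2, "ctm_no_dev": 30.2, "beam_primary": 57.6},
    "Jan05": {"ctm_primary": 23.6, "ctm_no_dev": 26.0, "beam_primary": 51.9},
}


def relative_reduction(baseline, improved):
    """Relative error reduction in percent, going from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline


def summarize(wer):
    """Return, per data set, the relative gain from adding the development
    data and the ratio of beam-formed to close-talking word error rate."""
    summary = {}
    for name, row in wer.items():
        gain = relative_reduction(row["ctm_no_dev"], row["ctm_primary"])
        ratio = row["beam_primary"] / row["ctm_primary"]
        summary[name] = (gain, ratio)
    return summary


if __name__ == "__main__":
    for name, (gain, ratio) in summarize(WER).items():
        print(f"{name}: relative gain {gain:.1f}%, beam/CTM ratio {ratio:.2f}")
```

The computed gains come out at roughly 13% and 9%, and the beam-formed error rate at about 2.2 times the close-talking one on both data sets, in line with the statements above.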

We are also addressing the problem of speech activity detection (SAD), which is a useful preprocessing step prior to further processing such as automatic speech recognition, speaker identification and verification, and speaker localization. SAD is performed using two Gaussian mixture models (GMMs), one for speech and one for non-speech. A Viterbi decoder then provides the speech/non-speech segmentation. The balance between the Speech Detection Error Rate (SDER) and the Non-speech Detection Error Rate (NDER) is set by a transition penalty between the two models.
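The decoding step can be illustrated with a two-state Viterbi pass over per-frame log-likelihoods. This is a minimal sketch and not the LIMSI implementation: it assumes the two GMMs have already been evaluated on every frame, and it uses a single symmetric switching penalty, whose value here is arbitrary.

```python
def viterbi_sad(speech_ll, nonspeech_ll, penalty=5.0):
    """Label each frame as speech (1) or non-speech (0).

    speech_ll, nonspeech_ll: per-frame log-likelihoods under the speech
        and non-speech models (assumed to be computed elsewhere)
    penalty: cost subtracted from the path score at every state change
    """
    if len(speech_ll) != len(nonspeech_ll):
        raise ValueError("both models must score the same frames")
    n_frames = len(speech_ll)
    if n_frames == 0:
        return []
    # State 0 = non-speech, state 1 = speech.
    frame_ll = list(zip(nonspeech_ll, speech_ll))
    score = [frame_ll[0][0], frame_ll[0][1]]
    backpointers = []
    for t in range(1, n_frames):
        new_score = [0.0, 0.0]
        pointers = [0, 0]
        for state in (0, 1):
            stay = score[state]
            switch = score[1 - state] - penalty
            if stay >= switch:
                best, previous = stay, state
            else:
                best, previous = switch, 1 - state
            new_score[state] = best + frame_ll[t][state]
            pointers[state] = previous
        backpointers.append(pointers)
        score = new_score
    # Trace the best path back from the final frame.
    state = 1 if score[1] > score[0] else 0
    labels = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        labels.append(state)
    labels.reverse()
    return labels


if __name__ == "__main__":
    speech = [-1.0, -1.0, -3.0, -1.0, -1.0]
    nonspeech = [-2.0, -2.0, -1.0, -2.0, -2.0]
    print(viterbi_sad(speech, nonspeech, penalty=0.0))
    print(viterbi_sad(speech, nonspeech, penalty=5.0))
```

With no penalty the decoder follows the frame-by-frame maximum and flips to non-speech for the single dip in the middle; with a larger penalty the isolated dip is absorbed into the surrounding speech segment. Making the penalty asymmetric would be one way to trade SDER against NDER.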

With the standard LIMSI SAD system used for the Jun04 evaluation, a low SDER of 3.3% was obtained on the close-talking microphone signal, but with a high NDER, as shown in Table 2. For the Jan05 evaluation, new GMMs were trained on available meeting data (ICSI, ISL, NIST). This system has a 35% relative reduction in Average Detection Error Rate (ADER) on the CTM data compared with the original system. However, no improvement in ADER is observed on the far-field data; better matched training data are needed to improve performance in the far-field condition.

System   Channel condition   SDER   NDER   ADER
Jun04    CTM                  3.3   35.3   19.3
Jan05    CTM                  8.4   16.0   12.2
Jun04    ARR                 24.8    4.9   14.9
Jan05    ARR                 14.6   15.5   15.1

Table 2: Speech activity detection error rates on the Jun04 and Jan05 data.
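The ADER column of Table 2 is consistent with an unweighted mean of SDER and NDER. The text does not state the definition explicitly, so that reading is an assumption; the check below simply compares it against the reported values.

```python
# Rows copied from Table 2: system, channel, SDER, NDER, reported ADER (all in %).
TABLE2 = [
    ("Jun04", "CTM", 3.3, 35.3, 19.3),
    ("Jan05", "CTM", 8.4, 16.0, 12.2),
    ("Jun04", "ARR", 24.8, 4.9, 14.9),
    ("Jan05", "ARR", 14.6, 15.5, 15.1),
]


def average_detection_error(sder, nder):
    """Assumed definition: unweighted mean of the two error rates."""
    return (sder + nder) / 2.0


def max_deviation(rows):
    """Largest gap between the assumed definition and the reported ADER."""
    return max(
        abs(average_detection_error(sder, nder) - reported)
        for _, _, sder, nder, reported in rows
    )


if __name__ == "__main__":
    for system, channel, sder, nder, reported in TABLE2:
        computed = average_detection_error(sder, nder)
        print(f"{system} {channel}: computed {computed:.2f}, reported {reported}")
```

All four rows agree with the reported figures to within rounding at one decimal place.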

The purpose of acoustic speaker recognition in the CHIL project is to recognize the identity of speakers, mainly the presenters of seminars. For the experiments, 15 seminars were available, and the presenters of the seminars were the target speakers for the speaker identification task. The speaker recognition system is a standard GMM-based system. A gender-independent universal background model (UBM) with 2048 Gaussian components was trained using 7 hours of data from the ICSI, ISL and NIST meeting corpora and the TED speech corpus. Each target speaker model was trained by maximum a posteriori (MAP) adaptation of the Gaussian means of the UBM.
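MAP adaptation of the UBM means can be sketched as follows. The text specifies only that the Gaussian means are MAP-adapted; the relevance-factor formulation used here is the commonly used one for GMM-UBM systems and is an assumption, as are the function names, the diagonal covariances and the relevance value of 16.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Return speaker-adapted means; weights and variances stay those of the UBM.

    Each adapted mean is (sum of posterior-weighted frames + relevance * UBM mean)
    divided by (posterior count + relevance), so components that see little
    speaker data stay close to the UBM.
    """
    n_comp = len(weights)
    dim = len(means[0])
    counts = [0.0] * n_comp
    sums = [[0.0] * dim for _ in range(n_comp)]
    for x in frames:
        log_post = [
            math.log(weights[c]) + log_gauss_diag(x, means[c], variances[c])
            for c in range(n_comp)
        ]
        top = max(log_post)  # subtract the max for numerical stability
        post = [math.exp(lp - top) for lp in log_post]
        norm = sum(post)
        for c in range(n_comp):
            gamma = post[c] / norm
            counts[c] += gamma
            for d in range(dim):
                sums[c][d] += gamma * x[d]
    return [
        [
            (sums[c][d] + relevance * means[c][d]) / (counts[c] + relevance)
            for d in range(dim)
        ]
        for c in range(n_comp)
    ]


if __name__ == "__main__":
    # Toy 1-D, 2-component "UBM"; real systems use many more components.
    ubm_weights = [0.5, 0.5]
    ubm_means = [[0.0], [10.0]]
    ubm_vars = [[1.0], [1.0]]
    speaker_frames = [[1.0]] * 16
    print(map_adapt_means(speaker_frames, ubm_weights, ubm_means, ubm_vars))
```

In the toy example the first component, which accounts for essentially all of the speaker frames, moves halfway towards the data, while the second component remains at its UBM value.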

Experimental results in speaker identification are given in Table 3 for different combinations of duration and channel conditions. A clear degradation in identification performance due to channel mismatch is observed. As expected, performance is generally better when the speech duration is increased, and better on the close-talking microphone than on the microphone array.

                                                   Test duration (sec)
Train duration (sec)   Train data   Test data     60     30     10      5      1
60                     CTM          CTM          0.0    0.0    0.1    1.7   17.6
60                     CTM          ARR         14.5   17.3   37.2   47.1   70.9
60                     ARR          CTM         20.9   19.0   15.9   23.0   55.5
60                     ARR          ARR          9.1    8.2   12.9   17.2   44.0
30                     CTM          CTM          0.0    0.0    0.1    2.0   19.2
30                     CTM          ARR         16.4   20.9   33.7   43.0   67.7
30                     ARR          CTM         13.6   14.3   18.6   28.7   63.7
30                     ARR          ARR          1.8    0.0    4.7   11.3   52.8

Table 3: Speaker identification errors (in %).

The general task of transcribing lectures and seminars is a challenging one, combining the difficulties encountered in processing spontaneous speech with those of far-field speech recognition. It is our belief that most of the techniques which improve recognition of CTM data will also improve far-field speech recognition. Future work will investigate automatic partitioning of the data into speaker turns and multi-microphone training to improve far-field recognition.

References

[1] L.F. Lamel, F. Schiel, A. Fourcin, J.J. Mariani, H. Tillmann (1994). The Translanguage English Database (TED), ICSLP'94, 4:1795-1798, Yokohama, September.
[2] D. Macho, J. Padrell et al. (2005). First experiments of automatic speech activity detection, source localization and speech recognition in the CHIL project, Workshop on Hands-Free Speech Communication and Microphone Arrays, Rutgers University, Piscataway, NJ.
[3] L. Lamel, H. Schwenk, J.L. Gauvain, G. Adda, E. Bilinski (2005). Improvements in Transcribing Lectures and Seminars, Proc. MLMI'05, Edinburgh, July.
[4] L. Lamel, G. Adda, E. Bilinski, J.L. Gauvain (2005). Transcribing Lectures and Seminars, Proc. Eurospeech'05, Lisbon, September.