The principle behind the phonological rules is to modify the phone network to take phonological variation into account. The rules are applied during both training and recognition and are always optional. Using optional phonological rules during training results in better acoustic models, as they are less ``polluted'' by incorrect transcriptions; their use during recognition reduces the number of mismatches. The rule mechanism allows for generalization and extension. However, care must be taken, as the alternate pronunciations thus generated can cause errors, especially for short words, when the rules are applied too liberally. The use of phonological rules for the RM task has previously been reported by SRI and AT&T; in the case of AT&T, phonological rules were used only with CI phone models.
Phonological rules were added cautiously, avoiding multiple pronunciations for very short words, the deletion of phones in short words (2-3 phones), and the creation of homophones. All the added phones are optional, and phones can be optionally deleted in long words. The phonological rules are applied to the phone graph generated from the baseline lexicon by adding skip arcs to optionally delete phones and by adding phone models for alternate pronunciations and inserted phones. The resulting phone model graph, which is only 12% larger than the original, is used during both training and testing.
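The graph modification described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the state/arc representation, the function names, and the example phone strings are all assumptions made for clarity.

```python
# Sketch: expanding a left-to-right phone graph with optional
# phonological variants. States are integers; an arc is a tuple
# (src_state, phone, dst_state). A `None` phone is an epsilon
# (skip) arc, i.e. an optional deletion.

def build_linear_graph(phones):
    """Build a left-to-right phone graph for one baseline pronunciation."""
    return [(i, p, i + 1) for i, p in enumerate(phones)]

def add_skip_arc(arcs, phone):
    """Add epsilon arcs in parallel with `phone`, making it optionally deletable."""
    skips = [(s, None, d) for (s, p, d) in arcs if p == phone]
    return arcs + skips

def add_alternate(arcs, phone, variant):
    """Add parallel arcs so `variant` is an optional realization of `phone`."""
    alts = [(s, variant, d) for (s, p, d) in arcs if p == phone]
    return arcs + alts

# Illustrative example (phones are ARPAbet-like, chosen for the sketch):
# allow each /t/ to be optionally realized as its voiced counterpart /d/,
# and allow the vowel to be optionally deleted.
arcs = build_linear_graph(["s", "t", "aa", "r", "t"])
arcs = add_alternate(arcs, "t", "d")   # optional voiced variant
arcs = add_skip_arc(arcs, "aa")        # optional deletion
```

Because every added arc runs in parallel with an original one, the baseline pronunciation always remains a path through the graph, which is what makes the rules optional rather than mandatory.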
Some examples of the phonological rules are given in Table 5. These include general rules for well-known variants such as palatalization, glide insertion and gemination, as well as rules to handle allophonic variation using only the reduced phone set. Thus, instead of introducing syllable- or word-final allophones for the voiceless stops, the stops are optionally allowed to be replaced by their voiced counterparts. There are also more specific rules, such as the deletion of the offglide /w/ in the phone sequence /aw/, as found in the word ``how.'' While this is a fairly general phenomenon, in the context of RM this rule becomes very specific to the word sequences ``how much'' and ``how many.''
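Rules of this kind can be thought of as optional rewrite patterns over phone sequences. The sketch below shows one way to generate alternate pronunciations from such patterns; the rule inventory and phone symbols are illustrative stand-ins, not the actual contents of Table 5.

```python
# Sketch: a phonological rule as an optional rewrite of a phone
# subsequence. Applying a rule yields alternate pronunciations that
# would be added to the phone graph alongside the baseline one.

def apply_rule(phones, pattern, replacement):
    """Return all variants of `phones` with one application of the rule
    pattern -> replacement (the baseline pronunciation is kept separately,
    since the rules are optional)."""
    variants = []
    n = len(pattern)
    for i in range(len(phones) - n + 1):
        if tuple(phones[i:i + n]) == pattern:
            variants.append(phones[:i] + list(replacement) + phones[i + n:])
    return variants

# Illustrative palatalization rules across a word boundary,
# e.g. "did you" and "what you" (phone symbols are assumptions):
print(apply_rule(["d", "ih", "d", "y", "uw"], ("d", "y"), ("jh",)))
print(apply_rule(["w", "ah", "t", "y", "uw"], ("t", "y"), ("ch",)))
```

Since each rule only proposes additional pronunciations, the cautious policy described earlier (no deletions in 2-3 phone words, no new homophones) can be enforced by filtering the variants before they are added to the graph.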
Figures 1 and 2 illustrate some acoustic differences motivating the use of phonological rules, taken from the training data. The speaker code is given by the three letters in parentheses. Figure 1 shows examples of acoustic realizations at vowel-vowel word boundaries, where it is common to insert either a glide or a glottal stop to mark the boundary. The leftmost example has a /y/-insertion marking the boundary in ``the average,'' giving the phone sequence /iy@/. The same speaker, however, uses a glottal stop to mark the boundary in ``the AAW,'' even though the phonetic environment is very similar. The semivowels /r,w/ may be inserted in the same way; an example of a /w/-insertion is seen in the rightmost spectrogram in the word sequence ``do any.''
Figure 2 shows some of the variability observed in the realization of stops. The left two spectrograms were taken from the same sentence and show that even in a similar context, the acoustic realization can be very different. The final /t/ in ``chart of'' is manifested as a glottal stop, whereas the final /t/ in ``start at'' is flapped. The spectrogram on the right shows that the final /k/ in ``pacific ocean'' is produced as a /g/. One could argue that this should be considered a speech error; however, the word string is perfectly understood.
The fact that even a single speaker may mark phonetic distinctions in different ways, even in a similar phonetic environment, indicates that CD phones as they are typically defined, even if they are word-position dependent, will still combine allophones which are acoustically very different. (This distinction was referred to as hard vs. soft by Giachin et al.) Therefore, it seems clear that the use of phonological rules during training will result in purer acoustic models, which should improve system performance.
The effects of these developmental changes are summarized in Table 6 for four test sets using sex-dependent models. Phase 1 is prior to the use of the speaker-dependent test data, and Phase 2 is after the errors on these data were analyzed. It can be seen that the error reduction is between 0% and 20%, depending on the test, and that the objective of reducing the difference in performance across tests was achieved.