Proceedings 2001


Improvement and Evaluation of an On-line Auto Attendant System

Woei-Chyang SHIEH, Shih-Chieh CHIEN, Jyh-Shing Hsu, Sen-Chia CHANG

 

 

1        Introduction

Auto attendant systems allow callers to reach a person simply by speaking the person's name. Such systems are widely used in corporate environments. We introduced an auto attendant system for our division in January 2000 (Jou et al., 2000), with a directory of 1000 names. Owing to its high recognition accuracy and convenience, the division system was a great success, and we were asked to install an auto attendant system for the whole institute, with a name list of more than 6000 entries. Recognizing Chinese names from a list of this size is difficult, because Chinese names typically consist of only 3 syllables and the confusability among names is high. To improve the recognition rate and the correctness of the on-line system, we adopted several approaches, which are described in the following section. The system became popular after these improvements, and the number of incoming calls increased steadily, from an average of 300 calls per day at the beginning to 900 calls per day in March 2001.

 

2        System Improvements

The improvements are as follows. (1) Tone recognition. Tone recognition is added to distinguish the lexical tones of Mandarin Chinese; a multi-layer perceptron (MLP) is used as the tone recognizer. (2) Finer recognition models. To improve the recognition rate, finer recognition models are used, obtained from a better training database, on-line data adaptation, and minimum classification error/generalized probabilistic descent (MCE/GPD) training. (3) Multi-thread. Multi-thread capability is implemented so that the system can serve multiple users at once. (4) Voice mail. The ability to send voice mail is added to the system.

 

2.1        Tone Recognition

Recognizing tones is an important task in our auto attendant system, because Mandarin Chinese is a tonal language. Each Chinese character is pronounced as a monosyllable and carries a tone. In a large-vocabulary Mandarin speech recognition system, some words share the same phones and are distinguished only by tone, especially when the keywords are mostly Chinese names. Although Mandarin Chinese has only 5 tones, recognizing each tone correctly is difficult. In our system, an MLP is used for tone recognition. The MLP has 18 input nodes, representing 18 pitch parameters of a syllable, 100 nodes in the hidden layer, and 5 output nodes, representing the 5 Mandarin tones.
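
The 18-100-5 topology described above can be sketched as a plain forward pass. This is only an illustration under stated assumptions: the paper does not give the activation functions or the trained weights, so the tanh hidden units, the softmax output, and the random initial weights below are all placeholders.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 18, 100, 5  # layer sizes stated in the text


def init_layer(n_in, n_out, rng):
    """Return (weights, biases); small random weights stand in for trained values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(weights, biases, x):
    """Compute W x + b for one layer."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Normalize raw outputs into scores that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def tone_scores(pitch_features, hidden, output):
    """Forward pass: 18 pitch parameters -> 100 hidden units -> 5 tone scores."""
    if len(pitch_features) != N_INPUT:
        raise ValueError("expected 18 pitch parameters per syllable")
    h = [math.tanh(v) for v in affine(*hidden, pitch_features)]  # assumed tanh units
    return softmax(affine(*output, h))  # assumed softmax output layer


rng = random.Random(0)
hidden = init_layer(N_INPUT, N_HIDDEN, rng)
output = init_layer(N_HIDDEN, N_OUTPUT, rng)
scores = tone_scores([0.0] * N_INPUT, hidden, output)  # one score per tone
```

With a trained network, the largest of the five scores would indicate the most likely tone of the syllable.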

 

The tone score can be combined with the keyword recognition score and the utterance verification score. The combined score achieves better recognition accuracy than the recognition score alone. The combined score recognition system is shown in Figure 1. First, feature parameters are extracted from the speech signal. Next, the keyword search module generates multiple keyword candidates. Viterbi decoding segments each candidate into sub-syllable units. The utterance verification and tone recognition modules use this sub-syllable segmentation to produce a verification score and a tone score. A combined score of the verification, tone, and recognition scores is then computed for each candidate, and a final accept/reject decision is made for each candidate according to its combined score.

 
 

 

[Figure 1: block diagram. Recoverable labels: Speech, Keyword Search, Multi Candidates, Viterbi Decoding, Utterance Verification, Verification Score, Tone Score, Recognition Score, Decision, Output]

 

Fig. 1 Combined score recognition system
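
The decision stage can be illustrated with a short sketch. The paper does not state how the three scores are combined, so the weighted sum, the weight values, the threshold, and the assumption that all scores are normalized to the range 0 to 1 are hypothetical. The limit of three output candidates follows the system description in Section 3.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    recognition: float   # keyword recognition score
    verification: float  # utterance verification score
    tone: float          # tone score from the tone recognizer


# Hypothetical values; the paper selects such parameters on an evaluation set.
WEIGHTS = {"recognition": 0.5, "verification": 0.3, "tone": 0.2}
THRESHOLD = 0.6


def combined_score(c, weights=WEIGHTS):
    """Weighted sum of the three scores (an assumed form of combination)."""
    return (weights["recognition"] * c.recognition
            + weights["verification"] * c.verification
            + weights["tone"] * c.tone)


def decide(candidates, threshold=THRESHOLD, max_output=3):
    """Accept candidates whose combined score reaches the threshold, best first."""
    accepted = [c for c in candidates if combined_score(c) >= threshold]
    accepted.sort(key=combined_score, reverse=True)
    return [c.name for c in accepted[:max_output]]


candidates = [
    Candidate("name_a", 0.9, 0.8, 0.7),  # combined 0.83, accepted
    Candidate("name_b", 0.8, 0.4, 0.3),  # combined 0.58, rejected
    Candidate("name_c", 0.3, 0.2, 0.1),  # combined 0.23, rejected
]
accepted = decide(candidates)
```

In this example only name_a passes the threshold. An empty list would correspond to rejecting the utterance.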

2.2        Better training data and MCE/GPD training

The choice of training data has a great influence on system performance. MAT (Mandarin speech data Across Taiwan) is a Mandarin telephone speech database widely used by research institutes in Taiwan. The previous version of our division auto attendant system was trained on MAT800 (Wang, 1997), the first version of the MAT database, which contains 52671 speech data files from 800 speakers. MAT2000 (Wang et al., 2000), which contains 163215 speech data files from 2444 speakers, was released during the development of our institute auto attendant system. MAT2000 is a better database in both quality and quantity, and we used it to improve the performance of our on-line system.

 

MCE/GPD (Juang et al., 1997) is a discriminative training method that minimizes classification error and is an effective way to improve recognition performance. We applied MCE/GPD training to improve system performance further.
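
For readers unfamiliar with MCE training, the following sketch shows the usual form of the misclassification measure and the sigmoid-smoothed loss. It is not the authors' implementation: the function names, the toy scores, and the values of eta and gamma are illustrative assumptions.

```python
import math


def misclassification_measure(scores, correct, eta=1.0):
    """d = -g_correct + (1/eta) * log(mean over competitors of exp(eta * g_j)).

    scores maps each class label to its discriminant score g.
    d < 0 means the correct class wins; d > 0 means a competitor wins.
    """
    competitors = [g for label, g in scores.items() if label != correct]
    m = max(competitors)
    # log-mean-exp, shifted by the maximum for numerical stability
    mean_exp = sum(math.exp(eta * (g - m)) for g in competitors) / len(competitors)
    return -scores[correct] + m + math.log(mean_exp) / eta


def mce_loss(d, gamma=1.0):
    """Smooth approximation of the 0/1 error: near 1 when d > 0, near 0 when d < 0."""
    return 1.0 / (1.0 + math.exp(-gamma * d))


toy_scores = {"a": 2.0, "b": 1.0, "c": 0.0}  # hypothetical discriminant scores

d_right = misclassification_measure(toy_scores, correct="a")  # correct class scores highest
d_wrong = misclassification_measure(toy_scores, correct="c")  # correct class scores lowest
loss_right = mce_loss(d_right)
loss_wrong = mce_loss(d_wrong)
```

Because the smoothed loss is differentiable, GPD can adjust the model parameters step by step along its negative gradient, which is what makes the classification error itself usable as a training criterion.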

2.3        On-line adaptation

The MAT2000 training database is a phonetically balanced telephone speech database. It was not designed for a specific application such as an auto attendant system. The on-line data of the auto attendant system consist mostly of name queries, so we can collect these data and use them to adapt the system to the auto attendant domain.

 

For development purposes, the on-line data are partitioned into a training set, an evaluation set, and a test set. The training set contains 15787 utterances recorded from June 29 to August 31, 2000. The evaluation set contains 4553 utterances collected from September 1 to September 14 and is used to select system parameters such as the verification threshold and the combination weights. The on-line data recorded from September 15 to October 1 form the test set, which is used to measure system performance.
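
The date-based partition can be written down directly. In this sketch the year 2000 for the evaluation and test ranges is inferred from context, and the function name is hypothetical.

```python
from datetime import date

# Inclusive date ranges from the text; the year of the last two sets is inferred.
PARTITIONS = {
    "train": (date(2000, 6, 29), date(2000, 8, 31)),
    "evaluation": (date(2000, 9, 1), date(2000, 9, 14)),
    "test": (date(2000, 9, 15), date(2000, 10, 1)),
}


def partition_of(recorded_on):
    """Return the set name for a recording date, or None if it is outside all ranges."""
    for name, (start, end) in PARTITIONS.items():
        if start <= recorded_on <= end:
            return name
    return None


example = partition_of(date(2000, 9, 10))  # falls in the evaluation range
```

Splitting by time rather than at random keeps the test set strictly later than the data used for training and parameter selection, which mirrors how the deployed system meets new calls.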

 

Both the recognition model and the verification model can be adapted with on-line data. The MCE/GPD method is applied to adapt the recognition model, and the minimum verification error/generalized probabilistic descent (MVE/GPD) method (Sukkar, 1998) is applied to adapt the verification model.

2.4        Others

Some Chinese names have the same pronunciation. To distinguish employees whose names are pronounced identically, division verification is added to the auto attendant system. If the recognition candidates include a division name, utterance verification is applied to the division name. Once the division name is accepted by the verification module, the system accepts only employee names within that division.
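
A minimal sketch of this filtering step follows. The directory layout, the romanized pronunciations, the division labels, and the extensions are all made up for illustration; the paper does not describe its data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    pronunciation: str  # syllable string shared by homophone names (hypothetical format)
    division: str
    extension: str


def resolve(pronunciation, accepted_division, directory):
    """Return the employees matching a recognized pronunciation.

    If a division name was accepted by verification, keep only employees in
    that division; otherwise all homophone matches remain candidates.
    """
    matches = [e for e in directory if e.pronunciation == pronunciation]
    if accepted_division is not None:
        matches = [e for e in matches if e.division == accepted_division]
    return matches


directory = [
    Employee("wang2 xiao3 ming2", "division_a", "1001"),
    Employee("wang2 xiao3 ming2", "division_b", "2002"),
    Employee("li3 da4 wei3", "division_a", "1003"),
]

ambiguous = resolve("wang2 xiao3 ming2", None, directory)         # two homophones
resolved = resolve("wang2 xiao3 ming2", "division_b", directory)  # one employee
```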

 

Multi-thread capability is implemented so that the system can handle multiple users simultaneously, and the ability to send voice mail is provided to make the system more convenient.
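
As a rough illustration of serving several callers at once, the sketch below runs each call session in a worker thread. The handler, the number of lines, and the call identifiers are placeholders, and the original system was not necessarily structured this way.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_call(call_id):
    """Placeholder for one call session (prompt, recognize, transfer or take voice mail)."""
    return f"call {call_id} handled"


def serve(call_ids, max_lines=4):
    """Handle several calls concurrently, with at most max_lines worker threads."""
    with ThreadPoolExecutor(max_workers=max_lines) as pool:
        # map() returns results in the order of call_ids
        return list(pool.map(handle_call, call_ids))


results = serve(range(6))
```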

3        System performance

Figure 2 shows the average number of incoming calls per working day, which increased steadily from 300 in June 2000 to more than 800 in February 2001. Figure 3 shows the monthly call volume, which rose from 6000 to more than 15000 calls per month. In total, the auto attendant system received more than 90000 calls in 8 months.

 

Fig. 2 Average calls per working day

 

The performance on keyword queries is shown in Figure 4. There are three system performance measures for keyword queries: substitution error, false reject, and correct reject. Substitution error means that all output recognition candidates are wrong. Here the system outputs at most three verified name candidates, so that similar names can be included. False reject means that the recognition candidates include the correct keyword, but the keyword is rejected by verification. Correct reject means that all recognition candidates are wrong and all of them are correctly rejected by verification.

Figure 5 shows the system results for out-of-vocabulary (OOV) queries. There are two types of response: correct reject and false alarm. Correct reject means that the system rejects all of the wrong recognition candidates; false alarm means that the system outputs any wrong candidate. In addition, Table 1 compares the keyword query recognition performance of the different versions of the auto attendant system.
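
The outcome categories defined above can be expressed as two small functions. This is one reading of the definitions, with hypothetical function and argument names; in particular, a query whose correct name was recognized but rejected while a wrong name was output is counted here as a substitution error.

```python
def keyword_outcome(truth, recognized, accepted):
    """Classify one keyword query.

    truth:      the name the caller asked for
    recognized: candidates from keyword search, before verification
    accepted:   candidates that passed verification (the system output)
    """
    if truth in accepted:
        return "correct"
    if accepted:
        return "substitution error"  # something was output, but every name is wrong
    if truth in recognized:
        return "false reject"        # correct name was found, then rejected
    return "correct reject"          # only wrong names were found, all rejected


def oov_outcome(accepted):
    """Classify one out-of-vocabulary query: any output at all is a false alarm."""
    return "false alarm" if accepted else "correct reject"


outcomes = [
    keyword_outcome("a", ["a", "b"], ["a"]),  # correct name is output
    keyword_outcome("a", ["b", "c"], ["b"]),  # wrong name is output
    keyword_outcome("a", ["a", "b"], []),     # correct name rejected
    keyword_outcome("a", ["b", "c"], []),     # wrong names rejected
]
```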

 

Fig. 3 Monthly calls

 

Fig. 4 Keyword Queries System Performance

 

 

The auto attendant system was first announced on June 29, 2000. From June 2000 to March 2001, four versions of the system were deployed on-line.

 

The first version was a modification of our previous division auto attendant system. We added multi-thread capability to serve multiple users, the ability to send voice mail, and division verification.

 

The second version was released on Oct. 2, 2000. Tone recognition was introduced in this version, and the recognition score was combined with the verification and tone scores to achieve better recognition performance. With this modification, the top 1 recognition rate for keyword queries improved from 78.8% to 81.0%, as shown in Table 1. The substitution error in Figure 4 was reduced from 10.1% to 9.1%, with the false rejection rate under 3%.

 

The third version was released on Dec. 1, 2000. The better training database MAT2000 replaced MAT800, the MCE/GPD method was used to train the recognition model, and the MVE/GPD method was used to train the verification model. With the new training data and GPD training, performance improved markedly: the top 1 recognition rate rose to 87.5% from 81.0% in the second version, the top 3 recognition rate reached 95.3%, and the substitution error in Figure 4 fell from 9.1% to 3.1%.

 

The last version was released on Feb. 15, 2001. On-line data adaptation was implemented in this version to improve system performance. The top 1 recognition rate reached 89.5%, compared with 87.5% in the third version, and the substitution error rate fell from 3.1% to 2.2%. Verification performance also improved: the false alarm rate was reduced from 76.7% in version 3 to 63.3% in this version, with the false rejection rate under 2%.

 

 

             Top 1    Top 2    Top 3
Version 1    78.8%    86.6%    89.1%
Version 2    81.0%    88.0%    90.7%
Version 3    87.5%    93.4%    95.3%
Version 4    89.5%    95.8%    96.6%

Table 1. Keyword Queries Recognition Rate

 

Fig. 5 OOV Queries Verification Performance

 

 

4        Conclusions

Owing to the convenience and accuracy of the auto attendant system, the number of incoming calls rose steadily from 300 calls per day at the beginning to more than 900 in March 2001. During this period, we made several improvements to provide better service and more accurate system responses. Among them, improved training data had a great influence on system performance: accuracy was improved by using both the better training database MAT2000 and data collected on-line. In addition, tone recognition and MCE/GPD and MVE/GPD training were all important factors in the improvement. After these improvements, the top 1 recognition rate for keyword queries rose from 78.8% to 89.5%, an error reduction of 50.5%, and the substitution error rate fell from 10.1% to 2.2%, an error reduction of 78.2%.
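
The two error reduction figures can be checked with a few lines of arithmetic. The helper name is hypothetical; the input percentages are taken from the text, and the top 1 error rate is taken as 100% minus the top 1 recognition rate.

```python
def relative_reduction(before, after):
    """Relative reduction of an error rate, as a percentage of the starting value."""
    return 100.0 * (before - after) / before


top1_error_before = 100.0 - 78.8  # 21.2% top 1 error in version 1
top1_error_after = 100.0 - 89.5   # 10.5% top 1 error in version 4

top1_reduction = relative_reduction(top1_error_before, top1_error_after)
substitution_reduction = relative_reduction(10.1, 2.2)

print(f"{top1_reduction:.1f}%")          # 50.5%
print(f"{substitution_reduction:.1f}%")  # 78.2%
```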

5        Acknowledgment

This paper is a partial result of the project No. 3XS1B11 conducted by Industrial Technology Research Institute under sponsorship of the Ministry of Economic Affairs, Taiwan, R.O.C.

 

The authors would like to thank the Association for Computational Linguistics and Chinese Language Processing in Taiwan for kindly supplying the database.

 

6        References

Jou, S.-C., Chien, S.-C., Shieh, W.-C., Chen, J.-H., Chang, S.-C., 2000. CCL eAttendant: An on-line Auto Attendant System. In: International Symposium on Chinese Spoken Language Processing, Beijing, China.

Juang, B.-H., Chou, W., Lee, C.-H., 1997. Minimum Classification Error Rate Methods for Speech Recognition. In: IEEE Trans. on Speech and Audio Processing, Vol. 5, No. 3, pp. 257-265, May 1997.

Sukkar, R. A., 1998. Subword-Based Minimum Verification Error (SB-MVE) Training for Task Independent Utterance Verification. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing, pp. 229-232, Washington, USA.

Wang, H.-C., 1997. MAT - A Project to Collect Mandarin Speech Data Through Networks in Taiwan. In: Computational Linguistics and Chinese Language Processing, Vol. 2, No. 1, pp. 73-89, Feb. 1997.

Wang, H.-C., Seide, F., Tseng, C.-Y., Lee, L.-S., 2000. MAT-2000: Design, Collection, and Validation of a Mandarin 2000-Speaker Telephone Speech Database. In: Proceedings of International Conference on Spoken Language Processing, Beijing, China.