Speaker-Aware Training of LSTM-RNNs for Acoustic Modelling

Dong Yu1, Khe Chai Sim2, Liang Lu3, Souvik Kundu2, Tian Tan4, Xiong Xiao5, Yanmin Qian4, Yu Zhang6

  • 1Microsoft Research
  • 2National University of Singapore
  • 3University of Edinburgh
  • 4Shanghai Jiao Tong University
  • 5Nanyang Technological University
  • 6Massachusetts Institute of Technology

Details

13:30 - 15:30 | Tue 22 Mar | Poster Area H | SP-P1.3

Session: Acoustic Model Adaptation for Speech Recognition I

Abstract

Long Short-Term Memory (LSTM) is a particular type of recurrent neural network (RNN) that can model long-term temporal dynamics. Recently, it has been shown that LSTM-RNNs can achieve higher recognition accuracy than deep feed-forward neural networks (DNNs) in acoustic modelling. However, speaker adaptation for LSTM-RNN based acoustic models has not been well investigated. In this paper, we study speaker-aware training of LSTM-RNNs, which incorporates speaker information during model training to normalise the speaker variability. We first present several different model architectures for this purpose, and then experimentally evaluate three different types of speaker representations, i.e., i-vectors, bottleneck speaker vectors and speaking rate. Furthermore, to factorise the variability in the acoustic signals caused by speakers and phonemes respectively, we investigate speaker-aware and phone-aware joint training in the framework of multi-task learning. On the AMI meeting speech transcription task, speaker-aware training of LSTM-RNNs reduces the word error rate by 6.5% relative over a very strong LSTM-RNN baseline that uses fMLLR features.
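As illustration (not taken from the paper itself): the simplest speaker-aware architecture of the kind described above appends a fixed per-speaker representation, such as an i-vector, to every acoustic frame before the recurrent layers. The PyTorch sketch below shows that idea; the class name and all dimensions (40-dim acoustic features, 100-dim speaker vectors, 512 hidden units, 4000 output states) are hypothetical, chosen only for the example.

```python
import torch
import torch.nn as nn

class SpeakerAwareLSTM(nn.Module):
    """Minimal sketch of a speaker-aware LSTM acoustic model:
    a speaker vector (e.g. an i-vector) is concatenated to every
    input frame. Dimensions are illustrative, not from the paper."""

    def __init__(self, feat_dim=40, spk_dim=100, hidden=512, n_states=4000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + spk_dim, hidden,
                            num_layers=3, batch_first=True)
        self.output = nn.Linear(hidden, n_states)  # per-frame state logits

    def forward(self, feats, spk_vec):
        # feats:   (batch, time, feat_dim) acoustic frames
        # spk_vec: (batch, spk_dim), one vector per utterance/speaker
        T = feats.size(1)
        spk = spk_vec.unsqueeze(1).expand(-1, T, -1)  # tile over time
        x = torch.cat([feats, spk], dim=-1)           # frame + speaker info
        h, _ = self.lstm(x)
        return self.output(h)

# usage: a batch of 8 utterances, 200 frames each
model = SpeakerAwareLSTM()
logits = model(torch.randn(8, 200, 40), torch.randn(8, 100))
print(logits.shape)  # torch.Size([8, 200, 4000])
```

The paper's multi-task variant would additionally attach a second output head predicting speaker or phone targets and sum the two losses during training; the single-head sketch above only captures the input-concatenation scheme.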