Speaker Adaptive Training in Deep Neural Networks Using Speaker Dependent Bottleneck Features

Rama Doddipatla

  • Toshiba Research Europe Ltd

Details

13:30 - 15:30 | Tue 22 Mar | Poster Area H | SP-P1.5

Session: Acoustic Model Adaptation for Speech Recognition I

Abstract

The paper proposes an approach to perform speaker adaptive training (SAT) in deep neural networks using a two-stage DNN. The first-stage DNN extracts speaker-dependent bottleneck (SDBN) features by updating the weights of the BN layer with speaker-specific data. Using the SDBN features, a second-stage DNN is trained in the SAT framework. Choosing the BN layer as the speaker-dependent layer, rather than one of the hidden layers, reduces the number of parameters to be tuned with speaker-specific data. Experiments are presented on the Aurora4 task, where the input features are normalised with constrained maximum likelihood linear regression (CMLLR) and speaker information is appended in the form of D-vectors. Following unsupervised adaptation of the BN layer, the proposed approach provides relative WER gains of 8.6% and 8.9% over DNNs trained on FBANK features with and without appended D-vectors, respectively. A relative WER gain of 10.3% is observed when the approach is applied on top of DNNs trained with CMLLR-transformed FBANK features, but the gain saturates when combined with D-vectors. Supervised adaptation with as little as one minute of audio from a specific speaker is observed to improve performance over the baseline.
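To make the two-stage idea concrete, the sketch below mocks up a first-stage DNN whose bottleneck layer is the only speaker-dependent component. All layer sizes, variable names, and the adaptation step are illustrative assumptions, not the authors' implementation; the point is that freezing everything except the BN weights keeps the per-speaker parameter count small, and that the BN activations serve as SDBN features for the second-stage SAT-trained DNN.

```python
import numpy as np

# Illustrative sketch (assumed sizes/names): first-stage DNN with a
# speaker-dependent bottleneck (BN) layer. Only W_bn is updated per
# speaker; W1 and W2 stay frozen.

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class BottleneckDNN:
    """First-stage DNN: input -> hidden -> bottleneck -> output.

    During speaker adaptation only the BN layer weights are updated,
    so the per-speaker parameter count is just n_hid * n_bn.
    """

    def __init__(self, n_in=40, n_hid=512, n_bn=39, n_out=2000):
        self.W1 = rng.standard_normal((n_in, n_hid)) * 0.01    # frozen
        self.W_bn = rng.standard_normal((n_hid, n_bn)) * 0.01  # speaker dependent
        self.W2 = rng.standard_normal((n_bn, n_out)) * 0.01    # frozen

    def bottleneck(self, x):
        # SDBN features: activations taken at the bottleneck layer.
        return relu(relu(x @ self.W1) @ self.W_bn)

    def adapt_bn(self, x, lr=1e-3):
        # Placeholder adaptation step: in the paper this would be
        # back-propagation on speaker-specific data with all layers
        # except the BN layer frozen. Here a dummy gradient shows
        # that the update touches W_bn alone.
        h = relu(x @ self.W1)
        grad = h.T @ np.sign(h @ self.W_bn)
        self.W_bn -= lr * grad

model = BottleneckDNN()
speaker_frames = rng.standard_normal((100, 40))  # toy speaker-specific frames

W1_before = model.W1.copy()
model.adapt_bn(speaker_frames)                 # speaker-specific BN update
sdbn_feats = model.bottleneck(speaker_frames)  # input to the second-stage SAT DNN

speaker_params = model.W_bn.size               # parameters tuned per speaker
```

In this toy setup the speaker-specific footprint is 512 × 39 ≈ 20k weights, far fewer than retraining a full hidden layer of the acoustic model, which is the motivation for adapting the BN layer rather than a hidden layer.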