Improved Training for Online End-to-end Speech Recognition Systems
Suyoun Kim, Michael Seltzer, Jinyu Li and Rui Zhao
Abstract:
Achieving high accuracy with end-to-end speech recognizers requires careful parameter initialization prior to training. Otherwise, the networks may fail to find a good local optimum. This is particularly true for online networks, such as unidirectional LSTMs. Currently, the best strategy to train such systems is to bootstrap the training from a tied-triphone system. However, this is time-consuming and, more importantly, impossible for languages without a high-quality pronunciation lexicon. In this work, we propose an initialization strategy that uses teacher-student learning to transfer knowledge from a large, well-trained, offline end-to-end speech recognition model to an online end-to-end model, eliminating the need for a lexicon or any other linguistic resources. We also explore curriculum learning and label smoothing and show how they can be combined with the proposed teacher-student learning for further improvements. We evaluate our methods on a Microsoft Cortana personal assistant task and show that the proposed method results in a 19% relative improvement in word error rate compared to a randomly-initialized baseline system.
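To make the teacher-student idea concrete, the sketch below shows a common form of the distillation objective: the student is trained to match the teacher's per-frame output posteriors by minimizing a KL divergence. This is a minimal illustration only; the paper's exact loss, interpolation with the hard-label objective, and model architectures (bidirectional teacher, unidirectional student) are not reproduced here, and all function names are illustrative.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the label dimension."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def teacher_student_loss(student_logits, teacher_logits, eps=1e-12):
    """Frame-level distillation loss: KL(teacher || student), averaged
    over frames. Inputs are (num_frames, num_labels) logit arrays.
    This is a generic distillation objective, not necessarily the
    paper's exact formulation."""
    p_t = softmax(teacher_logits)                 # soft targets from teacher
    log_p_s = np.log(softmax(student_logits) + eps)
    log_p_t = np.log(p_t + eps)
    # Sum KL over labels, then average over frames
    return float(np.mean(np.sum(p_t * (log_p_t - log_p_s), axis=-1)))
```

In practice the teacher's posteriors serve as soft labels, so no pronunciation lexicon or forced alignment against a tied-triphone system is needed, which is the motivation given in the abstract.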
Cite as: Kim, S., Seltzer, M., Li, J., Zhao, R. (2018) Improved Training for Online End-to-end Speech Recognition Systems. Proc. Interspeech 2018, 2913-2917, DOI: 10.21437/Interspeech.2018-2517.
BiBTeX Entry:
@inproceedings{Kim2018,
  author={Suyoun Kim and Michael Seltzer and Jinyu Li and Rui Zhao},
  title={Improved Training for Online End-to-end Speech Recognition Systems},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={2913--2917},
  doi={10.21437/Interspeech.2018-2517},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2517}
}