Auxiliary Feature Based Adaptation of End-to-end ASR Systems

Marc Delcroix, Shinji Watanabe, Atsunori Ogawa, Shigeki Karita and Tomohiro Nakatani

Abstract:

Acoustic model adaptation has been widely used to adapt models to speakers or environments. For example, appending auxiliary features representing speakers, such as i-vectors, to the input of a deep neural network (DNN) is an effective way to realize unsupervised adaptation of DNN-hybrid automatic speech recognition (ASR) systems. Recently, end-to-end (E2E) models have been proposed as an alternative to conventional DNN-hybrid ASR systems. E2E models map a speech signal to a sequence of characters or words using a single neural network, which greatly simplifies the ASR pipeline. However, adaptation of E2E models has so far received little attention. In this paper, we investigate auxiliary feature based adaptation for encoder-decoder E2E models. We employ a recently proposed sequence summary network to compute auxiliary features instead of i-vectors, as it can be easily integrated into E2E models and keeps the ASR pipeline simple. Indeed, the sequence summary network allows the auxiliary feature extraction module to be a part of the computational graph of the E2E model. We demonstrate that the proposed adaptation scheme consistently improves recognition performance on three publicly available recognition tasks.
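
The idea of a sequence summary network can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the single tanh layer, the time averaging, the concatenation of the summary to each frame, and all names (`summary_vector`, `adapt_features`, the dimensions) are assumptions made for illustration only. The abstract does not specify the network architecture or how the auxiliary feature is combined with the input.

```python
import math
import random


def summary_vector(frames, weights, bias):
    """Average a frame-wise tanh projection over time.

    Returns one fixed-size vector per utterance (assumed form of the
    auxiliary feature; a real network would be deeper and trained
    jointly with the E2E model).
    """
    dim_out = len(bias)
    totals = [0.0] * dim_out
    for frame in frames:
        for j in range(dim_out):
            act = bias[j] + sum(w * x for w, x in zip(weights[j], frame))
            totals[j] += math.tanh(act)
    return [t / len(frames) for t in totals]


def adapt_features(frames, weights, bias):
    """Append the utterance-level summary to every frame.

    Concatenation is an assumption here; other combinations are possible.
    """
    aux = summary_vector(frames, weights, bias)
    return [list(frame) + aux for frame in frames]


# Toy utterance: 5 frames of 4-dimensional features, 2-dimensional summary.
random.seed(0)
feat_dim, aux_dim, num_frames = 4, 2, 5
weights = [[random.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(aux_dim)]
bias = [0.0] * aux_dim
utterance = [[random.gauss(0, 1) for _ in range(feat_dim)] for _ in range(num_frames)]

adapted = adapt_features(utterance, weights, bias)
```

Because the summary is computed from the same input frames with differentiable operations, such a module could in principle sit inside the computational graph of the recognizer, unlike an i-vector extractor that runs as a separate step.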


Cite as: Delcroix, M., Watanabe, S., Ogawa, A., Karita, S., Nakatani, T. (2018) Auxiliary Feature Based Adaptation of End-to-end ASR Systems. Proc. Interspeech 2018, 2444-2448, DOI: 10.21437/Interspeech.2018-1438.


BibTeX Entry:

@inproceedings{Delcroix2018,
author={Marc Delcroix and Shinji Watanabe and Atsunori Ogawa and Shigeki Karita and Tomohiro Nakatani},
title={Auxiliary Feature Based Adaptation of End-to-end ASR Systems},
year=2018,
booktitle={Proc. Interspeech 2018},
pages={2444--2448},
doi={10.21437/Interspeech.2018-1438},
url={http://dx.doi.org/10.21437/Interspeech.2018-1438} }