Improving CTC-based Acoustic Model with Very Deep Residual Time-delay Neural Networks
Sheng Li, Xugang Lu, Ryoichi Takashima, Peng Shen, Tatsuya Kawahara and Hisashi Kawai
Abstract:
Connectionist temporal classification (CTC) has shown great potential in end-to-end (E2E) acoustic modeling. The current state-of-the-art architecture for a CTC-based E2E model is a deep bidirectional long short-term memory (BLSTM) network that provides frame-wise outputs estimated from both the forward and backward directions (BLSTM-CTC). Since this architecture introduces serious latency in decoding, it cannot be applied to real-time speech recognition tasks. Considering that the CTC label of the current frame is affected only by a few neighboring frames, we argue that a BLSTM traversing the whole utterance in both directions is not necessary. In this paper, we use a very deep residual time-delay (VResTD) network for CTC-based E2E acoustic modeling (VResTD-CTC). The VResTD network provides frame-wise outputs with local bidirectional information, without needing to wait for the whole utterance. Speech recognition experiments on the Corpus of Spontaneous Japanese were carried out to compare our proposed VResTD-CTC with the state-of-the-art BLSTM-CTC model. Comparable performance was obtained, while the proposed VResTD-CTC does not suffer from the decoding latency problem.
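To make the idea concrete, below is a minimal PyTorch sketch of a residual time-delay stack with a frame-wise CTC output, assuming the time-delay layers are realized as dilated 1-D convolutions with symmetric (past and future) context. The layer widths, number of blocks, dilation pattern, and label inventory are illustrative assumptions, not the configuration reported in the paper.

import torch
import torch.nn as nn

class ResTDBlock(nn.Module):
    """Residual time-delay block: each Conv1d sees a symmetric local window of
    frames, providing local bidirectional context without whole-utterance recurrence."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2          # keep the frame length unchanged
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation)
        self.bn1 = nn.BatchNorm1d(channels)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):                                # x: (batch, channels, frames)
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(x + y)                          # residual (skip) connection

class VResTDCTC(nn.Module):
    """Stack of residual time-delay blocks followed by a frame-wise CTC output layer.
    Depth, width, and dilations below are assumptions for illustration."""
    def __init__(self, feat_dim, num_labels, channels=256, num_blocks=10):
        super().__init__()
        self.input_proj = nn.Conv1d(feat_dim, channels, kernel_size=1)
        self.blocks = nn.Sequential(
            *[ResTDBlock(channels, dilation=2 ** (i % 4)) for i in range(num_blocks)])
        self.output = nn.Linear(channels, num_labels)    # num_labels includes the CTC blank

    def forward(self, feats):                            # feats: (batch, frames, feat_dim)
        x = self.input_proj(feats.transpose(1, 2))
        x = self.blocks(x)
        return self.output(x.transpose(1, 2))            # (batch, frames, num_labels) logits

# Training step with the standard CTC loss (blank index 0 assumed; dummy data):
model = VResTDCTC(feat_dim=40, num_labels=50)
feats = torch.randn(2, 200, 40)                          # 2 utterances, 200 frames, 40-dim features
labels = torch.randint(1, 50, (2, 30))                   # dummy label sequences
log_probs = model(feats).log_softmax(-1).transpose(0, 1) # (frames, batch, labels) for CTCLoss
loss = nn.CTCLoss(blank=0)(log_probs, labels,
                           torch.full((2,), 200, dtype=torch.long),
                           torch.full((2,), 30, dtype=torch.long))
loss.backward()

Because each block only needs a fixed symmetric window of frames, such a network can emit frame-wise outputs with a bounded look-ahead instead of waiting for the end of the utterance, which is the latency advantage the abstract claims over BLSTM-CTC.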
Cite as: Li, S., Lu, X., Takashima, R., Shen, P., Kawahara, T., Kawai, H. (2018) Improving CTC-based Acoustic Model with Very Deep Residual Time-delay Neural Networks. Proc. Interspeech 2018, 3708-3712, DOI: 10.21437/Interspeech.2018-1475.
BiBTeX Entry:
@inproceedings{Li2018,
  author={Sheng Li and Xugang Lu and Ryoichi Takashima and Peng Shen and Tatsuya Kawahara and Hisashi Kawai},
  title={Improving CTC-based Acoustic Model with Very Deep Residual Time-delay Neural Networks},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={3708--3712},
  doi={10.21437/Interspeech.2018-1475},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1475}
}