Multi-channel Attention for End-to-End Speech Recognition
Stefan Braun, Daniel Neil, Jithendar Anumula, Enea Ceolini and Shih-Chii Liu
Abstract:
Recent end-to-end models for automatic speech recognition use sensory attention to integrate multiple input channels within a single neural network. However, these attention models are sensitive to the ordering of the channels used during training. This work proposes a sensory attention mechanism that is invariant to channel ordering and increases the overall parameter count by only 0.09%. We demonstrate that, even without re-training, our attention-equipped end-to-end model handles arbitrary numbers of input channels during inference. Compared to a recent related model with sensory attention, our model, when tested on the real noisy recordings of the multi-channel CHiME-4 dataset, achieves relative character error rate (CER) improvements of 40.3% to 42.9%. In a two-channel configuration experiment, the attention signal allows the lower signal-to-noise ratio (SNR) sensor to be identified with 97.7% accuracy.
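The channel-order invariance described above can be achieved by scoring every channel with one shared network and merging channels with a softmax-weighted sum, since both operations are permutation-invariant and accept any channel count. The sketch below illustrates this general idea only; the class name ChannelAttention, the scorer layer sizes, and the feature shapes are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Order-invariant attention over input channels (illustrative sketch).

    The same scoring network is shared across all channels, and the
    softmax plus weighted sum are permutation-invariant, so the module
    accepts any number of channels in any order.
    """

    def __init__(self, feat_dim: int, hidden_dim: int = 16):
        super().__init__()
        # A tiny shared scorer keeps the added parameter count small,
        # in the spirit of the paper's 0.09% overhead.
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_channels, time, feat_dim)
        scores = self.scorer(x)              # (C, T, 1): one score per channel and frame
        weights = torch.softmax(scores, 0)   # normalize across channels per frame
        return (weights * x).sum(dim=0)      # (T, feat_dim): attention-weighted mix


# Usage: the channel count may differ between training and inference.
att = ChannelAttention(feat_dim=40)
two_ch = torch.randn(2, 100, 40)
six_ch = torch.randn(6, 100, 40)
print(att(two_ch).shape, att(six_ch).shape)  # torch.Size([100, 40]) for both

Because the softmax runs over the channel dimension, the same trained weights apply unchanged to two or six channels, which is what allows inference with arbitrary channel counts and lets the per-channel weights serve as an SNR indicator.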
Cite as: Braun, S., Neil, D., Anumula, J., Ceolini, E., Liu, S.-C. (2018) Multi-channel Attention for End-to-End Speech Recognition. Proc. Interspeech 2018, 17-21, DOI: 10.21437/Interspeech.2018-1301.
BiBTeX Entry:
@inproceedings{Braun2018,
  author={Stefan Braun and Daniel Neil and Jithendar Anumula and Enea Ceolini and Shih-Chii Liu},
  title={Multi-channel Attention for End-to-End Speech Recognition},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={17--21},
  doi={10.21437/Interspeech.2018-1301},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1301}
}