Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks
Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, Xiong Xiao and Fil Alleva
Abstract:
The goal of this work is to develop a meeting transcription system that can recognize speech even when utterances of different speakers overlap. Although speech overlaps have long been regarded as a major obstacle to accurately transcribing meetings, traditional beamformers with a single output have been used almost exclusively because previously proposed speech separation techniques have critical constraints that prevent their application to real meetings. This paper proposes a new signal processing module, called an unmixing transducer, and describes its implementation using a windowed BLSTM. The unmixing transducer has a fixed number, say J, of output channels, where J may differ from the number of meeting attendees, and transforms an input multichannel acoustic signal into J time-synchronous audio streams. Each utterance in the meeting is separated and emitted from one of the output channels, so each output signal can simply be fed to a speech recognition back-end for segmentation and transcription. Our meeting transcription system using the unmixing transducer outperforms a system based on a state-of-the-art neural mask-based beamformer by 10.8%. Significant improvements are observed in overlapped segments. To the best of our knowledge, this is the first report of overlapped speech recognition applied to unconstrained real meeting audio.
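As a rough illustration of the architecture described in the abstract (not the authors' implementation), the sketch below shows one plausible reading of a windowed BLSTM unmixing transducer: a BLSTM maps per-frame multichannel features to J time-synchronous time-frequency mask streams, and the network is applied over overlapping windows whose outputs are averaged. The class and function names, feature choice, mask-based output formulation, layer sizes, and window lengths are all assumptions made for illustration.

```python
# Minimal sketch, assuming a mask-based unmixing transducer with J output
# streams implemented as a windowed BLSTM. Not the authors' code.
import torch
import torch.nn as nn


class UnmixingTransducerSketch(nn.Module):
    """Hypothetical BLSTM that emits J time-synchronous mask streams."""

    def __init__(self, feat_dim, num_bins, num_outputs=2, hidden_dim=600):
        super().__init__()
        self.num_outputs = num_outputs
        self.num_bins = num_bins
        self.blstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        # One mask value per output stream and frequency bin.
        self.mask_layer = nn.Linear(2 * hidden_dim, num_outputs * num_bins)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) stacked multichannel features.
        h, _ = self.blstm(feats)
        masks = torch.sigmoid(self.mask_layer(h))
        # (batch, frames, J, num_bins): J synchronized mask streams.
        return masks.view(feats.shape[0], feats.shape[1],
                          self.num_outputs, self.num_bins)


def apply_windowed(model, feats, window=400, hop=200):
    """Run the model on overlapping windows and average overlapped outputs."""
    batch, frames, _ = feats.shape
    out = torch.zeros(batch, frames, model.num_outputs, model.num_bins)
    count = torch.zeros(frames)
    for start in range(0, max(frames - window, 0) + 1, hop):
        end = min(start + window, frames)
        out[:, start:end] += model(feats[:, start:end])
        count[start:end] += 1
    return out / count.clamp(min=1).view(1, -1, 1, 1)
```

In such a formulation, each of the J mask streams would be applied to a reference channel (or combined with spatial filtering) to produce the J time-synchronous audio streams that the back-end recognizer consumes; the windowed application is one simple way to keep the BLSTM tractable on long meeting recordings.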
Cite as: Yoshioka, T., Erdogan, H., Chen, Z., Xiao, X., Alleva, F. (2018) Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks. Proc. Interspeech 2018, 3038-3042, DOI: 10.21437/Interspeech.2018-2284.
BiBTeX Entry:
@inproceedings{Yoshioka2018,
  author={Takuya Yoshioka and Hakan Erdogan and Zhuo Chen and Xiong Xiao and Fil Alleva},
  title={Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={3038--3042},
  doi={10.21437/Interspeech.2018-2284},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2284}
}