An Unsupervised Neural Prediction Framework for Learning Speaker Embeddings Using Recurrent Neural Networks

Arindam Jati and Panayiotis Georgiou

Abstract:

This paper presents an unsupervised training framework for learning a speaker-specific embedding using a Neural Predictive Coding (NPC) technique. We employ a Recurrent Neural Network (RNN) trained on unlabeled audio with multiple and unknown speaker change points. We assume short-term speaker stationarity, i.e., that speech frames in close temporal proximity originate from a single speaker. In contrast, two random short speech segments from different audio streams are assumed to originate from two different speakers. Based on this hypothesis, we develop a binary classification task of predicting whether an input pair of short speech segments comes from the same speaker or not. An RNN-based deep siamese network is trained on this task, and the resulting embeddings, extracted from a hidden-layer representation of the network, are employed as speaker embeddings. Experimental results on speaker change point detection show the efficacy of the proposed method in learning short-term speaker-specific features. We also show the consistency of these features via a simple statistics-based utterance-level speaker classification task. The proposed method outperforms the MFCC baseline for speaker change detection, and both the MFCC and i-vector baselines for speaker classification.
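The pair-construction idea in the abstract (positive pairs from temporally adjacent segments of one stream, negative pairs from two different streams) can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the function name, the stream representation (lists of per-frame feature arrays such as MFCCs), and the 50/50 pair balance are all assumptions.

```python
import numpy as np

def sample_pairs(streams, seg_len, n_pairs, rng=None):
    """Sample (seg_a, seg_b, label) training pairs under the short-term
    speaker-stationarity assumption described in the paper:
      label 1 -> two adjacent segments from the same audio stream
                 (assumed same speaker),
      label 0 -> two random segments from different streams
                 (assumed different speakers).
    `streams` is a list of 2-D arrays of shape (n_frames, n_features).
    All names and defaults here are illustrative assumptions."""
    rng = rng or np.random.default_rng(0)
    pairs = []
    for _ in range(n_pairs):
        if rng.random() < 0.5:
            # Positive pair: two back-to-back segments from one stream.
            s = streams[rng.integers(len(streams))]
            t = rng.integers(0, len(s) - 2 * seg_len + 1)
            pairs.append((s[t:t + seg_len], s[t + seg_len:t + 2 * seg_len], 1))
        else:
            # Negative pair: random segments from two distinct streams.
            i, j = rng.choice(len(streams), size=2, replace=False)
            a, b = streams[i], streams[j]
            ta = rng.integers(0, len(a) - seg_len + 1)
            tb = rng.integers(0, len(b) - seg_len + 1)
            pairs.append((a[ta:ta + seg_len], b[tb:tb + seg_len], 0))
    return pairs
```

Such pairs would then feed a siamese RNN trained with a binary same/different-speaker loss; the hidden-layer activations of the trained network serve as the speaker embedding.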


Cite as: Jati, A., Georgiou, P. (2018) An Unsupervised Neural Prediction Framework for Learning Speaker Embeddings Using Recurrent Neural Networks. Proc. Interspeech 2018, 1131-1135, DOI: 10.21437/Interspeech.2018-1363.


BibTeX Entry:

@inproceedings{Jati2018,
  author    = {Arindam Jati and Panayiotis Georgiou},
  title     = {An Unsupervised Neural Prediction Framework for Learning Speaker Embeddings Using Recurrent Neural Networks},
  year      = {2018},
  booktitle = {Proc. Interspeech 2018},
  pages     = {1131--1135},
  doi       = {10.21437/Interspeech.2018-1363},
  url       = {http://dx.doi.org/10.21437/Interspeech.2018-1363}
}