Exploiting Speaker and Phonetic Diversity of Mismatched Language Resources for Unsupervised Subword Modeling
Siyuan Feng and Tan Lee
Abstract:
This study addresses the problem of learning robust frame-level feature representations for unsupervised subword modeling in the zero-resource scenario. Robustness of the learned features is achieved through effective speaker adaptation and the exploitation of cross-lingual phonetic knowledge. For speaker adaptation, an out-of-domain automatic speech recognition (ASR) system is used to estimate feature-space maximum likelihood linear regression (fMLLR) features for untranscribed speech of the target zero-resource languages. The fMLLR features are then used in multi-task learning of a deep neural network (DNN) to obtain phonetically discriminative and speaker-invariant bottleneck features (BNFs). Frame-level labels for DNN training are obtained by two approaches: Dirichlet process Gaussian mixture model (DPGMM) clustering and out-of-domain ASR decoding. Moreover, system fusion is performed by concatenating BNFs extracted by different DNNs. Our methods are evaluated on ZeroSpeech 2017 Track 1, where performance is measured by ABX minimal-pair discriminability. Experimental results demonstrate that: (1) using an out-of-domain ASR system to perform speaker adaptation of zero-resource speech is effective and efficient; (2) our system achieves performance highly competitive with the state of the art; (3) system fusion can improve feature representation capability.
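The fusion step mentioned in the abstract, concatenating BNFs extracted by different DNNs, can be illustrated with a minimal sketch. It assumes that each DNN yields one feature vector per frame and that the streams are frame-aligned; the function name `fuse_bnfs`, the variable names, and the toy values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Frame = List[float]


def fuse_bnfs(streams: Sequence[Sequence[Frame]]) -> List[Frame]:
    """Concatenate frame-aligned feature streams along the feature dimension."""
    if not streams:
        raise ValueError("need at least one feature stream")
    n_frames = len(streams[0])
    # Assumption: every stream covers the same utterance at the same frame rate.
    if any(len(s) != n_frames for s in streams):
        raise ValueError("all streams must have the same number of frames")
    fused: List[Frame] = []
    for t in range(n_frames):
        frame: Frame = []
        for s in streams:
            frame.extend(s[t])  # append this stream's features for frame t
        fused.append(frame)
    return fused


# Toy example: two hypothetical BNF streams for a 3-frame utterance.
bnf_dpgmm = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 2-dim features per frame
bnf_asr = [[1.0], [2.0], [3.0]]                   # 1-dim features per frame
fused = fuse_bnfs([bnf_dpgmm, bnf_asr])           # 3 frames, 3 dims each
```

In this sketch the fused dimensionality is the sum of the individual stream dimensionalities, and the number of frames is unchanged.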
Cite as: Feng, S., Lee, T. (2018) Exploiting Speaker and Phonetic Diversity of Mismatched Language Resources for Unsupervised Subword Modeling. Proc. Interspeech 2018, 2673-2677, DOI: 10.21437/Interspeech.2018-1081.
BibTeX Entry:
@inproceedings{Feng2018,
author={Siyuan Feng and Tan Lee},
title={Exploiting Speaker and Phonetic Diversity of Mismatched Language Resources for Unsupervised Subword Modeling},
year=2018,
booktitle={Proc. Interspeech 2018},
pages={2673--2677},
doi={10.21437/Interspeech.2018-1081},
url={http://dx.doi.org/10.21437/Interspeech.2018-1081} }