A Comparison of Speaker-based and Utterance-based Data Selection for Text-to-Speech Synthesis

Kai-Zhan Lee, Erica Cooper and Julia Hirschberg

Abstract:

Building on previous work in subset selection of training data for text-to-speech (TTS), this work compares speaker-level and utterance-level selection of TTS training data, using acoustic features to guide selection. We find that speaker-based selection is more effective than utterance-based selection, regardless of whether selection is guided by a single feature or a combination of features. We use US English telephone data collected for automatic speech recognition to simulate the conditions of TTS training for low-resource languages. Our best voice achieves a human-evaluated WER of 29.0% on semantically unpredictable sentences, a significant improvement over our baseline voice trained on the same amount of randomly selected utterances, which performed at 42.4% WER. In addition to subjective voice evaluations with Amazon Mechanical Turk, we explore objective voice evaluation using mel-cepstral distortion, and we find that this measure correlates strongly with human evaluations of intelligibility, indicating that it may be a useful way to evaluate or pre-select voices in future work.
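For reference, mel-cepstral distortion (MCD) is conventionally computed as (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2) per frame, averaged over time-aligned frames of natural and synthesized speech, excluding the 0th (energy) coefficient. The sketch below is a minimal illustration of this standard formula, not the paper's exact evaluation pipeline; the function name, array shapes, and the assumption that the two sequences are already aligned (e.g. via dynamic time warping) are ours.

import numpy as np

# 10 / ln(10) expresses the distortion in decibels.
_MCD_CONST = 10.0 / np.log(10.0)

def mel_cepstral_distortion(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """Mean per-frame MCD in dB between two aligned mel-cepstral sequences.

    Both inputs are (num_frames, num_coeffs) arrays holding coefficients
    c_1..c_D; the 0th (energy) coefficient is conventionally excluded.
    Frame alignment (e.g. DTW) is assumed to have been done already.
    """
    diff = ref_mcep - syn_mcep                          # (frames, coeffs)
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(_MCD_CONST * np.mean(per_frame))

# Toy usage with random stand-in data; real usage would extract mel-cepstra
# with a vocoder toolkit such as SPTK or WORLD.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(200, 24))
    syn = ref + rng.normal(scale=0.1, size=(200, 24))
    print(f"MCD: {mel_cepstral_distortion(ref, syn):.2f} dB")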


Cite as: Lee, K., Cooper, E., Hirschberg, J. (2018) A Comparison of Speaker-based and Utterance-based Data Selection for Text-to-Speech Synthesis. Proc. Interspeech 2018, 2873-2877, DOI: 10.21437/Interspeech.2018-1313.


BibTeX Entry:

@inproceedings{Lee2018,
  author={Kai-Zhan Lee and Erica Cooper and Julia Hirschberg},
  title={A Comparison of Speaker-based and Utterance-based Data Selection for Text-to-Speech Synthesis},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={2873--2877},
  doi={10.21437/Interspeech.2018-1313},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1313}
}