Data Requirements, Selection and Augmentation for DNN-based Speech Synthesis from Crowdsourced Data
Markus Toman, Geoffrey S. Meltzner and Rupal Patel
Abstract:
Crowdsourcing speech recordings provides unique opportunities and challenges for personalized speech synthesis: it allows gathering large quantities of data, but of highly variable quality. Manual methods for data selection and cleaning quickly become infeasible, especially when producing larger numbers of voices. We present and analyze approaches for data selection and augmentation to cope with this variability. For differently sized training sets, we assess speaker adaptation by transfer learning, including layer freezing, as well as sentence selection based on the maximum likelihood of forced alignment. The methodological framework utilizes statistical parametric speech synthesis based on Deep Neural Networks (DNNs). We compare objective scores for 576 voice models, representing all condition combinations. For a constrained set of conditions we also present results from a subjective listening test. We show that speaker adaptation improves overall quality in nearly all cases, that sentence selection helps detect recording errors, and that layer freezing proves ineffective in our system. We also find that while Mel-Cepstral Distortion (MCD) does not correlate with listener preference across the full range of values, the most preferred voices also exhibit the lowest MCD. These findings have implications for scalable methods of customized voice building and for clinical applications with sparse data.
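
The abstract refers to sentence selection by maximum likelihood of forced alignment and to Mel-Cepstral Distortion (MCD) as the objective score. As a rough illustration only (the abstract does not give the exact selection procedure, thresholds, or aligner), the Python sketch below ranks utterances by their average per-frame forced-alignment log-likelihood and discards the lowest-scoring ones, and computes the standard MCD between two mel-cepstral frames. The function names and the keep_fraction parameter are hypothetical, not taken from the paper.

    import math

    def mel_cepstral_distortion(ref_frame, syn_frame):
        # Standard MCD in dB between two mel-cepstral coefficient vectors,
        # excluding the 0th (energy) coefficient.
        diff_sq = sum((r - s) ** 2 for r, s in zip(ref_frame[1:], syn_frame[1:]))
        return (10.0 / math.log(10.0)) * math.sqrt(2.0 * diff_sq)

    def select_sentences(alignment_scores, keep_fraction=0.9):
        # alignment_scores: utterance id -> average forced-alignment
        # log-likelihood per frame (e.g. as reported by an HMM aligner).
        # Utterances with unusually low likelihood often indicate recording
        # or transcription errors, so only the best-scoring ones are kept.
        # keep_fraction is a hypothetical tuning knob, not a value from the paper.
        ranked = sorted(alignment_scores.items(), key=lambda kv: kv[1], reverse=True)
        n_keep = max(1, int(len(ranked) * keep_fraction))
        return [utt for utt, _ in ranked[:n_keep]]

    if __name__ == "__main__":
        # Toy example: "utt_c" aligns poorly, suggesting a defective recording.
        scores = {"utt_a": -62.1, "utt_b": -63.4, "utt_c": -81.7, "utt_d": -61.9}
        print(select_sentences(scores, keep_fraction=0.75))  # ['utt_d', 'utt_a', 'utt_b']

In practice the keep fraction (or an equivalent likelihood threshold) would be tuned on held-out data; the paper itself evaluates such choices via objective scores over 576 voice models and a listening test.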
Cite as: Toman, M., Meltzner, G.S., Patel, R. (2018) Data Requirements, Selection and Augmentation for DNN-based Speech Synthesis from Crowdsourced Data. Proc. Interspeech 2018, 2878-2882, DOI: 10.21437/Interspeech.2018-1316.
BibTeX Entry:
@inproceedings{Toman2018,
  author={Markus Toman and Geoffrey S. Meltzner and Rupal Patel},
  title={Data Requirements, Selection and Augmentation for DNN-based Speech Synthesis from Crowdsourced Data},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={2878--2882},
  doi={10.21437/Interspeech.2018-1316},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1316}
}