Interspeech 2018

Paired Phone-Posteriors Approach to ESL Pronunciation Quality Assessment

Abstract:

This work proposes incorporating paired phone posteriors as input features to a neural network (NN) model for assessing the pronunciation quality of ESL learners. Posteriors of forty phones, instead of several thousand sub-phonemic senones, are used to circumvent sparsity issues in NN training. Each phone posterior is assembled from its corresponding senone posteriors, which are estimated by a speaker-independent, DNN-based acoustic model trained on standard American English speech data (i.e., the Wall Street Journal database). Phone posteriors of the reference speaker (a standard American English speaker) and the test speaker are paired as augmented input feature vectors to train an NN-based, two-class (native vs. non-native speaker) classifier. Goodness of Pronunciation (GOP), a proven effective measure, serves as the baseline for comparison. The binary NN classifier trained with these features achieves a classification accuracy of 89.6% on native and non-native speakers' data. It also achieves a lower equal error rate (EER) than the GOP-based baseline classifier at both the phone and word levels: the EER drops from 18.3% to 6.2% at the phone level and from 12.98% to 2.54% at the word level.
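The feature construction described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the senone-to-phone lookup table, and the choice to average frame-level posteriors over a segment before pairing are all assumptions made for illustration; the abstract states only that phone posteriors are assembled from senone posteriors and that reference and test posteriors are paired.

```python
import numpy as np

def senone_to_phone_posteriors(senone_post, senone_to_phone, num_phones):
    """Collapse frame-level senone posteriors into phone posteriors.

    senone_post:     (frames, senones) array, each row sums to 1.
    senone_to_phone: length-`senones` sequence giving the phone index of
                     each senone (hypothetical lookup table).
    """
    senone_post = np.asarray(senone_post, dtype=float)
    senone_to_phone = np.asarray(senone_to_phone, dtype=int)
    num_senones = senone_post.shape[1]
    # One-hot (senones x phones) matrix; multiplying by it sums the
    # posteriors of all senones that belong to the same phone.
    mapping = np.zeros((num_senones, num_phones))
    mapping[np.arange(num_senones), senone_to_phone] = 1.0
    return senone_post @ mapping

def paired_feature(ref_phone_post, test_phone_post):
    """Build one paired input vector for a segment.

    Assumption: each speaker's segment is summarized by the mean of its
    frame-level phone posteriors, and the two summaries are concatenated.
    """
    ref_mean = np.mean(ref_phone_post, axis=0)
    test_mean = np.mean(test_phone_post, axis=0)
    return np.concatenate([ref_mean, test_mean])
```

Under these assumptions, forty phones give an 80-dimensional paired vector per segment, compared with several thousand dimensions per speaker if senone posteriors were used directly.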

Cite as: Xiao, Y., Soong, F., Hu, W. (2018) Paired Phone-Posteriors Approach to ESL Pronunciation Quality Assessment. Proc. Interspeech 2018, 1631-1635, DOI: 10.21437/Interspeech.2018-1270.

BibTeX Entry:

@inproceedings{Xiao2018,
author={Yujia Xiao and Frank Soong and Wenping Hu},
title={Paired Phone-Posteriors Approach to ESL Pronunciation Quality Assessment},
year=2018,
booktitle={Proc. Interspeech 2018},
pages={1631--1635},
doi={10.21437/Interspeech.2018-1270},
url={http://dx.doi.org/10.21437/Interspeech.2018-1270} }