Interspeech 2018

Sequence-to-sequence Neural Network Model with 2D Attention for Learning Japanese Pitch Accents

Antoine Bruguier, Heiga Zen and Arkady Arkhangorodsky

Abstract:

Many Japanese text-to-speech (TTS) systems use word-level pitch accents as one of their prosodic features. To predict pitch accents from text, a pronunciation dictionary that includes lexical pitch accents is often combined with a statistical model of word accent sandhi. However, relying on human transcribers to build the dictionary and the training data for the model is tedious and expensive. This paper proposes a neural pitch accent recognition model. The model combines information from audio and its transcription (a word sequence in hiragana characters) via two-dimensional attention and outputs word-level pitch accents. Experimental results show a reduction in the word pitch accent prediction error rate compared with using text only. The model thus reduces the workload of human annotators when building a pronunciation dictionary. As the approach is general, it can also be used for pronunciation learning in other languages.
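
The abstract does not specify how the two-dimensional attention is computed, so the following is only an illustrative sketch of one possible reading: a single softmax taken jointly over a grid of (audio frame, hiragana character) pairs, rather than over one sequence at a time. The function names (`attend_2d`, `softmax_2d`, `dot`), the tanh-based score, and the toy feature values are all assumptions made for this example and are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax_2d(scores):
    """Normalize a T x J grid of scores so that all cells sum to 1."""
    peak = max(max(row) for row in scores)  # subtract the max for stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def attend_2d(query, audio, text):
    """Jointly attend over audio frames (axis 0) and characters (axis 1).

    Assumption for this sketch: audio and text encodings share the same
    dimensionality, and the score couples them through tanh(audio + text),
    so the weights do not factor into two independent 1D attentions.
    """
    scores = [
        [dot(query, [math.tanh(a + c) for a, c in zip(frame, char)])
         for char in text]
        for frame in audio
    ]
    weights = softmax_2d(scores)

    # Context vector: weighted sum of concatenated [audio frame; character].
    context = [0.0] * (len(audio[0]) + len(text[0]))
    for t, frame in enumerate(audio):
        for j, char in enumerate(text):
            for k, value in enumerate(frame + char):
                context[k] += weights[t][j] * value
    return weights, context


# Toy inputs (hypothetical values): 3 audio frames, 2 hiragana characters.
audio = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4]]
text = [[0.2, 0.1], [-0.3, 0.6]]
query = [1.0, -0.5]

weights, context = attend_2d(query, audio, text)
```

Here `weights` is a 3 x 2 grid whose entries sum to 1, and `context` has one block of audio features followed by one block of text features. In a full model, a decoder would presumably form a new query for each word and predict that word's pitch accent from the resulting context; that step is omitted here.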

Cite as: Bruguier, A., Zen, H., Arkhangorodsky, A. (2018) Sequence-to-sequence Neural Network Model with 2D Attention for Learning Japanese Pitch Accents. Proc. Interspeech 2018, 1284-1287, DOI: 10.21437/Interspeech.2018-1381.

BibTeX Entry:

@inproceedings{Bruguier2018,
  author={Antoine Bruguier and Heiga Zen and Arkady Arkhangorodsky},
  title={Sequence-to-sequence Neural Network Model with 2D Attention for Learning Japanese Pitch Accents},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1284--1287},
  doi={10.21437/Interspeech.2018-1381},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1381}
}