Abstract:
Expressive neural text-to-speech (TTS) systems incorporate a style encoder to learn a latent embedding as the style
information. However, this embedding process may encode redundant textual information. This phenomenon is called
content leakage. Researchers have attempted to resolve this problem by adding an ASR or other auxiliary supervision
loss functions. In this study, we propose an unsupervised method called the ‘‘information sieve’’ to reduce the
effect of content leakage in prosody transfer. The rationale of this approach is that the style encoder can be
forced to focus on style information rather than on textual information contained in the reference speech by a
well-designed downsample--upsample filter, i.e., the extracted style embeddings can be downsampled at a certain
interval and then upsampled by duplication. Furthermore, we used instance normalization in convolution layers to
help the system learn a better latent style space. Objective metrics such as the significantly lower word error rate
(WER) demonstrate the effectiveness of this model in mitigating content leakage. Listening tests indicate that the
model retains its prosody transferability compared with the baseline models such as the original GST-Tacotron and
ASR-guided Tacotron.
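As a rough illustration of the downsample–upsample filter described above, the sketch below keeps every k-th frame of a style embedding sequence and upsamples by duplicating the kept frames. The function name `information_sieve`, the `interval` parameter, and the (batch, time, channels) layout are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def information_sieve(x, interval=2):
    """Downsample a style-embedding sequence at a fixed interval,
    then upsample by duplication (sketch of the 'information sieve').

    x: array of shape (batch, time, channels).
    """
    # Keep every `interval`-th frame along the time axis.
    kept = x[:, ::interval, :]
    # Upsample by duplicating each kept frame `interval` times.
    up = np.repeat(kept, interval, axis=1)
    # Trim back to the original sequence length.
    return up[:, : x.shape[1], :]
```

Because consecutive frames become identical copies, fine-grained (frame-level) textual detail is discarded while slowly varying style information survives, which is the intuition behind forcing the style encoder away from content.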
Definitions (same as defined in our submitted paper):
- Ground Truth: recordings selected directly from the test set of Blizzard Challenge 2013.
- Original GST-Tacotron: the original Tacotron model combined with Global Style Tokens.
- Sieve GST: the GST-Tacotron style encoder combined with an Information Sieve layer.
- I-G: batch normalization in the convolutional layers of the style encoder replaced with instance normalization.
- S-I-G: our proposed model, with both the Information Sieve and Instance Normalization in the style encoder.
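The I-G variant swaps batch normalization for instance normalization, which normalizes each channel per utterance instead of across the batch. A minimal sketch of that per-instance statistic (layout and `eps` value are illustrative assumptions):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization for a (batch, channels, time) feature map:
    each channel of each utterance is normalized with its own statistics,
    independent of the rest of the batch."""
    mean = x.mean(axis=2, keepdims=True)
    var = x.var(axis=2, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```

Because the statistics are computed per utterance, speaker- and style-dependent channel offsets are removed consistently at train and inference time, which can help the encoder learn a cleaner latent style space.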
Note: to better demonstrate the differences between the audio samples, we do not use audio generated by MelGAN here.
Word Error Rate (WER) per model:

Model         WER
Original GST  0.2942
ASR-G         0.1729
Sieve GST     0.1133
I-G           0.2453
S-I-G         0.1163
1. This is not the first time, nor the second, but it shall be the last.
With target reference:
Ground Truth Record & Target Reference Audio
Original GST (using target reference)
Sieve GST (using target reference)
I-G (using target reference)
S-I-G (using target reference)
With random reference:
Ground Truth Record & Random Reference Audio
Original GST (using random reference)
Sieve GST (using random reference)
I-G (using random reference)
S-I-G (using random reference)
2. At any other time this would have been felt dreadfully.