Abstract:
Expressive neural text-to-speech (TTS) systems incorporate a style encoder to learn a latent embedding as the style
information. However, this embedding process may encode redundant textual information. This phenomenon is called
content leakage. Researchers have attempted to resolve this problem by adding an ASR or other auxiliary supervision
loss functions. In this study, we propose an unsupervised method called the ‘‘information sieve’’ to reduce the
effect of content leakage in prosody transfer. The rationale of this approach is that the style encoder can be
forced to focus on style information rather than on textual information contained in the reference speech by a
well-designed downsample--upsample filter, i.e., the extracted style embeddings can be downsampled at a certain
interval and then upsampled by duplication. Furthermore, we used instance normalization in convolution layers to
help the system learn a better latent style space. Objective metrics such as the significantly lower word error rate
(WER) demonstrate the effectiveness of this model in mitigating content leakage. Listening tests indicate that the
model retains its prosody transferability compared with the baseline models such as the original GST-Tacotron and
ASR-guided Tacotron.
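As a rough illustration of the downsample–upsample filter described above, the sketch below keeps every k-th frame of a style embedding sequence and upsamples by duplicating the kept frames. The function name `information_sieve`, the `interval` parameter, and the (batch, time, channels) layout are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def information_sieve(x, interval=2):
    """Downsample a style-embedding sequence at a fixed interval,
    then upsample by duplication (sketch of the 'information sieve').

    x: array of shape (batch, time, channels).
    """
    # Keep every `interval`-th frame along the time axis.
    kept = x[:, ::interval, :]
    # Upsample by duplicating each kept frame `interval` times.
    up = np.repeat(kept, interval, axis=1)
    # Trim back to the original sequence length.
    return up[:, : x.shape[1], :]
```

Because consecutive frames become identical copies, fine-grained (frame-level) textual detail is discarded while slowly varying style information survives, which is the intuition behind forcing the style encoder away from content.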
Definitions (same as defined in our submitted paper):
- Ground Truth: recordings selected directly from the test set of Blizzard Challenge 2013.
- Original GST-Tacotron: the original Tacotron model combined with Global Style Tokens.
- Sieve GST: the GST-Tacotron style encoder combined with an Information Sieve layer.
- I-G: batch normalization in the convolutional layers of the style encoder replaced with instance normalization.
- S-I-G: our proposed model, with both the Information Sieve and Instance Normalization in the style encoder.
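The I-G variant swaps batch normalization for instance normalization, which normalizes each channel per utterance instead of across the batch. A minimal sketch of that per-instance statistic (layout and `eps` value are illustrative assumptions):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization for a (batch, channels, time) feature map:
    each channel of each utterance is normalized with its own statistics,
    independent of the rest of the batch."""
    mean = x.mean(axis=2, keepdims=True)
    var = x.var(axis=2, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```

Because the statistics are computed per utterance, speaker- and style-dependent channel offsets are removed consistently at train and inference time, which can help the encoder learn a cleaner latent style space.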
Note: to better demonstrate the differences between the audio samples, we do not use audio generated by MelGAN here.
Word Error Rate (WER) per model:

Model         WER
Original GST  0.2942
ASR-G         0.1729
Sieve GST     0.1133
I-G           0.2453
S-I-G         0.1163
1. This is not the first time, nor the second, but it shall be the last.
With target reference:
Ground Truth Record & Target Reference Audio
Original GST (using target reference)
Sieve GST (using target reference)
I-G (using target reference)
S-I-G (using target reference)
With random reference:
Ground Truth Record & Random Reference Audio
Original GST (using random reference)
Sieve GST (using random reference)
I-G (using random reference)
S-I-G (using random reference)
2. At any other time this would have been felt dreadfully.