Nonparallel Expressive TTS for Unseen Target Speaker using Style-Controlled Adaptive Layer and Optimized Pitch Embedding

Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh

LibriTTS corpus: 110 hours of audio from 1151 speakers, with corresponding text transcripts

Pre-trained model: 120000.pth.tar

Abstract: Recent advancements in text-to-speech (TTS) systems have focused on developing style-controlled models that generate speech with desired characteristics such as accent, tone, pitch, and expressiveness. However, controlling the style during the training process remains a challenge, particularly when dealing with nonparallel data (i.e., where text and audio data are not perfectly aligned one-to-one). In this paper, we propose a nonparallel expressive TTS approach. First, we introduce a novel advanced control style adaptive layer normalization method to improve the controllability and stability of the training process. Second, we present an optimized pitch embedding method that enhances flexibility and control over pitch manipulation, resulting in improved accuracy in pitch generation. Our experimental results demonstrate that our approach achieves state-of-the-art performance in the naturalness and speaking-style similarity of synthesized speech for unseen target speakers.
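
As a rough illustration of the idea behind style-adaptive layer normalization, the sketch below normalizes a hidden feature vector and then scales and shifts it with a gain and bias predicted from a style vector. This is the generic formulation only, not the specific layer proposed in the paper. The function names, dimensions, and weight values are all hypothetical and chosen for readability.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(w, b, s):
    """Affine projection: returns w @ s + b for a list-of-rows matrix w."""
    return [sum(wi * si for wi, si in zip(row, s)) + bi for row, bi in zip(w, b)]

def style_adaptive_layer_norm(x, style, w_gain, b_gain, w_bias, b_bias):
    """Scale and shift normalized features with a gain and bias predicted from the style vector."""
    gain = linear(w_gain, b_gain, style)
    bias = linear(w_bias, b_bias, style)
    return [g * h + b for g, h, b in zip(gain, layer_norm(x), bias)]

# Hypothetical toy values: 3 hidden features, 2-dimensional style vector.
hidden = [1.0, 2.0, 3.0]
style = [0.5, -1.0]
w_gain = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # zero weights, so gain equals b_gain
b_gain = [1.0, 1.0, 1.0]
w_bias = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_bias = [0.0, 0.0, 0.0]

out = style_adaptive_layer_norm(hidden, style, w_gain, b_gain, w_bias, b_bias)
print(out)
```

In this sketch the affine parameters come from the style vector rather than being fixed learned constants, so the same hidden features are modulated differently for each reference style. In a real model the projections would be trained jointly with the rest of the network.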

Nonparallel: it generates speech in the desired "Style" for the "Unseen" target speaker

1) Female Samples

Text #1: "I don't like miss stackpole-everything about her displeases me; she talks so much too loud and looks at one as if one wanted to look at her-which one doesn't"

Natural	Conformer FS2_PWGAN	VITS_ESPnet2	Proposed

Text #2: "he also believes that the psychological process operates through transference"

Natural	Conformer FS2_PWGAN	VITS_ESPnet2	Proposed

Text #3: "the night of the sixteenth to the seventeenth of february, eighteen thirty three, was a blessed night"

Natural	Conformer FS2_PWGAN	VITS_ESPnet2	Proposed

Text #4: "and the older one answered back, "well, you're not so good looking" (which was also true)"

Natural	Conformer FS2_PWGAN	VITS_ESPnet2	Proposed

2) Male Samples

Text #1: "a small pond in the track of the cloud was sucked dry, the water being carried over the adjoining fields together with a large quantity of soft mud, which was scattered over the ground for half a mile around"

Natural	Conformer FS2_PWGAN	VITS_ESPnet2	Proposed

Text #2: "thirty years ago, you could take an old muzzle loader and knock over plenty of ducks in the city limits, and chicago wasn't cook county then, either"

Natural	Conformer FS2_PWGAN	VITS_ESPnet2	Proposed

Text #3: "from the golden brightness, displayed by them at noon, they have changed to a lurid red-as if there was anger in the sky!"

Natural	Conformer FS2_PWGAN	VITS_ESPnet2	Proposed

Text #4: "somewhere about five o'clock pod came into camp with a good mess of trout"

Natural	Conformer FS2_PWGAN	VITS_ESPnet2	Proposed