Nonparallel Expressive TTS for Unseen Target Speaker using Style-Controlled Adaptive Layer and Optimized Pitch Embedding

Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh

  • LibriTTS corpus: 110 hours of audio from 1151 speakers, with corresponding text transcripts
  • Pre-trained model: 120000.pth.tar
  • Abstract: Recent advancements in text-to-speech (TTS) systems have focused on developing style-controlled models that generate speech with desired characteristics such as accent, tone, pitch, and expressiveness. However, controlling the style during the training process remains a challenge, particularly when dealing with nonparallel data (i.e., where text and audio data lack perfect one-to-one alignment). In this paper, we propose a nonparallel expressive TTS approach. First, we introduce a novel style-controlled adaptive layer normalization method to improve the controllability and stability of the training process. Second, we present an optimized pitch embedding method that enhances flexibility and control over pitch manipulation, resulting in improved accuracy in pitch generation. Our experimental results demonstrate that our approach achieves state-of-the-art performance in the naturalness and speaking-style similarity of synthesized speech for unseen target speakers.
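
To make the idea of style-conditioned layer normalization concrete, here is a minimal sketch in plain Python. It is not the implementation from the paper: it only illustrates the general pattern, in which a hidden vector is normalized and then rescaled by a gain and bias predicted from a style vector. The class name `StyleAdaptiveLayerNorm`, the helper functions, the dimensions, and the random initialization are all hypothetical and chosen for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def affine(weights, bias, vec):
    """Dense layer: `weights` holds one row of coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


class StyleAdaptiveLayerNorm:
    """Hypothetical illustration, not the paper's layer.

    The gain and bias applied after normalization are predicted from a
    style vector instead of being fixed learned parameters.
    """

    def __init__(self, hidden_dim, style_dim, seed=0):
        rng = random.Random(seed)

        def init_matrix():
            return [[rng.uniform(-0.1, 0.1) for _ in range(style_dim)]
                    for _ in range(hidden_dim)]

        self.w_gain = init_matrix()
        self.w_bias = init_matrix()
        # Offsets chosen so that a zero style vector gives plain layer norm.
        self.b_gain = [1.0] * hidden_dim
        self.b_bias = [0.0] * hidden_dim

    def __call__(self, hidden, style):
        gain = affine(self.w_gain, self.b_gain, style)
        bias = affine(self.w_bias, self.b_bias, style)
        normed = layer_norm(hidden)
        return [g * h + b for g, h, b in zip(gain, normed, bias)]


if __name__ == "__main__":
    saln = StyleAdaptiveLayerNorm(hidden_dim=4, style_dim=2)
    hidden = [1.0, 2.0, 3.0, 4.0]
    print(saln(hidden, [0.0, 0.0]))   # neutral style: plain layer norm
    print(saln(hidden, [1.0, -1.0]))  # a different style shifts gain and bias
```

In this sketch, the same hidden vector yields different outputs for different style vectors, which is the property that lets a reference utterance from an unseen speaker steer the synthesized style. In a trained model the weights would be learned rather than randomly initialized.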

    Nonparallel: the model generates speech in the desired "Style" for an "Unseen" target speaker



    1) Female Samples

    Text #1: "I don't like miss stackpole-everything about her displeases me; she talks so much too loud and looks at one as if one wanted to look at her-which one doesn't"

    Natural Conformer FS2_PWGAN VITS_ESPnet2 Proposed


    Text #2: "he also believes that the psychological process operates through transference"

    Natural Conformer FS2_PWGAN VITS_ESPnet2 Proposed


    Text #3: "the night of the sixteenth to the seventeenth of february, eighteen thirty three, was a blessed night"

    Natural Conformer FS2_PWGAN VITS_ESPnet2 Proposed


    Text #4: "and the older one answered back, "well, you're not so good looking" (which was also true)"

    Natural Conformer FS2_PWGAN VITS_ESPnet2 Proposed





    2) Male Samples

    Text #1: "a small pond in the track of the cloud was sucked dry, the water being carried over the adjoining fields together with a large quantity of soft mud, which was scattered over the ground for half a mile around"

    Natural Conformer FS2_PWGAN VITS_ESPnet2 Proposed


    Text #2: "thirty years ago, you could take an old muzzle loader and knock over plenty of ducks in the city limits, and chicago wasn't cook county then, either"

    Natural Conformer FS2_PWGAN VITS_ESPnet2 Proposed


    Text #3: "from the golden brightness, displayed by them at noon, they have changed to a lurid red-as if there was anger in the sky!"

    Natural Conformer FS2_PWGAN VITS_ESPnet2 Proposed


    Text #4: "somewhere about five o'clock pod came into camp with a good mess of trout"

    Natural Conformer FS2_PWGAN VITS_ESPnet2 Proposed