Abstract:
Recent advancements in text-to-speech (TTS) systems have focused on developing style-controlled models that generate speech with desired characteristics such as accent, tone, pitch, and expressiveness. However, controlling the style during training remains a challenge, particularly when dealing with nonparallel data (i.e., where text and audio data are not aligned one-to-one). In this paper, we propose a nonparallel expressive TTS approach. First, we introduce a novel advanced-control style adaptive layer normalization method to improve the controllability and stability of the training process. Second, we present an optimized pitch embedding method that enhances flexibility and control over pitch manipulation, resulting in improved accuracy in pitch generation. Our experimental results demonstrate that our approach achieves state-of-the-art performance in the naturalness of the synthesized speech and in its similarity to the target speaking style for unseen target speakers.
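For readers unfamiliar with style adaptive layer normalization, the following is a minimal sketch of the generic technique, not the advanced-control variant proposed in the paper, whose details are not given on this page. It assumes the common formulation in which a style vector predicts the gain and bias applied to layer-normalized hidden features. All names (`style_adaptive_layer_norm`, `W_gain`, `W_bias`, the dimensions) are hypothetical and chosen for illustration only.

```python
import numpy as np

def style_adaptive_layer_norm(h, w, W_gain, b_gain, W_bias, b_bias, eps=1e-5):
    """Generic style adaptive layer norm (illustrative assumption, not the paper's method).

    h: (T, D) hidden features for T frames.
    w: (S,) style vector for the target speaker/style.
    W_gain, W_bias: (S, D) linear maps from style vector to per-channel gain/bias.
    b_gain, b_bias: (D,) offsets for the predicted gain/bias.
    """
    # Normalize each frame over its feature dimension.
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_norm = (h - mean) / np.sqrt(var + eps)

    # Predict gain and bias from the style vector instead of learning fixed ones.
    gain = w @ W_gain + b_gain  # shape (D,)
    bias = w @ W_bias + b_bias  # shape (D,)

    # Broadcast the style-dependent affine transform over all frames.
    return gain * h_norm + bias

# Toy usage with random values (dimensions are arbitrary assumptions).
rng = np.random.default_rng(0)
T, D, S = 4, 8, 3
h = rng.normal(size=(T, D))
w = rng.normal(size=S)
W_gain = rng.normal(size=(S, D)) * 0.1
b_gain = np.ones(D)
W_bias = rng.normal(size=(S, D)) * 0.1
b_bias = np.zeros(D)

out = style_adaptive_layer_norm(h, w, W_gain, b_gain, W_bias, b_bias)
print(out.shape)  # (4, 8)
```

With a zero style vector and the offsets above, the function reduces to plain layer normalization, which makes clear that the style vector is the only source of speaker-dependent modulation in this sketch.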
Nonparallel: the model generates speech in the desired "style" for an "unseen" target speaker.
1) Female Samples
Text #1: "I don't like miss stackpole-everything about her displeases me; she talks so much too loud and looks at one as if one wanted to look at her-which one doesn't"
Natural
Conformer FS2_PWGAN
VITS_ESPnet2
Proposed
Text #2: "he also believes that the psychological process operates through transference"
Natural
Conformer FS2_PWGAN
VITS_ESPnet2
Proposed
Text #3: "the night of the sixteenth to the seventeenth of february, eighteen thirty three, was a blessed night"
Natural
Conformer FS2_PWGAN
VITS_ESPnet2
Proposed
Text #4: "and the older one answered back, "well, you're not so good looking" (which was also true)"
Natural
Conformer FS2_PWGAN
VITS_ESPnet2
Proposed
2) Male Samples
Text #1: "a small pond in the track of the cloud was sucked dry, the water being carried over the adjoining fields together with a large quantity of soft mud, which was scattered over the ground for half a mile around"
Natural
Conformer FS2_PWGAN
VITS_ESPnet2
Proposed
Text #2: "thirty years ago, you could take an old muzzle loader and knock over plenty of ducks in the city limits, and chicago wasn't cook county then, either"
Natural
Conformer FS2_PWGAN
VITS_ESPnet2
Proposed
Text #3: "from the golden brightness, displayed by them at noon, they have changed to a lurid red-as if there was anger in the sky!"
Natural
Conformer FS2_PWGAN
VITS_ESPnet2
Proposed
Text #4: "somewhere about five o'clock pod came into camp with a good mess of trout"