Authors: Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Géza Németh
Abstract: Recent advancements in text-to-speech (TTS) systems have focused on developing style-controlled models that generate speech with desired characteristics such as accent, tone, pitch, and expressiveness. However, controlling the style during the training process remains a challenge, particularly when dealing with nonparallel data (i.e., where text and audio data lack perfect one-to-one alignment). In this paper, we propose a nonparallel expressive TTS approach by first introducing a novel advanced control style adaptive layer normalization method to improve the controllability and stability of the training process. Second, we present an optimized pitch embedding method that enhances flexibility and control over pitch manipulation, resulting in improved accuracy in pitch generation. Our experimental results demonstrate that our approach achieves state-of-the-art performance in the naturalness and speaking-style similarity of the synthesized speech for unseen target speakers.

Nonparallel: it generates speech in the desired "Style" for the "Unseen" target speaker
| Natural | Conformer FS2_PWGAN | VITS_ESPnet2 | Proposed |
|---|---|---|---|

