Advancing Limited Data Text-to-Speech Synthesis: Non-Autoregressive Transformer for High-Quality Parallel Synthesis

Authors: Mohammed Salah Al-Radhi, Omnia Ibrahim, Ali Raheem Mandeel, Tamás Gábor Csapó, Géza Németh

Inference: pretrained model


Fast Arabic Single-Speaker TTS

Phoneme Sequence #1: "lakin~a diraAsatahumo - >a^abotato >an~a Alomu$okilapa bimiSora - layosato faqaTo fiy kam~iy~api AlT~aEaAmi"

Natural Tacotron_WaveGlow FastSp2_HiFi FastSp2_PWG


Phoneme Sequence #2: "yasotaDiyfu maEohadu AloEaAlami AloEarabiy~i fiy baAriysa - maEoriDAF biEunowaAni - kaAna yaA makaAn - qiTaAru Al$~aroqi Als~ariyEu"

Natural Tacotron_WaveGlow FastSp2_HiFi FastSp2_PWG


Phoneme Sequence #3: "AloHimoDiy~aAtu gany~apN bimukawonaAtK SiH~iy~apK lijisomi AloinosaAni"

Natural Tacotron_WaveGlow FastSp2_HiFi FastSp2_PWG


Phoneme Sequence #4: "watu&ak~idu EaAlimapu Aln~afosi >an~a Alo>asobaAba AlomanoTiqiy~apa AlomuHaf~izapa EalaY mumaArasapi Alr~iy~aADapi - laA takofiy waHodahaA"

Natural Tacotron_WaveGlow FastSp2_HiFi FastSp2_PWG


Phoneme Sequence #5: "ak~ada AlokaAtibu waAln~aAqidu"

Natural Tacotron_WaveGlow FastSp2_HiFi FastSp2_PWG

Visualization

The image below shows spectrograms and pitch contours extracted from synthesized speech samples. Top-left: ground truth; top-right: Tacotron2; bottom-right: FastSp2-HiFi; bottom-left: the developed FastSp2-PWG.
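
For readers who want to produce a comparable view from their own audio, the following is a minimal sketch of how a log-magnitude spectrogram and a frame-wise pitch contour could be computed. It is not the authors' analysis pipeline: it uses only NumPy, a plain short-time Fourier transform, and a simple autocorrelation peak picker in place of whatever pitch tracker was used for the figure. The function names (`stft_log_magnitude`, `autocorr_pitch`), the frame and hop sizes, the 60-400 Hz search range, and the synthetic 200 Hz test tone are all assumptions made for illustration.

```python
import numpy as np


def stft_log_magnitude(signal, frame_len=1024, hop=256):
    """Log-magnitude spectrogram in dB, shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, Hann-windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Small offset avoids log(0) on silent frames.
    return 20.0 * np.log10(magnitude + 1e-10)


def autocorr_pitch(signal, sample_rate, frame_len=1024, hop=256,
                   fmin=60.0, fmax=400.0):
    """One F0 estimate (Hz) per frame from the autocorrelation peak.

    Frames with no energy are reported as 0.0. No voicing decision is
    made beyond that, so this is only a rough contour.
    """
    min_lag = int(sample_rate / fmax)  # shortest period searched
    max_lag = int(sample_rate / fmin)  # longest period searched
    n_frames = 1 + (len(signal) - frame_len) // hop
    f0 = np.zeros(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        frame = frame - frame.mean()
        # Keep non-negative lags only; index 0 is the frame energy.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        if ac[0] <= 0.0:
            continue
        lag = min_lag + int(np.argmax(ac[min_lag : max_lag + 1]))
        f0[i] = sample_rate / lag
    return f0


# Synthetic stand-in for a speech sample: half a second of a 200 Hz tone.
sample_rate = 16000
t = np.arange(sample_rate // 2) / sample_rate
tone = np.sin(2.0 * np.pi * 200.0 * t)

spec = stft_log_magnitude(tone)
f0 = autocorr_pitch(tone, sample_rate)
print(spec.shape, float(np.median(f0)))
```

With real recordings, `tone` would be replaced by the decoded waveform of each system's output, and `spec` and `f0` could be passed to any plotting library to draw the four panels side by side.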