Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum

Authors: Mohammed Salah Al-Radhi, Riad Larbi, Mátyás Bartalis, Géza Németh

Abstract: Neural vocoders are central to speech synthesis; despite their success, most still suffer from limited prosody modeling and inaccurate phase reconstruction. We propose a vocoder that introduces prosody-guided harmonic attention to enhance voiced-segment encoding and directly predicts complex-spectral components for waveform synthesis via inverse STFT. Unlike mel-spectrogram–based approaches, our design jointly models magnitude and phase, ensuring phase coherence and improved pitch fidelity. To further align with perceptual quality, we adopt a multi-objective training strategy that integrates adversarial, spectral, and phase-aware losses. Experiments on benchmark datasets demonstrate consistent gains over HiFi-GAN and AutoVocoder: F0-RMSE reduced by 22%, voiced/unvoiced error lowered by 18%, and MOS scores improved by 0.15. These results show that prosody-guided attention combined with direct complex-spectrum modeling yields more natural, pitch-accurate, and robust synthetic speech, setting a strong foundation for expressive neural vocoding.
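For readers who want to experiment, below is a minimal PyTorch sketch of the synthesis path described in the abstract: predicting real and imaginary STFT components from encoder features and inverting them with a single inverse STFT. The layer sizes, FFT size, and hop length are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn


class ComplexSpectrumHead(nn.Module):
    """Maps encoder features to real/imaginary STFT components and inverts them."""

    def __init__(self, hidden_dim=512, n_fft=1024, hop_length=256):
        super().__init__()
        self.n_fft, self.hop_length = n_fft, hop_length
        n_bins = n_fft // 2 + 1
        self.to_real = nn.Linear(hidden_dim, n_bins)
        self.to_imag = nn.Linear(hidden_dim, n_bins)

    def forward(self, features):
        # features: (batch, frames, hidden_dim) from the upstream encoder
        real = self.to_real(features)                      # (batch, frames, n_bins)
        imag = self.to_imag(features)
        spec = torch.complex(real, imag).transpose(1, 2)   # (batch, n_bins, frames)
        window = torch.hann_window(self.n_fft, device=features.device)
        # Magnitude and phase are synthesized jointly: one inverse STFT, no Griffin-Lim.
        return torch.istft(spec, n_fft=self.n_fft, hop_length=self.hop_length, window=window)


# Example: 200 encoder frames -> a waveform of roughly (frames - 1) * hop_length samples.
waveform = ComplexSpectrumHead()(torch.randn(1, 200, 512))
```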

Model Architecture


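The architecture figure is not reproduced on this page. As a rough illustration of the prosody-guided attention idea named in the abstract, the sketch below biases a standard self-attention layer with a learned projection of the frame-level F0 contour on voiced frames. The module name, shapes, and the way prosody is injected are assumptions for illustration only, not the paper's exact design.

```python
import torch
import torch.nn as nn


class ProsodyGuidedAttention(nn.Module):
    """Self-attention whose queries are conditioned on the F0 contour (hypothetical sketch)."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.f0_proj = nn.Sequential(nn.Linear(1, dim), nn.Tanh())

    def forward(self, x, f0, voiced_mask):
        # x: (batch, frames, dim) encoder features
        # f0: (batch, frames) fundamental frequency in Hz (0 on unvoiced frames)
        # voiced_mask: (batch, frames) bool, True where the frame is voiced
        prosody = self.f0_proj(f0.unsqueeze(-1)) * voiced_mask.unsqueeze(-1)
        query = x + prosody              # pitch information steers the attention queries
        out, _ = self.attn(query, x, x)
        return x + out                   # residual connection around the attention block
```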
The samples below were generated from speech in the LJSpeech (female) and VCTK (male/female) datasets.


1) Female Speaker (LJSpeech)

Sample 1: LJ018-0213

Natural Anchor HiFi-GAN AutoVocoder Vocos Proposed
Transcript: which in dramatic interest can compare with the conviction of William Roupell for forgery.

Sample 2: LJ027-0125

Natural Anchor HiFi-GAN AutoVocoder Vocos Proposed
Transcript: In the Python we find very tiny rudiments of the hind limbs. Now,

Sample 3: LJ003-0093

Natural Anchor HiFi-GAN AutoVocoder Vocos Proposed
Transcript: The oldest only 12 or 13, exposed to all the contaminating influences of the place.

Sample 4: LJ010-0110

Natural Anchor HiFi-GAN AutoVocoder Vocos Proposed
Transcript: Several of the other prisoners took the same line as regards Edward's


2) Male Speaker (VCTK, p226)

Sample 1: p226_028

Natural Anchor HiFi-GAN AutoVocoder Vocos Proposed
Transcript: Suggestions of the action being a punishment were dismissed.

Sample 2: p226_157

Natural Anchor HiFi-GAN AutoVocoder Vocos Proposed
Transcript: Elaine Baxter did not have long to speak on Friday.

Sample 3: p226_299

Natural Anchor HiFi-GAN AutoVocoder Vocos Proposed
Transcript: There was no hint of scandal.

Sample 4: p226_068

Natural Anchor HiFi-GAN AutoVocoder Vocos Proposed
Transcript: Scotland has great assets.


Analysis Figures

F0 Error Distribution (Model − Original) on Voiced Frames
Figure 1. F0 error distribution between model outputs and the original signal on voiced frames. The Proposed model achieves the tightest clustering near zero error, indicating superior pitch tracking compared to HiFi-GAN, AutoVocoder, and Vocos.
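A hedged sketch of how the Figure 1 statistic can be computed with librosa: F0 is extracted with pyin for both signals, and the per-frame difference is taken only where both are voiced. The pitch range, sample rate, and hop length below are assumptions, not the paper's exact analysis settings.

```python
import numpy as np
import librosa


def f0_error_on_voiced(ref_wav, syn_wav, sr=22050, hop_length=256):
    """Per-frame F0 error (model minus original), restricted to mutually voiced frames."""
    f0_ref, voiced_ref, _ = librosa.pyin(ref_wav, fmin=65, fmax=500, sr=sr, hop_length=hop_length)
    f0_syn, voiced_syn, _ = librosa.pyin(syn_wav, fmin=65, fmax=500, sr=sr, hop_length=hop_length)
    n = min(len(f0_ref), len(f0_syn))
    both_voiced = voiced_ref[:n] & voiced_syn[:n]
    return f0_syn[:n][both_voiced] - f0_ref[:n][both_voiced]   # errors in Hz
```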
Residual Mel Energy vs Original (per frame) over Time
Figure 2. Residual mel energy (L2 difference) over time compared to the original. The Proposed model and HiFi-GAN maintain consistently low residuals, while AutoVocoder exhibits large error spikes and Vocos shows higher variability.
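Similarly, the Figure 2 curve can be approximated as the per-frame L2 distance between mel spectrograms (in dB) of the vocoded and original signals. The mel configuration below is an assumption.

```python
import numpy as np
import librosa


def residual_mel_energy(ref_wav, syn_wav, sr=22050, n_mels=80, hop_length=256):
    """Per-frame L2 difference between mel spectrograms (dB) of model output and original."""
    def mel_db(y):
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
        return librosa.power_to_db(mel, ref=np.max)

    ref, syn = mel_db(ref_wav), mel_db(syn_wav)
    n = min(ref.shape[1], syn.shape[1])
    return np.linalg.norm(syn[:, :n] - ref[:, :n], axis=0)   # one residual value per frame
```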
Mel-Spectrogram with F0 (Original vs. HiFi-GAN, AutoVocoder, Vocos, Proposed) with Insets
Figure 3. Mel-spectrograms with F0 overlays. HiFi-GAN produces smoother but weaker harmonics, AutoVocoder retains harmonics with noticeable smearing, and Vocos yields fuzzy, unstable structures. The Proposed method best preserves harmonic sharpness and F0 alignment, closely resembling the original.
Zoomed Mel+F0 comparison (Original, HiFi-GAN, AutoVocoder, Vocos, Proposed)
Figure 4. Zoomed-in Mel+F0 segment. The Proposed model maintains sharp, well-aligned harmonics, while AutoVocoder produces smeared bands, HiFi-GAN shows smoother but drifting pitch, and Vocos displays unstable harmonics.
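Overlays like Figures 3 and 4 can be reproduced with librosa's display utilities: a mel spectrogram in dB with the pyin F0 track drawn on top. This is an illustrative sketch, not the authors' plotting code.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt


def plot_mel_with_f0(y, sr=22050, hop_length=256, title="Mel + F0"):
    """Plot a mel spectrogram (dB) with the estimated F0 contour overlaid."""
    mel_db = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length), ref=np.max)
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=500, sr=sr, hop_length=hop_length)
    times = librosa.times_like(f0, sr=sr, hop_length=hop_length)

    fig, ax = plt.subplots(figsize=(8, 3))
    librosa.display.specshow(mel_db, sr=sr, hop_length=hop_length,
                             x_axis="time", y_axis="mel", ax=ax)
    ax.plot(times, f0, color="cyan", linewidth=1.5, label="F0 (Hz)")
    ax.set(title=title)
    ax.legend(loc="upper right")
    return fig
```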