A Hybrid Mamba-Transformer Backbone for Speech Synthesis
Despite the remarkable quality of LLM-based text-to-speech systems, their reliance on autoregressive Transformers leads to computational complexity that is quadratic in sequence length, severely limiting practical deployment. Linear-time alternatives like Mamba offer efficiency, but often sacrifice global context.
We propose MamTra, an interleaved Mamba-Transformer framework designed to combine Mamba's efficiency with Transformers' global modeling capability. Novel knowledge transfer strategies distill knowledge from a pretrained Transformer teacher into the hybrid architecture, bypassing the prohibitive cost of training from scratch. Finetuned on only 2% of the teacher's data, MamTra reduces VRAM usage by 34% without compromising speech quality.
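To make the interleaving idea concrete, the sketch below assembles a backbone from alternating Mamba and self-attention blocks. It is a minimal sketch assuming PyTorch and the mamba_ssm package; the layer pattern, model dimension, and block internals are illustrative placeholders, not the configuration reported in the paper.

```python
# Illustrative interleaved Mamba-Transformer backbone (not the paper's exact configuration).
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # linear-time selective state-space block


class AttentionBlock(nn.Module):
    """Pre-norm multi-head self-attention block (global context)."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class MambaBlock(nn.Module):
    """Pre-norm Mamba block (linear-time sequence mixing)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model)

    def forward(self, x):
        return x + self.mixer(self.norm(x))


class HybridBackbone(nn.Module):
    """24-layer interleaved backbone: 'M' = Mamba block, 'T' = attention block."""

    def __init__(self, d_model: int = 768, pattern: str = "MMMT" * 6):
        super().__init__()
        self.layers = nn.ModuleList(
            [MambaBlock(d_model) if c == "M" else AttentionBlock(d_model) for c in pattern]
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        for layer in self.layers:
            x = layer(x)
        return x


# Usage (the mamba_ssm kernels require a CUDA device):
#   model = HybridBackbone().cuda()
#   tokens = torch.randn(2, 256, 768, device="cuda")  # dummy speech-token embeddings
#   hidden = model(tokens)                            # (2, 256, 768)
```

The pattern string is just one compact way to express where attention layers sit among the Mamba layers; the paper evaluates several such interleaved configurations.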
Listen to the reference audio and MamTra's synthesized output. Click the waveforms or use the controls below.
We chose interleaved configurations that balance efficiency and expressivity; these are reported in the paper, and their samples are shown in the section below.
Each row shows the 24-layer backbone. Hover over a block to see its type. (May not work in mobile browsers.)
MamTra is not only a new architecture; it also provides a practical analysis of hybrid Mamba-Transformer designs.
Each row contains synthesized speech from the same prompt across all models. GT is the ground-truth reference speaker used for voice cloning.
Note: Zonos's UTMOS score is extremely low because some of its samples contain severe artifacts (e.g., sample 17), which significantly drags down the average.