A Hybrid Mamba-Transformer Backbone for Speech Synthesis
Despite the remarkable quality of LLM-based text-to-speech systems, their reliance on autoregressive Transformers leads to computational complexity that is quadratic in sequence length, severely limiting practical deployment. Linear-time alternatives like Mamba offer efficiency, but often sacrifice global context.
We propose MamTra, an interleaved Mamba-Transformer framework designed to combine Mamba's efficiency with Transformers' global modeling capability. Novel knowledge transfer strategies distill knowledge from a pretrained Transformer teacher into the hybrid architecture, bypassing the prohibitive cost of training from scratch. Finetuned on only 2% of the teacher's data, MamTra reduces VRAM usage by 34% without compromising speech quality.
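To make the interleaving idea concrete, the sketch below assembles a backbone from alternating Mamba and self-attention blocks. It is a minimal sketch assuming PyTorch and the mamba_ssm package; the layer pattern, model dimension, and block internals are illustrative placeholders, not the configuration reported in the paper.

```python
# Illustrative interleaved Mamba-Transformer backbone (not the paper's exact configuration).
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # linear-time selective state-space block


class AttentionBlock(nn.Module):
    """Pre-norm multi-head self-attention block (global context)."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class MambaBlock(nn.Module):
    """Pre-norm Mamba block (linear-time sequence mixing)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model)

    def forward(self, x):
        return x + self.mixer(self.norm(x))


class HybridBackbone(nn.Module):
    """24-layer interleaved backbone: 'M' = Mamba block, 'T' = attention block."""

    def __init__(self, d_model: int = 768, pattern: str = "MMMT" * 6):
        super().__init__()
        self.layers = nn.ModuleList(
            [MambaBlock(d_model) if c == "M" else AttentionBlock(d_model) for c in pattern]
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        for layer in self.layers:
            x = layer(x)
        return x


# Usage (the mamba_ssm kernels require a CUDA device):
#   model = HybridBackbone().cuda()
#   tokens = torch.randn(2, 256, 768, device="cuda")  # dummy speech-token embeddings
#   hidden = model(tokens)                            # (2, 256, 768)
```

The pattern string is just one compact way to express where attention layers sit among the Mamba layers; the paper evaluates several such interleaved configurations.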
Listen to the reference audio and MamTra's synthesized output. Click the waveforms or use the controls below.
We chose interleaved configurations that balance efficiency and expressivity; these are reported in the paper, and their samples are shown in the section below.
Each row shows the 24-layer backbone. Hover over a block to see its type. (May not work in mobile browsers.)
MamTra is not only a new architecture; it also provides a practical analysis of hybrid Mamba-Transformer designs.
Each row contains synthesized speech from the same prompt across all models. GT is the ground-truth reference speaker used for voice cloning.
Note: Zonos's UTMOS score is extremely low because some of its samples contain severe artifacts (e.g., sample 17), which significantly drags down the average.