📄 Interspeech 2026

MamTra

A Hybrid Mamba-Transformer Backbone for Speech Synthesis

Anonymized for review

📄 Paper · Anonymous GitHub Repo · ▶ Audio Samples

Efficiency meets expressivity.

Despite the remarkable quality of LLM-based text-to-speech systems, their reliance on autoregressive Transformers leads to quadratic computational complexity, severely limiting practical deployment. Linear-time alternatives like Mamba offer efficiency, but often sacrifice global context.

We propose MamTra, an interleaved Mamba-Transformer framework designed to leverage Mamba's efficiency alongside Transformers' modeling capability. Novel knowledge transfer strategies distill insights from a pretrained Transformer into the hybrid architecture — bypassing prohibitive training costs. Finetuned on only 2% of the teacher's data, MamTra reduces VRAM by 34% without compromising speech quality.
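As a rough illustration of the knowledge-transfer step, the sketch below implements a standard softened-softmax distillation loss (KL divergence between teacher and student token distributions). This is a generic formulation, not the paper's exact objective, and the temperature value is an assumption:

```python
import math

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Token-level knowledge-distillation loss: KL(teacher || student)
    over temperature-softened softmax distributions. Generic sketch;
    the paper's actual distillation objective may differ."""
    def softmax(xs, t):
        m = max(xs)  # subtract max for numerical stability
        exps = [math.exp((x - m) / t) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    # Scale by T^2 so gradients keep their magnitude as T changes.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * temperature ** 2
```

The loss is zero when the student matches the teacher exactly and grows as the distributions diverge.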

📢 To ensure reproducibility, we currently provide inference code for MamTra 1:1. The other configurations and the training code will be released once we have verified that all code is properly anonymized, in accordance with Interspeech's review policy.

−34%
VRAM Usage
Sub-Linear
Inference Cache
2%
Distillation Data

Pipeline Demo

Listen to reference audio and MamTra synthesized output. Click the waveforms or use the controls below.

Text Input
“Hello, how are you doing today? I hope you have a very nice day with great results.”
Reference Audio
Transformer
Mamba
Output Speech (MamTra)

Hybrid Layer Configurations

The interleaved configurations chosen for the best efficiency–expressivity trade-off. These are the configurations reported in the paper; their audio samples appear in the section below.

Each row shows the 24-layer backbone. Hover over a block to see its type (hovering may not work in mobile browsers).

Legend: Transformer / Mamba
MamTra 1 : 1
MamTra 1 : 3
MamTra 1 : 5
MamTra 1 : 11
(Layer indices 0–23.)
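As an illustration, the interleaved patterns above can be generated by tiling a repeating Transformer/Mamba block across the 24-layer backbone. The exact placement of layers here is a hypothetical sketch; the paper's ordering may differ:

```python
def build_layer_pattern(n_layers: int, t: int, m: int) -> list[str]:
    """Tile a block of `t` Transformer ("T") layers followed by `m`
    Mamba ("M") layers until `n_layers` is reached. Hypothetical
    placement; the paper may interleave layers differently."""
    block = ["T"] * t + ["M"] * m
    assert n_layers % len(block) == 0, "ratio must tile the backbone evenly"
    return block * (n_layers // len(block))

# The four reported ratios on a 24-layer backbone.
for ratio in [(1, 1), (1, 3), (1, 5), (1, 11)]:
    pattern = build_layer_pattern(24, *ratio)
    print(f"MamTra {ratio[0]}:{ratio[1]} ->", "".join(pattern))
```

Note how a higher Mamba share (e.g., 1:11) leaves only two Transformer layers in the stack, which is what shrinks the KV cache in the figures below.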

Architecture & Analysis

MamTra is not only a new architecture but also a practical analysis of hybrid Mamba-Transformer designs.

Overview of MamTra architecture and training strategy
Figure 1. Overview of the hybrid Mamba-Transformer configurations for speech synthesis. Selective Transformer-to-Mamba layer transfer reduces training cost and accelerates convergence, while performance is recovered via knowledge distillation using less than 2% of the teacher's English training data.
Alignment between SSM recurrence and linearized attention
Figure 2. Following the alignment between the SSM recurrence and linearized attention, projection weights for C, B, and x in Mamba are initialized with the Transformer's Q, K, and V projection weights, respectively.
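The weight-transfer initialization in Figure 2 can be sketched as a direct copy of the attention projections onto the corresponding Mamba projections. Plain nested lists stand in for tensors here, and the projection names are hypothetical; a real implementation would copy framework weight matrices:

```python
def transfer_qkv_to_mamba(q_weight, k_weight, v_weight):
    """Seed Mamba's projection weights from a pretrained attention
    layer, following the SSM-recurrence / linearized-attention
    alignment: C <- Q, B <- K, x <- V. Key names are hypothetical."""
    copy = lambda w: [row[:] for row in w]  # deep-copy each weight row
    return {
        "C_proj": copy(q_weight),  # C plays the query's role
        "B_proj": copy(k_weight),  # B plays the key's role
        "x_proj": copy(v_weight),  # x plays the value's role
    }
```

Copying (rather than aliasing) the rows keeps the student's weights independent of the teacher's after initialization.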
VRAM usage comparison across models
Figure 3. Memory usage (GB) on Seed-TTSeval (NVIDIA A6000). MamTra 1:1 reduces VRAM by 34% and 17% compared to CosyVoice 2 and Zonos-v0.1, respectively.
Correlation between CE loss and WER across models
Figure 4. Cross-entropy (CE) loss and Word Error Rate (WER) on Seed-TTSeval for hybrid Mamba-Transformer variants after 15 training epochs on LibriTTS (0.5k h). CE loss exhibits consistent correlation with WER across different ratios and placement strategies.
Cache size growth comparison between models
Figure 5. Cache size growth in the hybrid model, where sequence-length dependency scales with the number of remaining Transformer layers. Mamba state remains nearly constant while KV-cache growth is significantly attenuated.
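The cache behavior in Figure 5 follows from a simple accounting: each remaining Transformer layer stores a KV cache that grows linearly with sequence length, while each Mamba layer keeps a fixed-size recurrent state. The sketch below uses illustrative dimensions, not the paper's actual model sizes:

```python
def cache_size_bytes(seq_len: int, n_transformer: int, n_mamba: int,
                     d_model: int = 1024, state_dim: int = 16,
                     bytes_per_elem: int = 2) -> int:
    """Rough inference-cache estimate for a hybrid backbone.
    Illustrative default dimensions (assumptions, not the paper's)."""
    kv = n_transformer * 2 * seq_len * d_model  # K and V per Transformer layer
    ssm = n_mamba * d_model * state_dim         # constant SSM state per Mamba layer
    return (kv + ssm) * bytes_per_elem

# Full 24-layer Transformer vs. MamTra 1:1 at 4k tokens.
full = cache_size_bytes(4096, n_transformer=24, n_mamba=0)
hybrid = cache_size_bytes(4096, n_transformer=12, n_mamba=12)
```

Because only the KV term depends on `seq_len`, the hybrid's cache growth is attenuated in proportion to the Transformer layers removed, and a pure-Mamba stack's cache is constant in sequence length.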
Convergence speed comparison between models
Figure 6. Convergence speed after 15 epochs. Reusing pretrained Transformer weights (Reused) leads to significantly faster convergence and lower final loss compared to Xavier or Kaiming initialization.

Audio Samples

Each row contains synthesized speech from the same prompt across all models. GT is the ground-truth reference speaker used for voice cloning.

Note: Zonos's UTMOS is extremely low because of samples with severe artifacts (e.g., sample 17), which significantly drag down the average score.

GT (Reference Voice & Style) · CosyVoice 2 · LLaSA-1B · Zonos-v0.1 · MamTra 1:1 · MamTra 1:3 · MamTra 1:5 · MamTra 1:11

BibTeX

@inproceedings{mamtra2026,
  title     = {MamTra: A Hybrid Mamba-Transformer Backbone for Speech Synthesis},
  author    = {Anonymous},
  booktitle = {Submitted to Interspeech},
  year      = {2026}
}