Tiny Audio

A speech recognition model trained in about 24 hours on a single GPU for roughly $12. Built with Tiny Audio, a minimal, hackable ASR framework.

Architecture

Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3-0.6B Decoder (frozen) → Text

Only the projector (~12M params) is trained; the GLM-ASR encoder and Qwen3 decoder remain frozen.
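
A minimal sketch of what such a projector looks like in PyTorch (layer sizes, the GELU activation, and the class name are illustrative assumptions, not the repo's exact configuration):

import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    # Maps frozen encoder features into the frozen decoder's embedding space.
    def __init__(self, encoder_dim=1280, hidden_dim=2048, decoder_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, decoder_dim),
        )

    def forward(self, audio_features):
        # audio_features: (batch, frames, encoder_dim) from the frozen encoder
        return self.net(audio_features)  # (batch, frames, decoder_dim)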

Training

Dataset: LoquaciousSet (25,000 hours)
Hardware: single NVIDIA A40
Time: ~24 hours
Cost: ~$12
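
The recipe reduces to freezing the encoder and decoder and optimizing only the projector. A minimal sketch of that setup (the function name and learning rate are illustrative, not the repo's actual training code):

import torch

def freeze_all_but_projector(encoder, projector, decoder, lr=1e-4):
    # Freeze the pretrained encoder and decoder weights.
    for module in (encoder, decoder):
        for p in module.parameters():
            p.requires_grad = False
    # Only the projector's ~12M parameters receive gradient updates.
    return torch.optim.AdamW(projector.parameters(), lr=lr)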

Usage

from transformers import pipeline

# trust_remote_code=True loads the model's custom preprocessing and modeling code from the Hub repo.
pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
result = pipe("audio.wav")  # transcribe a local audio file
print(result["text"])
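
For audio that is already loaded in memory, the same pipe object also accepts a dict with the raw waveform and its sampling rate (a sketch assuming the soundfile package is installed; other sample rates are resampled as noted below):

import soundfile as sf

audio, sr = sf.read("audio.wav")  # mono float waveform plus its sample rate
result = pipe({"raw": audio, "sampling_rate": sr})
print(result["text"])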

Limitations

  • English only
  • Expects 16kHz audio (other sample rates are resampled automatically)
  • May degrade on accented speech, noisy audio, or domain-specific terms
