Text-to-Speech
On-device text-to-speech running entirely on WebGPU. Kani-TTS is the fast default voice; OuteTTS adds a set of preset voices and voice cloning; Parler-TTS lets you design a voice by describing it in plain text. Audio is synthesized in the browser and never leaves the device.
Call engine.speak() to turn text into 22.05 kHz audio in the browser.
Native Model
| Model | Architecture | Output | Backend | Status |
|---|---|---|---|---|
| Kani-TTS (en) · default | LFM2-350M codec-LM + NVIDIA NanoCodec | 22.05 kHz mono PCM | WebGPU (WGSL) | Live |
| OuteTTS 0.6B | Multi-voice — preset voices + voice cloning | 24 kHz mono PCM | WebGPU (WGSL) | Live |
| Parler-TTS Mini v1 | Voice design — describe a voice in plain text | 44.1 kHz mono PCM | WebGPU (WGSL) | Live |
The default backbone repo is nineninesix/kani-tts-450m-0.2-ft (~492 MB at int4). The first speak() call also lazily downloads the NanoCodec decoder checkpoint (the NeMo 22 kHz MLX codec). Both are cached after the first run. Prefer multiple distinct voices or want to clone one? Switch to OuteTTS (repo OuteAI/OuteTTS-1.0-0.6B). Want to design a voice from a plain-text description instead? Switch to Parler-TTS (repo parler-tts/parler-tts-mini-v1).
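Because each model outputs a different sample rate, clip length and buffer size depend on which checkpoint is loaded. The sketch below restates the table as a lookup and derives two quantities from it. `TTS_MODELS`, `clipSeconds`, and `pcmBytes` are hypothetical names for illustration, not Gerbil exports; the repos and sample rates are the ones listed above.

```typescript
// Hypothetical lookup built from the model table above (not a Gerbil export).
const TTS_MODELS = {
  kani: { repo: "nineninesix/kani-tts-450m-0.2-ft", sampleRate: 22050 },
  oute: { repo: "OuteAI/OuteTTS-1.0-0.6B", sampleRate: 24000 },
  parler: { repo: "parler-tts/parler-tts-mini-v1", sampleRate: 44100 },
} as const;

// Duration of a mono clip, given its sample count and sample rate.
function clipSeconds(sampleCount: number, sampleRate: number): number {
  return sampleCount / sampleRate;
}

// Memory needed to hold a mono Float32Array clip (4 bytes per sample).
function pcmBytes(seconds: number, sampleRate: number): number {
  return Math.ceil(seconds * sampleRate) * Float32Array.BYTES_PER_ELEMENT;
}

const tenSecondsOfParler = pcmBytes(10, TTS_MODELS.parler.sampleRate);
console.log(`10 s of Parler audio needs ${tenSecondsOfParler} bytes`);
```

The same ten seconds takes half as much memory at Kani's 22.05 kHz, which is worth keeping in mind when caching clips for replay.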
Quick Start (React): useTTS
In React, use useTTS from @tryhamster/gerbil/hooks. It lazily loads the engine, synthesizes, plays the audio through Web Audio, and keeps the clip for replay(). It defaults to the Kani-TTS-2 model — no model argument needed:
```tsx
"use client";

import { useTTS } from "@tryhamster/gerbil/hooks";

function Speaker() {
  const { speak, replay, isSynthesizing, isPlaying, hasAudio, rtf } = useTTS();

  return (
    <div>
      <button
        onClick={() =>
          speak("Hello from Gerbil! This runs entirely on-device.", {
            voice: "en_us", // language/accent tag (see Voices below)
            temperature: 1.0, // sampling controls (see Sampling Controls below)
            topP: 0.95,
            repetitionPenalty: 1.1,
          })
        }
        disabled={isSynthesizing || isPlaying}
      >
        {isSynthesizing ? "Synthesizing…" : "Speak"}
      </button>
      {hasAudio && <button onClick={replay} disabled={isPlaying}>Play again</button>}
      {rtf && <span>{rtf.toFixed(1)}× real-time</span>}
    </div>
  );
}
```
Advanced: engine.speak() (vanilla JS / Node)
Outside React, create the engine directly and call speak(). It returns the raw PCM samples and sample rate — no <audio>-ready file, so wrap the Float32Array in a Web Audio AudioBuffer at 22050 Hz to play it:
```ts
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "nineninesix/kani-tts-450m-0.2-ft",
});

const { pcm, sampleRate, frames, audioSeconds } = await engine.speak(
  "Hello from Gerbil! This runs entirely on-device.",
  { languageTag: "en_us", temperature: 1.0, topP: 0.95, repetitionPenalty: 1.1 },
);
// pcm → Float32Array, mono, [-1, 1] · sampleRate → 22050

// Play through Web Audio:
const ctx = new AudioContext();
const buffer = ctx.createBuffer(1, pcm.length, sampleRate);
buffer.copyToChannel(pcm, 0);
const source = ctx.createBufferSource();
source.buffer = buffer;
source.connect(ctx.destination);
source.start();
```
Voices
The default checkpoint (kani-tts-450m-0.2-ft, Apache-2.0) ships a single English voice — call ttsVoices() (or KaniTTS.availableVoices()) to see what the loaded checkpoint supports. Some checkpoints add selectable accent tags you pass via the voice option; the legacy nineninesix/kani-tts-2-en checkpoint, for example, exposes six US/UK accents (en_us, en_nyork, en_oakl, en_glasg, en_bost, en_scou). On a single-voice checkpoint the voice option is simply ignored.
On the default Kani checkpoint, beyond accent selection the expressive range comes from the sampling controls below, and audio is the codec's native 22.05 kHz (fixed). For multiple distinct preset voices — and to clone a voice from a short clip — use OuteTTS.
Multi-voice: OuteTTS
OuteTTS-1.0-0.6B is the multi-voice option. It ships six preset voices and supports voice cloning from a short reference clip. Kani-TTS stays the default — select OuteTTS by passing its repo to useTTS:
```tsx
"use client";

import { useTTS, ttsVoices } from "@tryhamster/gerbil/hooks";
import { DEFAULT_MODELS } from "@tryhamster/gerbil";

function OuteSpeaker() {
  // Select OuteTTS via its repo. DEFAULT_MODELS.ttsOute resolves to
  // "OuteAI/OuteTTS-1.0-0.6B"; DEFAULT_MODELS.tts stays the Kani default.
  const { speak, isSynthesizing } = useTTS({ repo: DEFAULT_MODELS.ttsOute });

  // Preset voices for the loaded checkpoint (six for OuteTTS).
  const voices = ttsVoices(DEFAULT_MODELS.ttsOute);

  return (
    <button
      onClick={() =>
        speak("Now speaking with an OuteTTS preset voice.", {
          voice: voices[0].value, // e.g. "en-female-1-neutral"
          temperature: 1.0,
          topP: 0.95,
          repetitionPenalty: 1.1,
        })
      }
      disabled={isSynthesizing}
    >
      Speak
    </button>
  );
}
```
The hook returns the right sample rate per checkpoint automatically (Kani synthesizes at 22.05 kHz, OuteTTS at 24 kHz), so you never have to hardcode it when wrapping the PCM in a Web Audio buffer.
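If you need a file rather than live playback, the PCM samples and the reported sample rate are enough to build a WAV container yourself. The sketch below is a generic 16-bit PCM WAV encoder, not part of Gerbil: `encodeWav` is a hypothetical helper, and it assumes mono Float32Array samples in [-1, 1].

```typescript
// Encode mono Float32 PCM in [-1, 1] as a 16-bit PCM WAV file.
// Generic WAV layout: 44-byte RIFF header followed by little-endian samples.
function encodeWav(pcm: Float32Array, sampleRate: number): ArrayBuffer {
  const bytesPerSample = 2;
  const dataBytes = pcm.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataBytes);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // remaining file size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format: integer PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataBytes, true);

  for (let i = 0; i < pcm.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, pcm[i]));
    const scaled = s < 0 ? s * 0x8000 : s * 0x7fff;
    view.setInt16(44 + i * bytesPerSample, Math.round(scaled), true);
  }
  return buffer;
}
```

In the browser, wrapping the result in `new Blob([wav], { type: "audio/wav" })` gives you something an `<audio>` element or a download link can use. Passing the sample rate returned with the PCM, rather than a constant, keeps the header correct for every checkpoint.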
OuteTTS can also be selected from the one-liner and the CLI:
```ts
// One-liner: pass the OuteTTS repo and a preset voice.
import { speak } from "@tryhamster/gerbil";

await speak("Hello from OuteTTS!", {
  model: "OuteAI/OuteTTS-1.0-0.6B",
  voice: "en-female-1-neutral",
});
```
```bash
# CLI: synthesize with OuteTTS and a preset voice.
gerbil speak "Hello from OuteTTS!" \
  --model OuteAI/OuteTTS-1.0-0.6B \
  --voice en-female-1-neutral \
  --out hello.wav
```
Preset voices
Read the available voices at runtime with ttsVoices(DEFAULT_MODELS.ttsOute) (or the OUTE_VOICES export). OuteTTS exposes six presets — a mix of male/female and neutral/expressive English voices — that you pass through the voice option:
| Voice | Character |
|---|---|
| en-female-1-neutral | English female, neutral delivery |
| en-female-2-neutral | English female, neutral delivery |
| en-male-1-neutral | English male, neutral delivery |
| en-male-2-neutral | English male, neutral delivery |
| en-female-1-expressive | English female, expressive delivery |
| en-male-1-expressive | English male, expressive delivery |
Always read the live list from ttsVoices() rather than hardcoding — the presets are advertised by the loaded checkpoint.
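One way to follow that advice is to validate a saved preference against the live list before using it. The sketch below is a minimal illustration: `pickVoice` and the `VoiceOption` shape are assumptions (the only field relied on is `value`, which the example above reads from `ttsVoices()`).

```typescript
// Assumed minimal shape of an entry returned by ttsVoices().
interface VoiceOption {
  value: string;
}

// Return the preferred voice if the loaded checkpoint advertises it,
// otherwise fall back to the first advertised voice (or undefined if none).
function pickVoice(
  voices: readonly VoiceOption[],
  preferred?: string,
): string | undefined {
  if (preferred !== undefined && voices.some((v) => v.value === preferred)) {
    return preferred;
  }
  return voices[0]?.value;
}

// Example with a stand-in list; in an app this would come from ttsVoices().
const advertised: VoiceOption[] = [
  { value: "en-female-1-neutral" },
  { value: "en-male-1-expressive" },
];
console.log(pickVoice(advertised, "en-male-1-expressive"));
console.log(pickVoice(advertised, "en-unknown-voice"));
```

Falling back to the first entry keeps the UI working when a stored voice name belongs to a different checkpoint.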
Voice cloning
Beyond the presets, OuteTTS can clone a voice from raw audio: provide a short reference clip and its transcript, and the engine encodes it into a speaker it then synthesizes new text in. Provide the clip as a Float32Array / AudioBuffer (or a data URL in the browser) along with the exact words spoken in it:
```tsx
import { useTTS } from "@tryhamster/gerbil/hooks";
import { DEFAULT_MODELS } from "@tryhamster/gerbil";

// clip and transcript are supplied by the parent (e.g. from a recorder or file input).
function VoiceCloner({ clip, transcript }: { clip: Float32Array; transcript: string }) {
  const tts = useTTS({ repo: DEFAULT_MODELS.ttsOute });

  async function cloneAndSpeak(referenceAudio: Float32Array, referenceTranscript: string) {
    // Encode the reference clip into a speaker, then synthesize new text in it.
    await tts.speakAs("Now I sound like the reference clip.", {
      referenceAudio, // the short sample clip
      referenceTranscript, // exactly what is said in that clip
      temperature: 1.0,
      topP: 0.95,
    });
  }

  return <button onClick={() => cloneAndSpeak(clip, transcript)}>Clone & speak</button>;
}
```
A clean, 5–15 second clip with an accurate transcript clones best. Keep the synthesized text reasonably short for the most natural result. Cloning runs fully on-device, and the reference audio never leaves the browser.
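A quick pre-flight check can catch clips that fall outside that guidance before you spend time encoding them. This is a sketch under stated assumptions: `checkReferenceClip` is a hypothetical helper, the 5 and 15 second bounds come from the recommendation above rather than from any limit enforced by the engine, and the clip is assumed to be mono samples at a known sample rate.

```typescript
interface ClipCheck {
  seconds: number;
  ok: boolean;
  reason?: string;
}

// Check a mono reference clip against the recommended duration window and
// make sure a transcript was actually provided.
function checkReferenceClip(
  samples: Float32Array,
  sampleRate: number,
  transcript: string,
  minSeconds = 5,
  maxSeconds = 15,
): ClipCheck {
  const seconds = samples.length / sampleRate;
  if (transcript.trim().length === 0) {
    return { seconds, ok: false, reason: "transcript is empty" };
  }
  if (seconds < minSeconds) {
    return { seconds, ok: false, reason: `clip is shorter than ${minSeconds} s` };
  }
  if (seconds > maxSeconds) {
    return { seconds, ok: false, reason: `clip is longer than ${maxSeconds} s` };
  }
  return { seconds, ok: true };
}

// Ten seconds of silence at 16 kHz stands in for a real recording here.
const check = checkReferenceClip(new Float32Array(16000 * 10), 16000, "hello there");
console.log(check.ok, check.seconds);
```

Surfacing `reason` in the UI gives the user a concrete fix (record longer, trim, or type the transcript) instead of a poor clone.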
Voice design: Parler-TTS
Parler-TTS-mini-v1 takes a different approach to voice: instead of picking from a list of presets or cloning a clip, you describe the voice you want in plain English and the model designs a speaker to match. Kani-TTS stays the default — select Parler by passing its repo to useTTS, then pass your description through the describeVoice option:
```tsx
"use client";

import { useTTS } from "@tryhamster/gerbil/hooks";
import { DEFAULT_MODELS } from "@tryhamster/gerbil";

function ParlerSpeaker() {
  // Select Parler-TTS via its repo. DEFAULT_MODELS.ttsParler resolves to
  // "parler-tts/parler-tts-mini-v1"; DEFAULT_MODELS.tts stays the Kani default.
  const { speak, isSynthesizing } = useTTS({ repo: DEFAULT_MODELS.ttsParler });

  return (
    <button
      onClick={() =>
        speak("Now speaking with a voice designed from a description.", {
          // Describe the voice in plain text; Parler builds a speaker to match.
          describeVoice:
            "a warm female voice, slightly slow, very clear studio recording",
          temperature: 1.0,
          topP: 0.95,
        })
      }
      disabled={isSynthesizing}
    >
      Speak
    </button>
  );
}
```
The same selection works from the one-liner and CLI. Pass the Parler repo and your describeVoice prompt:
```ts
// One-liner: pass the Parler repo and a voice description.
import { speak, DEFAULT_MODELS } from "@tryhamster/gerbil";

await speak("Hello from Parler-TTS!", {
  model: DEFAULT_MODELS.ttsParler, // "parler-tts/parler-tts-mini-v1"
  describeVoice:
    "a warm female voice, slightly slow, very clear studio recording",
});
```
```bash
# CLI: synthesize with Parler and a voice description.
gerbil speak "Hello from Parler-TTS!" \
  --model parler-tts/parler-tts-mini-v1 \
  --describe-voice "a warm female voice, slightly slow, very clear studio recording" \
  --out hello.wav
```
Describing a voice
The describeVoice string is a natural-language prompt that Parler's Flan-T5 description encoder turns into the conditioning the decoder uses to design the speaker. There's no fixed vocabulary — describe gender, tone, pace, and the recording setting, and Parler synthesizes to match. Cues that tend to land well: timbre (warm, deep, bright, gravelly), pace (slightly slow, measured, quick), and recording quality (very clear studio recording, close-sounding, no background noise). A few examples:
| describeVoice |
|---|
| A warm female voice, speaking slightly slowly, in a very clear and close-sounding studio recording. |
| A deep, calm male voice with a measured pace, recorded in a quiet room with no background noise. |
| A bright, energetic female voice speaking quickly and expressively, with crisp studio-quality audio. |
When you omit describeVoice, a neutral default description is used.
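If your UI collects these cues separately (for example from dropdowns), you can assemble the description string before passing it as describeVoice. The sketch below is one possible way to do that: `VoiceTraits` and `describeVoiceFrom` are hypothetical names, and the sentence template is an assumption modeled on the examples above, not a format Parler requires.

```typescript
interface VoiceTraits {
  gender: string; // e.g. "female"
  timbre: string; // e.g. "warm", "deep", "bright"
  pace: string; // e.g. "slightly slow", "measured"
  recording: string; // e.g. "very clear studio recording"
}

// Prefix a phrase with "a" or "an" based on its first letter.
function withArticle(phrase: string): string {
  return `${/^[aeiou]/i.test(phrase) ? "an" : "a"} ${phrase}`;
}

// Build a single natural-language sentence from the individual cues.
function describeVoiceFrom(traits: VoiceTraits): string {
  const voice = withArticle(`${traits.timbre} ${traits.gender} voice`);
  const pace = withArticle(`${traits.pace} pace`);
  const recording = withArticle(traits.recording);
  const sentence = `${voice}, speaking at ${pace}, in ${recording}.`;
  return sentence.charAt(0).toUpperCase() + sentence.slice(1);
}

const description = describeVoiceFrom({
  gender: "female",
  timbre: "warm",
  pace: "slightly slow",
  recording: "very clear studio recording",
});
console.log(description);
```

Since there is no fixed vocabulary, free-text input works just as well; a builder like this is only useful when you want consistent phrasing across presets.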
Random voice
For a quick demo — or a "🎲 random voice" button — let the engine pick a description for you. Gerbil.inventVoiceDescription(hint?) writes a fresh description with a loaded text model (optionally steered by a hint), and ParlerTTS.randomVoiceDescription() pulls one from a host-side template bank with no model call. Pass the result straight into describeVoice:
```ts
import { Gerbil, DEFAULT_MODELS } from "@tryhamster/gerbil";

const g = new Gerbil();

// Let a text model invent a description (optionally steered by a hint):
const describeVoice = await g.inventVoiceDescription("a calm meditation guide");

await g.speak("Breathe in, and slowly let it go.", {
  model: DEFAULT_MODELS.ttsParler,
  describeVoice,
});
```
Parler-TTS-mini-v1 is a single-checkpoint model (~2.6 GB, downloaded on the first speak() and cached after) and outputs 44.1 kHz mono PCM. The hook returns the right sample rate automatically, so you never hardcode it. Parler shipped in engine 1.4.0; the describeVoice passthrough on the React hook landed in 1.4.1.
Sampling Controls
All three models emit audio tokens autoregressively, so the same nucleus-sampling knobs that shape text generation shape how the speech is delivered — they apply to Kani-TTS, OuteTTS, and Parler-TTS alike:
| Option | Default | Effect |
|---|---|---|
| temperature | 1.0 | Higher → more varied, expressive delivery; lower → flatter and more stable. |
| topP | 0.95 | Nucleus threshold — the cumulative probability mass kept when sampling each audio token. |
| topK | 0 | Keep only the K highest-probability codes per step (0 = off). Combine with topP to tighten sampling. |
| repetitionPenalty | 1.1 | Discourages repeated codes — reduces stutters and looping artifacts. |
| maxFrames | — | Optional hard cap on the number of audio frames (caps the clip duration). |
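To make the table concrete, here is a minimal sketch of how these knobs are commonly combined when sampling one token from a vector of logits. It is a generic illustration, not Gerbil's internal sampler: `sampleToken`, `SamplingOptions`, and the order of operations (repetition penalty, temperature, top-k, then top-p) are assumptions.

```typescript
interface SamplingOptions {
  temperature: number;
  topP: number;
  topK: number; // 0 = off
  repetitionPenalty: number;
}

// Sample one token id from raw logits. `history` holds previously emitted ids;
// `rand` returns a number in [0, 1) and is injectable for reproducible tests.
function sampleToken(
  logits: number[],
  history: number[],
  opts: SamplingOptions,
  rand: () => number = Math.random,
): number {
  const adjusted = logits.slice();

  // Repetition penalty: push down the score of ids that were already emitted.
  for (const id of new Set(history)) {
    const v = adjusted[id];
    adjusted[id] = v > 0 ? v / opts.repetitionPenalty : v * opts.repetitionPenalty;
  }

  // Temperature, then a numerically stable softmax.
  const scaled = adjusted.map((v) => v / Math.max(opts.temperature, 1e-6));
  const max = Math.max(...scaled);
  const exps = scaled.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);

  // Rank candidates by probability, highest first.
  let candidates = exps
    .map((e, id) => ({ id, p: e / sum }))
    .sort((a, b) => b.p - a.p);

  // Top-k: keep only the K most likely ids (0 disables the filter).
  if (opts.topK > 0) candidates = candidates.slice(0, opts.topK);

  // Top-p: keep the smallest prefix whose cumulative mass reaches topP.
  const kept: { id: number; p: number }[] = [];
  let cumulative = 0;
  for (const c of candidates) {
    kept.push(c);
    cumulative += c.p;
    if (cumulative >= opts.topP) break;
  }

  // Draw from the kept set, renormalized by its total mass.
  const total = kept.reduce((a, c) => a + c.p, 0);
  let r = rand() * total;
  for (const c of kept) {
    r -= c.p;
    if (r <= 0) return c.id;
  }
  return kept[kept.length - 1].id;
}
```

With topK set to 1 the sampler becomes greedy, which is the "flatter and more stable" end of the range; raising temperature or topP widens the set of codes that can be drawn at each step.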
Speak replies in a full voice loop: useVoiceChat
TTS is one half of a conversation. To wire speech synthesis to a chat model and a microphone in one step, use useVoiceChat — a complete on-device voice assistant that listens, thinks, and speaks its reply, all on WebGPU with no cloud round-trip:
```tsx
"use client";

import { useVoiceChat } from "@tryhamster/gerbil/hooks";

function VoiceAssistant() {
  const { messages, start, stop, isListening, isSpeaking } = useVoiceChat({
    voice: "en_us", // spoken with Kani-TTS-2, same as useTTS
  });

  return (
    <div>
      <button onClick={() => (isListening ? stop() : start())}>
        {isListening ? "Listening…" : isSpeaking ? "Speaking…" : "Tap to talk"}
      </button>
      {messages.map((m, i) => (
        <p key={i}><strong>{m.role}:</strong> {m.content}</p>
      ))}
    </div>
  );
}
```
It composes useSTT, useChat, and useTTS for you. Pass speak: false for a text-only loop, or your own ttsModel to swap the voice. See the React Hooks reference for the full surface.
How It Works
Kani-TTS-2 is a two-stage model. The codec-LM backbone (an LFM2-350M body with frame-level positions and learnable per-layer RoPE) autoregressively emits four NanoCodec audio tokens per frame. Those codes are then run through the NanoCodec decoder (FSQ dequant + a causal HiFi-GAN) to produce 22.05 kHz PCM. Everything runs on the same WebGPU engine as chat, vision, embeddings, and speech-to-text, entirely on-device.
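The hand-off between the two stages can be pictured as regrouping the backbone's output into frames of four codes before decoding. The sketch below illustrates only that grouping step. `groupIntoFrames` is a hypothetical helper, and it assumes the tokens arrive as one flat sequence in frame order, which the description above does not specify.

```typescript
// Group a flat stream of codec tokens into frames of `codesPerFrame` codes.
// A trailing partial frame is dropped, since a decoder needs complete frames.
function groupIntoFrames(
  tokens: readonly number[],
  codesPerFrame = 4,
): number[][] {
  const frames: number[][] = [];
  const complete = tokens.length - (tokens.length % codesPerFrame);
  for (let i = 0; i < complete; i += codesPerFrame) {
    frames.push(tokens.slice(i, i + codesPerFrame));
  }
  return frames;
}

const frames = groupIntoFrames([1, 2, 3, 4, 5, 6, 7, 8, 9]);
console.log(frames.length); // two complete frames; the ninth token is dropped
```

Each complete frame would then go through dequantization and the vocoder to become a short stretch of PCM.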
Requirements
Native TTS needs WebGPU: Safari 26+ (iOS 26+), Chrome/Edge 113+, or Firefox 141+. Gate on navigator.gpu before offering the feature, and fall back gracefully when it's missing.
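A feature gate for this can be quite small. The sketch below checks for `navigator.gpu` and then confirms that an adapter is actually available, since the property can exist on a device with no usable GPU. `canRunNativeTts` and the `NavigatorLike` type are hypothetical; the navigator object is passed in as a parameter so the function can also be exercised outside a browser.

```typescript
// Minimal structural types so the check does not depend on WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Resolve to true only if WebGPU is exposed and an adapter can be obtained.
async function canRunNativeTts(nav: NavigatorLike | undefined): Promise<boolean> {
  if (!nav?.gpu) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter != null;
  } catch {
    // Treat any failure to query the adapter as "not supported".
    return false;
  }
}
```

In a browser you would pass the global `navigator` (with a cast if your TypeScript DOM typings predate WebGPU) and hide or disable the speak button when the promise resolves to false.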
Looking for Speech-to-Text?
Native speech-to-text is available via Moonshine — see the STT docs.
Next Steps
- Speech-to-Text: transcribe audio with native Moonshine
- Browser engine: chat, vision, and embeddings on WebGPU
- Models: the native model lineup and sizes