Text-to-Speech

Name: Gerbil
Author: Gerbil

On-device text-to-speech running entirely on WebGPU. Kani-TTS is the fast default voice; OuteTTS adds a set of preset voices and voice cloning; Parler-TTS lets you design a voice by describing it in plain text. Audio is synthesized in the browser and never leaves the device.

Speech synthesis runs fully on-device on the same WebGPU engine as chat, vision, embeddings, and speech-to-text. Call engine.speak() to turn text into audio in the browser (22.05 kHz with the default Kani-TTS voice).

Native Models

Model	Architecture	Output	Backend	Status
Kani-TTS (en) · default	LFM2-350M codec-LM + NVIDIA NanoCodec	22.05 kHz mono PCM	WebGPU (WGSL)	Live
OuteTTS 0.6B	Multi-voice — preset voices + voice cloning	24 kHz mono PCM	WebGPU (WGSL)	Live
Parler-TTS Mini v1	Voice design — describe a voice in plain text	44.1 kHz mono PCM	WebGPU (WGSL)	Live

The default backbone repo is nineninesix/kani-tts-450m-0.2-ft (~492 MB at int4). The first speak() call also lazily downloads the NanoCodec decoder checkpoint (the NeMo 22 kHz MLX codec). Both are cached after the first run. Prefer multiple distinct voices or want to clone one? Switch to OuteTTS (repo OuteAI/OuteTTS-1.0-0.6B). Want to design a voice from a plain-text description instead? Switch to Parler-TTS (repo parler-tts/parler-tts-mini-v1).

Quick Start (React): `useTTS`

In React, use useTTS from @tryhamster/gerbil/hooks. It lazily loads the engine, synthesizes, plays the audio through Web Audio, and keeps the clip for replay(). It defaults to the Kani-TTS-2 model — no model argument needed:

Speaker.tsx

"use client";

import { useTTS } from "@tryhamster/gerbil/hooks";

function Speaker() {
  const { speak, replay, isSynthesizing, isPlaying, hasAudio, rtf } = useTTS();

  return (
    <div>
      <button
        onClick={() =>
          speak("Hello from Gerbil! This runs entirely on-device.", {
            voice: "en_us",     // language/accent tag (see Voices below)
            temperature: 1.0,   // sampling controls (see Controls below)
            topP: 0.95,
            repetitionPenalty: 1.1,
          })
        }
        disabled={isSynthesizing || isPlaying}
      >
        {isSynthesizing ? "Synthesizing…" : "Speak"}
      </button>
      {hasAudio && <button onClick={replay} disabled={isPlaying}>Play again</button>}
      {rtf && <span>{rtf.toFixed(1)}× real-time</span>}
    </div>
  );
}

Advanced: `engine.speak()` (vanilla JS / Node)

Outside React, create the engine directly and call speak(). It returns the raw PCM samples and their sample rate rather than an <audio>-ready file, so wrap the Float32Array in a Web Audio AudioBuffer at the returned sample rate (22050 Hz for Kani-TTS) to play it:

speak.ts

import { WebGPUEngine } from "@tryhamster/gerbil/gpu";

const engine = await WebGPUEngine.create({
  repo: "nineninesix/kani-tts-450m-0.2-ft",
});

const { pcm, sampleRate, frames, audioSeconds } = await engine.speak(
  "Hello from Gerbil! This runs entirely on-device.",
  { languageTag: "en_us", temperature: 1.0, topP: 0.95, repetitionPenalty: 1.1 },
);
// pcm → Float32Array, mono, [-1, 1] · sampleRate → 22050

// Play through Web Audio:
const ctx = new AudioContext();
const buffer = ctx.createBuffer(1, pcm.length, sampleRate);
buffer.copyToChannel(pcm, 0);
const source = ctx.createBufferSource();
source.buffer = buffer;
source.connect(ctx.destination);
source.start();
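Because speak() hands back raw samples rather than a file, saving or downloading a clip needs a container around the PCM. The sketch below shows one way to do that with a standard 16-bit mono WAV header. encodeWav is a hypothetical helper written for this page, not a Gerbil export, and it assumes the samples are mono floats in [-1, 1] as described above.

```typescript
// Hypothetical helper (not part of Gerbil): wrap mono Float32 PCM in [-1, 1]
// in a 16-bit PCM WAV container.
function encodeWav(pcm: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = pcm.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);                // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                          // size of the fmt chunk
  view.setUint16(20, 1, true);                           // format 1 = integer PCM
  view.setUint16(22, 1, true);                           // one channel (mono)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // bytes per second
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < pcm.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, pcm[i]));
    const scaled = s < 0 ? s * 0x8000 : s * 0x7fff;
    view.setInt16(44 + i * bytesPerSample, Math.round(scaled), true);
  }
  return new Uint8Array(buffer);
}
```

In a browser, the returned bytes can be wrapped in a Blob with type "audio/wav" and passed to URL.createObjectURL to feed an <audio> element or a download link.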

Voices

The default checkpoint (kani-tts-450m-0.2-ft, Apache-2.0) ships a single English voice — call ttsVoices() (or KaniTTS.availableVoices()) to see what the loaded checkpoint supports. Some checkpoints add selectable accent tags you pass via the voice option; the legacy nineninesix/kani-tts-2-en checkpoint, for example, exposes six US/UK accents (en_us, en_nyork, en_oakl, en_glasg, en_bost, en_scou). On a single-voice checkpoint the voice option is simply ignored.

The default Kani checkpoint has no accent selection, so its expressive range comes from the sampling controls below, and audio is fixed at the codec's native 22.05 kHz. For multiple distinct preset voices, or to clone a voice from a short clip, use OuteTTS.

Multi-voice: OuteTTS

OuteTTS-1.0-0.6B is the multi-voice option. It ships six preset voices and supports voice cloning from a short reference clip. Kani-TTS stays the default — select OuteTTS by passing its repo to useTTS:

OuteSpeaker.tsx

"use client";

import { useTTS, ttsVoices } from "@tryhamster/gerbil/hooks";
import { DEFAULT_MODELS } from "@tryhamster/gerbil";

function OuteSpeaker() {
  // Select OuteTTS via its repo. DEFAULT_MODELS.ttsOute resolves to
  // "OuteAI/OuteTTS-1.0-0.6B"; DEFAULT_MODELS.tts stays the Kani default.
  const { speak, isSynthesizing } = useTTS({ repo: DEFAULT_MODELS.ttsOute });

  // Preset voices for the loaded checkpoint (six for OuteTTS).
  const voices = ttsVoices(DEFAULT_MODELS.ttsOute);

  return (
    <button
      onClick={() =>
        speak("Now speaking with an OuteTTS preset voice.", {
          voice: voices[0].value, // e.g. "en-female-1-neutral"
          temperature: 1.0,
          topP: 0.95,
          repetitionPenalty: 1.1,
        })
      }
      disabled={isSynthesizing}
    >
      Speak
    </button>
  );
}

The hook returns the right sample rate per checkpoint automatically — Kani synthesizes at 22.05 kHz, OuteTTS at 24 kHz — so you never have to hardcode it when wrapping the PCM in a Web Audio buffer.

The same selection works from the one-liner and CLI:

oute-one-liner.ts

// One-liner: pass the OuteTTS repo + a preset voice.
import { speak } from "@tryhamster/gerbil";

await speak("Hello from OuteTTS!", {
  model: "OuteAI/OuteTTS-1.0-0.6B",
  voice: "en-female-1-neutral",
});

terminal

# CLI — synthesize with OuteTTS and a preset voice:
gerbil speak "Hello from OuteTTS!" \
  --model OuteAI/OuteTTS-1.0-0.6B \
  --voice en-female-1-neutral \
  --out hello.wav

Preset voices

Read the available voices at runtime with ttsVoices(DEFAULT_MODELS.ttsOute) (or the OUTE_VOICES export). OuteTTS exposes six presets — a mix of male/female and neutral/expressive English voices — that you pass through the voice option:

Voice	Character
en-female-1-neutral	English female, neutral delivery
en-female-2-neutral	English female, neutral delivery
en-male-1-neutral	English male, neutral delivery
en-male-2-neutral	English male, neutral delivery
en-female-1-expressive	English female, expressive delivery
en-male-1-expressive	English male, expressive delivery

Always read the live list from ttsVoices() rather than hardcoding — the presets are advertised by the loaded checkpoint.
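One way to follow that advice is to treat a stored voice name as a preference and check it against the live list before use. The sketch below assumes each entry has a string value field, which is inferred from voices[0].value in the example above. pickVoice and VoiceOption are hypothetical names, not Gerbil exports.

```typescript
// Assumed shape of one ttsVoices() entry, inferred from `voices[0].value`.
interface VoiceOption {
  value: string;
}

// Hypothetical helper: return the preferred voice if the loaded checkpoint
// advertises it, otherwise fall back to the first advertised voice.
function pickVoice(
  voices: readonly VoiceOption[],
  preferred?: string,
): string | undefined {
  if (preferred !== undefined && voices.some((v) => v.value === preferred)) {
    return preferred;
  }
  // undefined when the checkpoint advertises no voices at all
  return voices.length > 0 ? voices[0].value : undefined;
}
```

Passing the result of ttsVoices(DEFAULT_MODELS.ttsOute) as the first argument keeps a saved user preference working even if a different checkpoint, with a different preset list, is loaded later.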

Voice cloning

Beyond the presets, OuteTTS can clone a voice from raw audio: provide a short reference clip and its transcript, and the engine encodes the clip into a speaker, then synthesizes new text in that voice. Provide the clip as a Float32Array / AudioBuffer (or a data URL in the browser) along with the exact words spoken in it:

VoiceCloner.tsx

"use client";

import { useTTS } from "@tryhamster/gerbil/hooks";
import { DEFAULT_MODELS } from "@tryhamster/gerbil";

type ReferenceClip = Float32Array | AudioBuffer;

// The parent supplies the reference clip and its transcript
// (for example from a recording or a file input).
function VoiceCloner({ clip, transcript }: { clip: ReferenceClip; transcript: string }) {
  const tts = useTTS({ repo: DEFAULT_MODELS.ttsOute });

  async function cloneAndSpeak(referenceAudio: ReferenceClip, referenceTranscript: string) {
    // Encode the reference clip into a speaker, then synthesize new text in it.
    await tts.speakAs("Now I sound like the reference clip.", {
      referenceAudio,        // the short sample clip
      referenceTranscript,   // exactly what is said in that clip
      temperature: 1.0,
      topP: 0.95,
    });
  }

  return <button onClick={() => cloneAndSpeak(clip, transcript)}>Clone & speak</button>;
}

A clean, 5–15 second clip with an accurate transcript clones best. Keep the synthesized text reasonably short for the most natural result. Cloning runs fully on-device — the reference audio never leaves the browser.
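A quick pre-flight check on the reference clip can catch the most common problems before any encoding work starts. The sketch below turns the 5 to 15 second guidance into an advisory check. checkReferenceClip is a hypothetical helper, the thresholds are recommendations rather than limits enforced by the engine, and the caller is assumed to know the clip's sample rate.

```typescript
interface ClipCheck {
  seconds: number;   // clip duration derived from sample count and rate
  ok: boolean;       // true when the clip follows the guidance
  reason?: string;   // set only when ok is false
}

// Hypothetical helper: check a mono reference clip against the
// "clean, 5 to 15 second clip with an accurate transcript" guidance.
function checkReferenceClip(
  samples: Float32Array,
  sampleRate: number,
  transcript: string,
): ClipCheck {
  const seconds = samples.length / sampleRate;
  if (transcript.trim().length === 0) {
    return { seconds, ok: false, reason: "transcript is empty" };
  }
  if (seconds < 5) {
    return { seconds, ok: false, reason: "clip is shorter than 5 seconds" };
  }
  if (seconds > 15) {
    return { seconds, ok: false, reason: "clip is longer than 15 seconds" };
  }
  return { seconds, ok: true };
}
```

A failed check is a reasonable place to prompt the user to re-record rather than to block cloning outright, since shorter or longer clips may still work with lower quality.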

Voice design: Parler-TTS

Parler-TTS-mini-v1 takes a different approach to voice: instead of picking from a list of presets or cloning a clip, you describe the voice you want in plain English and the model designs a speaker to match. Kani-TTS stays the default — select Parler by passing its repo to useTTS, then pass your description through the describeVoice option:

ParlerSpeaker.tsx

"use client";

import { useTTS } from "@tryhamster/gerbil/hooks";
import { DEFAULT_MODELS } from "@tryhamster/gerbil";

function ParlerSpeaker() {
  // Select Parler-TTS via its repo. DEFAULT_MODELS.ttsParler resolves to
  // "parler-tts/parler-tts-mini-v1"; DEFAULT_MODELS.tts stays the Kani default.
  const { speak, isSynthesizing } = useTTS({ repo: DEFAULT_MODELS.ttsParler });

  return (
    <button
      onClick={() =>
        speak("Now speaking with a voice designed from a description.", {
          // Describe the voice in plain text; Parler builds a speaker to match.
          describeVoice:
            "a warm female voice, slightly slow, very clear studio recording",
          temperature: 1.0,
          topP: 0.95,
        })
      }
      disabled={isSynthesizing}
    >
      Speak
    </button>
  );
}

The same selection works from the one-liner and CLI — pass the Parler repo and your describeVoice prompt:

parler-one-liner.ts

// One-liner: pass the Parler repo + a voice description.
import { speak, DEFAULT_MODELS } from "@tryhamster/gerbil";

await speak("Hello from Parler-TTS!", {
  model: DEFAULT_MODELS.ttsParler, // "parler-tts/parler-tts-mini-v1"
  describeVoice:
    "a warm female voice, slightly slow, very clear studio recording",
});

terminal

# CLI — synthesize with Parler and a voice description:
gerbil speak "Hello from Parler-TTS!" \
  --model parler-tts/parler-tts-mini-v1 \
  --describe-voice "a warm female voice, slightly slow, very clear studio recording" \
  --out hello.wav

Describing a voice

The describeVoice string is a natural-language prompt that Parler's Flan-T5 description encoder turns into the conditioning the decoder uses to design the speaker. There's no fixed vocabulary — describe gender, tone, pace, and the recording setting, and Parler synthesizes to match. Cues that tend to land well: timbre (warm, deep, bright, gravelly), pace (slightly slow, measured, quick), and recording quality (very clear studio recording, close-sounding, no background noise). A few examples:

describeVoice
A warm female voice, speaking slightly slowly, in a very clear and close-sounding studio recording.
A deep, calm male voice with a measured pace, recorded in a quiet room with no background noise.
A bright, energetic female voice speaking quickly and expressively, with crisp studio-quality audio.

When you omit describeVoice, a neutral default description is used.
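If the description comes from UI controls rather than free text, it can help to assemble it from the cue categories listed above. The sketch below is one possible way to do that. describeVoiceFrom and VoiceTraits are hypothetical names, and the output is just a string to pass as describeVoice, since the source defines no fixed vocabulary.

```typescript
// Hypothetical trait bag mirroring the cues above: timbre, pace, recording.
interface VoiceTraits {
  gender?: string;     // e.g. "female"
  timbre?: string;     // e.g. "warm", "deep", "bright"
  pace?: string;       // e.g. "slightly slow", "measured"
  recording?: string;  // e.g. "very clear studio recording"
}

// Hypothetical helper: join whichever traits are present into one description.
function describeVoiceFrom(traits: VoiceTraits): string {
  const words: string[] = [];
  if (traits.timbre) words.push(traits.timbre);
  if (traits.gender) words.push(traits.gender);
  words.push("voice");
  const noun = words.join(" ");
  const article = /^[aeiou]/i.test(noun) ? "an" : "a";

  const clauses: string[] = [`${article} ${noun}`];
  if (traits.pace) clauses.push(traits.pace);
  if (traits.recording) clauses.push(traits.recording);
  return clauses.join(", ");
}
```

With timbre "warm", gender "female", pace "slightly slow", and recording "very clear studio recording", this reproduces the description used in the examples above.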

Random voice

For a quick demo — or a "🎲 random voice" button — let the engine pick a description for you. Gerbil.inventVoiceDescription(hint?) writes a fresh description with a loaded text model (optionally steered by a hint), and ParlerTTS.randomVoiceDescription() pulls one from a host-side template bank with no model call. Pass the result straight into describeVoice:

random-voice.ts

import { Gerbil, DEFAULT_MODELS } from "@tryhamster/gerbil";

const g = new Gerbil();

// Let a text model invent a description (optionally steered by a hint):
const describeVoice = await g.inventVoiceDescription("a calm meditation guide");

await g.speak("Breathe in, and slowly let it go.", {
  model: DEFAULT_MODELS.ttsParler,
  describeVoice,
});

Parler-TTS-mini-v1 is a single-checkpoint model (~2.6 GB, downloaded on the first speak() and cached after) and outputs 44.1 kHz mono PCM — the hook returns the right sample rate automatically, so you never hardcode it. Parler shipped in engine 1.4.0; the describeVoice passthrough on the React hook landed in 1.4.1.
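To see why the sample rate should come from the result rather than a constant, it helps to work through the arithmetic. The sketch below uses the three output rates from the model table. The constant and function names are illustrative only.

```typescript
// Output sample rates from the model table, in Hz.
const SAMPLE_RATE_HZ = { kani: 22050, oute: 24000, parler: 44100 } as const;

// Number of mono samples in a clip of the given duration.
function samplesFor(seconds: number, sampleRate: number): number {
  return Math.round(seconds * sampleRate);
}

// Playback duration of a clip when interpreted at the given sample rate.
function durationSeconds(sampleCount: number, sampleRate: number): number {
  return sampleCount / sampleRate;
}

// A 2-second Kani clip holds 44,100 samples. Declared at Parler's rate,
// the same samples play back in half the time.
const kaniSamples = samplesFor(2, SAMPLE_RATE_HZ.kani);
const correctDuration = durationSeconds(kaniSamples, SAMPLE_RATE_HZ.kani);
const wrongDuration = durationSeconds(kaniSamples, SAMPLE_RATE_HZ.parler);
```

Here correctDuration is 2 seconds and wrongDuration is 1 second, which is what a hardcoded 44100 would produce for Kani audio: playback at double speed and pitch.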

Sampling Controls

All three models emit audio tokens autoregressively, so the same nucleus-sampling knobs that shape text generation shape how the speech is delivered — they apply to Kani-TTS, OuteTTS, and Parler-TTS alike:

Option	Default	Effect
temperature	1.0	Higher → more varied, expressive delivery; lower → flatter and more stable.
topP	0.95	Nucleus threshold — the cumulative probability mass kept when sampling each audio token.
topK	0	Keep only the K highest-probability codes per step (0 = off). Combine with topP to tighten sampling.
repetitionPenalty	1.1	Discourages repeated codes — reduces stutters and looping artifacts.
maxFrames	—	Optional hard cap on the number of audio frames (caps the clip duration).
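For intuition about how these options interact, the sketch below applies them in a common order (repetition penalty, temperature, top-k, then top-p) to one step's logits and returns the distribution a sampler would draw from. It is a generic illustration of the technique, not Gerbil's implementation. The function name, the ordering, and the sign-aware form of the repetition penalty are assumptions, and temperature is assumed to be greater than 0.

```typescript
interface SamplingOptions {
  temperature?: number;
  topP?: number;
  topK?: number;
  repetitionPenalty?: number;
}

// Illustrative only: probabilities left after applying the sampling controls.
function samplingDistribution(
  logits: number[],
  previous: number[] = [],
  options: SamplingOptions = {},
): number[] {
  const { temperature = 1.0, topP = 0.95, topK = 0, repetitionPenalty = 1.1 } = options;

  // Mark codes that were already emitted so they can be penalized.
  const penalized: boolean[] = [];
  previous.forEach((id) => { penalized[id] = true; });

  const scaled = logits.map((s, i) => {
    const score = penalized[i]
      ? (s > 0 ? s / repetitionPenalty : s * repetitionPenalty)
      : s;
    return score / temperature; // higher temperature flattens the distribution
  });

  // Numerically stable softmax.
  const max = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / total);

  // Rank codes by probability, cap at topK (0 = off), then keep the
  // smallest prefix whose cumulative mass reaches topP.
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const limit = topK > 0 ? Math.min(topK, order.length) : order.length;
  const kept: number[] = [];
  let mass = 0;
  for (const id of order.slice(0, limit)) {
    kept.push(id);
    mass += probs[id];
    if (mass >= topP) break;
  }

  // Renormalize over the surviving codes; everything else gets probability 0.
  const out = probs.map(() => 0);
  for (const id of kept) out[id] = probs[id] / mass;
  return out;
}
```

With logits [2, 1, 0] and topP 0.9, the two most likely codes already cover about 91% of the mass, so the third is dropped and the remaining two are renormalized.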

Speak replies in a full voice loop: `useVoiceChat`

TTS is one half of a conversation. To wire speech synthesis to a chat model and a microphone in one step, use useVoiceChat — a complete on-device voice assistant that listens, thinks, and speaks its reply, all on WebGPU with no cloud round-trip:

VoiceAssistant.tsx

"use client";

import { useVoiceChat } from "@tryhamster/gerbil/hooks";

function VoiceAssistant() {
  const { messages, start, stop, isListening, isSpeaking } = useVoiceChat({
    voice: "en_us", // spoken with Kani-TTS-2, same as useTTS
  });

  return (
    <div>
      <button onClick={() => (isListening ? stop() : start())}>
        {isListening ? "Listening…" : isSpeaking ? "Speaking…" : "Tap to talk"}
      </button>
      {messages.map((m, i) => (
        <p key={i}><strong>{m.role}:</strong> {m.content}</p>
      ))}
    </div>
  );
}

It composes useSTT, useChat, and useTTS for you — pass speak: false for a text-only loop, or your own ttsModel to swap the voice. See the React Hooks reference for the full surface.

How It Works

Kani-TTS-2 is a two-stage model. The codec-LM backbone (an LFM2-350M body with frame-level positions and learnable per-layer RoPE) autoregressively emits four NanoCodec audio tokens per frame. Those codes are then run through the NanoCodec decoder (FSQ dequant + a causal HiFi-GAN) to produce 22.05 kHz PCM. Everything runs on the same WebGPU engine as chat, vision, embeddings, and speech-to-text, entirely on-device.

Requirements

Native TTS needs WebGPU: Safari 26+ (iOS 26+), Chrome/Edge 113+, or Firefox 141+. Gate on navigator.gpu before offering the feature, and fall back gracefully when it's missing.
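A feature gate along those lines might look like the sketch below. It checks that navigator.gpu exists and that requestAdapter() actually yields an adapter, since the property can be present on hardware that still cannot provide one. supportsWebGPU and the two structural interfaces are hypothetical names, declared here so the example compiles without WebGPU typings.

```typescript
// Minimal structural types so the sketch does not depend on WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Hypothetical helper: resolves to true only when an adapter is available.
async function supportsWebGPU(nav: NavigatorLike | undefined): Promise<boolean> {
  if (!nav || !nav.gpu) return false; // no WebGPU entry point at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    return false; // treat any failure as "not supported"
  }
}
```

Call it with the global navigator object (cast to NavigatorLike if your TypeScript setup lacks WebGPU typings), and render a text-only fallback or hide the speak button when it resolves to false.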

Looking for Speech-to-Text?

Native speech-to-text is available via Moonshine — see the STT docs.

Next Steps

Speech-to-Text → — transcribe audio with native Moonshine
Browser engine → — chat, vision, and embeddings on WebGPU
Models → — the native model lineup and sizes