Speech-to-Text

Local speech-to-text transcription using Whisper ONNX models via transformers.js. No API keys required - everything runs on-device.

Quick Start

Node.js

stt-basic.ts
import { Gerbil } from "@tryhamster/gerbil";
import { readFileSync } from "fs";

const g = new Gerbil();

// Transcribe a WAV file
const audioData = new Uint8Array(readFileSync("audio.wav"));
const result = await g.transcribe(audioData);
console.log(result.text);

CLI

terminal
# Transcribe audio file
gerbil transcribe audio.wav
# With timestamps
gerbil transcribe audio.wav --timestamps
# List available models
gerbil transcribe --list-models
# Use a different model
gerbil transcribe audio.wav --model whisper-base.en

Available Models

Model                   Size   Languages   Description
whisper-tiny.en         39M    English     Fastest, good for simple audio
whisper-tiny            39M    Multi       Multilingual tiny model
whisper-base.en         74M    English     Good balance of speed/accuracy
whisper-base            74M    Multi       Multilingual base model
whisper-small.en        244M   English     High quality transcription
whisper-small           244M   Multi       High quality multilingual
whisper-large-v3-turbo  809M   80+ langs   Best quality, 5.4x faster than v3
Recommended: Use whisper-large-v3-turbo for best quality with excellent speed. For quick demos, whisper-tiny.en offers the fastest loading.
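
For example, to preload the recommended model before the first transcription (a minimal sketch built from the loadSTT and transcribe calls documented below):

pick-model.ts
import { Gerbil } from "@tryhamster/gerbil";
import { readFileSync } from "fs";

const g = new Gerbil();

// Download (first run only) and load the recommended model up front
await g.loadSTT("whisper-large-v3-turbo");

const audioData = new Uint8Array(readFileSync("audio.wav"));
const { text } = await g.transcribe(audioData);
console.log(text);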

API Reference

Gerbil Class Methods

stt-api.ts
// Load STT model explicitly (optional - auto-loads on first transcribe)
await g.loadSTT("whisper-tiny.en", {
  onProgress: (p) => console.log(p.status),
});

// Transcribe audio
const result = await g.transcribe(audioData, {
  language: "en",   // Language hint (multilingual models only)
  timestamps: true, // Return segment timestamps
  onProgress: (p) => console.log(p.status),
});

// Result structure
console.log(result.text);      // Full transcription
console.log(result.duration);  // Audio duration in seconds
console.log(result.totalTime); // Processing time in ms
console.log(result.segments);  // Timestamped segments (if requested)

// Check if loaded
console.log(g.isSTTLoaded());

// List available models
const models = await g.listSTTModels();

Audio Input Formats

audio-formats.ts
import { readFileSync } from "fs";

// WAV file (Uint8Array) - automatically decoded
const wavData = new Uint8Array(readFileSync("audio.wav"));
await g.transcribe(wavData);

// Raw audio (Float32Array at 16kHz mono)
const raw16k = new Float32Array(/* 16kHz mono samples */);
await g.transcribe(raw16k);

// WAV at any sample rate - automatically resampled to 16kHz
const wav44k = new Uint8Array(readFileSync("audio-44khz.wav"));
await g.transcribe(wav44k); // Resampled internally
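
In the browser, where fs isn't available, a standard Web Audio decode produces a 16kHz mono Float32Array from any format the browser supports. This sketch uses only browser APIs; nothing in it is gerbil-specific:

browser-decode.ts
// Fixing the AudioContext sample rate makes decodeAudioData
// resample the file to 16kHz during decoding
const ctx = new AudioContext({ sampleRate: 16000 });

const bytes = await (await fetch("/audio.mp3")).arrayBuffer();
const decoded = await ctx.decodeAudioData(bytes);

// First channel = mono Float32Array at 16kHz, ready for transcribe()
const samples = decoded.getChannelData(0);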

Transcription with Timestamps

timestamps.ts
const result = await g.transcribe(audioData, { timestamps: true });

for (const segment of result.segments || []) {
  console.log(`[${segment.start.toFixed(1)}s - ${segment.end.toFixed(1)}s] ${segment.text}`);
}
// [0.0s - 5.2s] Hello world, this is a test
// [5.2s - 8.0s] of the transcription system
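
Segments are plain { start, end, text } objects, so converting them to other formats is straightforward. For instance, a small helper (not part of gerbil) that turns a timestamped result into SRT subtitles:

segments-to-srt.ts
// Format seconds as an SRT timestamp (HH:MM:SS,mmm)
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(Math.floor(ms / 3600000))}:${pad(Math.floor(ms / 60000) % 60)}:${pad(Math.floor(ms / 1000) % 60)},${pad(ms % 1000, 3)}`;
}

const srt = (result.segments ?? [])
  .map((s, i) => `${i + 1}\n${srtTime(s.start)} --> ${srtTime(s.end)}\n${s.text.trim()}\n`)
  .join("\n");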

Language Detection (Multilingual Models)

language-detection.ts
// Use multilingual model
await g.loadSTT("whisper-base");

// Let Whisper detect language
const result = await g.transcribe(audioData);
console.log(result.language); // "en", "es", "fr", etc.

// Or provide a language hint
const spanishResult = await g.transcribe(audioData, { language: "es" });

Direct WhisperSTT Class

For lower-level control:

whisper-stt.ts
import { WhisperSTT } from "@tryhamster/gerbil/core/stt";

const stt = new WhisperSTT("whisper-tiny.en");

await stt.load({
  onProgress: (p) => console.log(p.status),
});

// audioData: Uint8Array (WAV) or Float32Array (16kHz mono) - see Audio Input Formats
const result = await stt.transcribe(audioData, {
  timestamps: true,
});

console.log(result.text);

// Free model memory when done
stt.dispose();

Audio Utilities

audio-utils.ts
import { decodeWav, resampleAudio } from "@tryhamster/gerbil/core/stt";

// Decode WAV to Float32Array
const { audio, sampleRate } = decodeWav(wavUint8Array);

// Resample to 16kHz (Whisper requirement)
const audio16k = resampleAudio(audio, sampleRate, 16000);

Streaming Transcription (Core)

Transcribe audio in real time as it's recorded, instead of waiting until recording stops. Perfect for live captioning, call transcription, and voice note-taking.

streaming-core.ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();

// Create a streaming transcription session
const session = await g.createStreamingTranscription({
  chunkDuration: 1500, // Transcribe every 1.5 seconds
  onChunk: (text, idx) => {
    console.log(`Chunk ${idx}: ${text}`);
  },
  onTranscript: (fullText) => {
    console.log("Full transcript:", fullText);
  },
});

// Start the session
session.start();

// Feed audio as it comes in (Float32Array at 16kHz)
session.feedAudio(audioChunk);

// ... continue feeding audio ...

// Stop and get final transcript
const finalText = await session.stop();
console.log("Final:", finalText);
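
To exercise a streaming session without a microphone, you can replay a WAV file through it in chunks. A sketch using the documented decodeWav/resampleAudio utilities and the session from the example above:

streaming-replay.ts
import { decodeWav, resampleAudio } from "@tryhamster/gerbil/core/stt";
import { readFileSync } from "fs";

// Decode and resample a recording to Whisper's 16kHz
const { audio, sampleRate } = decodeWav(new Uint8Array(readFileSync("audio.wav")));
const audio16k = resampleAudio(audio, sampleRate, 16000);

// Replay in 1.5-second slices, matching chunkDuration above
session.start();
const chunkSamples = 16000 * 1.5;
for (let i = 0; i < audio16k.length; i += chunkSamples) {
  session.feedAudio(audio16k.subarray(i, i + chunkSamples));
}
console.log("Final:", await session.stop());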

StreamingMicrophone (Node.js)

For Node.js, use StreamingMicrophone to capture audio in real time:

streaming-mic.ts
import { Gerbil, StreamingMicrophone } from "@tryhamster/gerbil";

const g = new Gerbil();

// Create streaming session
const session = await g.createStreamingTranscription({
  chunkDuration: 1500,
  onTranscript: (text) => process.stdout.write(`\r> ${text}`),
});

// Create streaming microphone (requires SoX)
const mic = new StreamingMicrophone({
  sampleRate: 16000,
  onAudio: (chunk) => session.feedAudio(chunk),
});

session.start();
mic.start();

// Record for 10 seconds
await new Promise(r => setTimeout(r, 10000));

mic.stop();
const transcript = await session.stop();
console.log("\nFinal:", transcript);

Streaming Options

Option         Type                 Default  Description
chunkDuration  number               1500     Milliseconds between transcriptions
onChunk        (text, idx) => void  -        Callback for each chunk result
onTranscript   (fullText) => void   -        Callback with full accumulated transcript
Note: A shorter chunkDuration is more responsive but adds processing overhead. Each chunk is transcribed independently. StreamingMicrophone requires SoX to be installed.

React Hook (Browser)

Use useVoiceInput for browser-based voice recording and transcription:

VoiceInput.tsx
import { useVoiceInput } from "@tryhamster/gerbil/browser";

function VoiceInput() {
  const {
    startRecording,
    stopRecording,
    isRecording,
    isTranscribing,
    transcript,
    error,
  } = useVoiceInput({
    model: "whisper-tiny.en",
    onTranscript: (text) => console.log("User said:", text),
  });

  return (
    <div>
      <button onClick={isRecording ? stopRecording : startRecording}>
        {isRecording ? "🔴 Stop" : "🎤 Record"}
      </button>
      {isTranscribing && <span>Transcribing...</span>}
      {transcript && <p>{transcript}</p>}
    </div>
  );
}

Hook Options

useVoiceInput.ts
const {
  startRecording,  // () => Promise<void> - start recording
  stopRecording,   // () => Promise<string> - stop and transcribe
  cancelRecording, // () => void - stop without transcribing
  transcribe,      // (audio: Float32Array) => Promise<string>
  isRecording,     // boolean - currently recording
  isTranscribing,  // boolean - transcribing audio
  isLoading,       // boolean - model loading
  isReady,         // boolean - model ready
  transcript,      // string - latest result
  loadingProgress, // STTProgress - loading status
  error,           // string | null
  load,            // () => void - manually load model
} = useVoiceInput({
  model: "whisper-tiny.en",   // STT model ID
  autoLoad: false,            // loads on first record
  onReady: () => {},          // called when model ready
  onTranscript: (text) => {}, // called on each transcription
  onError: (error) => {},     // called on errors
  onProgress: (p) => {},      // loading progress
});

Streaming Transcription

Transcribe audio in real time as the user speaks, instead of waiting until they stop recording. Perfect for live captioning, call transcription, and voice note-taking.

LiveTranscription.tsx
import { useVoiceInput } from "@tryhamster/gerbil/browser";

function LiveTranscription() {
  const {
    startRecording,
    stopRecording,
    isRecording,
    transcript,     // Full text so far
    streamingChunk, // Latest chunk
    chunkCount,
    isTranscribing,
  } = useVoiceInput({
    streaming: true,
    chunkDuration: 1500, // Transcribe every 1.5 seconds
    onChunk: (text, idx) => {
      console.log(`Chunk ${idx}: ${text}`);
    },
  });

  return (
    <div>
      <button onClick={isRecording ? stopRecording : startRecording}>
        {isRecording ? "Stop" : "Start Live Transcription"}
      </button>

      {isTranscribing && <span>Processing...</span>}
      {streamingChunk && <p style={{ opacity: 0.6 }}>"{streamingChunk}"</p>}

      <div>
        <strong>Full Transcript ({chunkCount} chunks):</strong>
        <p>{transcript}</p>
      </div>
    </div>
  );
}

Streaming Options

Option         Type                 Default  Description
streaming      boolean              false    Enable real-time transcription
chunkDuration  number               1500     Milliseconds between transcriptions
onChunk        (text, idx) => void  -        Callback for each chunk result
Note: A shorter chunkDuration is more responsive but adds processing overhead. Each chunk is transcribed independently (no context from previous chunks).

useVoiceChat Hook (Full Voice Conversation)

For complete voice-to-voice conversations (STT → LLM → TTS):

VoiceAssistant.tsx
import { useVoiceChat } from "@tryhamster/gerbil/browser";

function VoiceAssistant() {
  const {
    messages,
    startListening,
    stopListening,
    isListening,
    isSpeaking,
    stage, // "idle" | "listening" | "transcribing" | "thinking" | "speaking"
    isReady,
  } = useVoiceChat({
    llmModel: "qwen3-0.6b",
    sttModel: "whisper-tiny.en",
    voice: "af_bella",
    system: "You are a helpful voice assistant.",
  });

  return (
    <div>
      {messages.map(m => <p key={m.id}>{m.role}: {m.content}</p>)}
      <button
        onMouseDown={startListening}
        onMouseUp={stopListening}
      >
        {stage === "idle" ? "🎤 Hold to Speak" : stage}
      </button>
    </div>
  );
}

CLI Commands

Transcribe

terminal
# Basic transcription
gerbil transcribe recording.wav
# With timestamps
gerbil transcribe recording.wav --timestamps
# Save to file
gerbil transcribe recording.wav --output transcript.txt
# Use specific model
gerbil transcribe recording.wav --model whisper-base.en
# Multilingual with language hint
gerbil transcribe recording.wav --model whisper-base --language es
# List models
gerbil transcribe --list-models

Voice Chat (STT → LLM → TTS)

A complete voice conversation loop from an audio file:

terminal
# Voice chat with audio file
gerbil voice question.wav
# Customize models and voice
gerbil voice question.wav --model qwen3-1.7b --voice bf_emma
# With custom system prompt
gerbil voice question.wav --system "You are a pirate. Speak like one!"
# Enable thinking mode
gerbil voice question.wav --thinking

This command transcribes your audio (Whisper), generates a response (LLM), and speaks the response (Kokoro TTS).
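
The same loop can be scripted. In this sketch only transcribe() is a documented gerbil call; generateReply and speak are hypothetical placeholders for your own LLM and TTS steps:

voice-pipeline.ts
import { Gerbil } from "@tryhamster/gerbil";
import { readFileSync } from "fs";

// Placeholders - NOT gerbil APIs; substitute your own LLM and TTS calls
// (the CLI uses an LLM plus Kokoro TTS for these two steps)
declare function generateReply(prompt: string): Promise<string>;
declare function speak(text: string): Promise<void>;

const g = new Gerbil();

// 1. STT: transcribe the question (documented API)
const { text: question } = await g.transcribe(
  new Uint8Array(readFileSync("question.wav")),
);

// 2. LLM: generate a response
const reply = await generateReply(question);

// 3. TTS: speak the response
await speak(reply);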

Performance

Transcription speed on Apple M1:

Model             Audio Length  Time    Speed
whisper-tiny.en   11s           ~250ms  44x realtime
whisper-base.en   11s           ~500ms  22x realtime
whisper-small.en  11s           ~1.5s   7x realtime

Troubleshooting

"Unsupported bit depth"

Only 16-bit WAV files are supported. Convert with:

terminal
ffmpeg -i audio.mp3 -ar 16000 -ac 1 -sample_fmt s16 output.wav

Slow first transcription

The first call downloads the model (~40MB for tiny up to ~800MB for large-v3-turbo). Subsequent calls use the cached model.
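
To keep that download out of a request path, warm the model at startup with the documented loadSTT call:

preload.ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();

// Trigger download/load once at boot; later transcribe() calls are fast
await g.loadSTT("whisper-base.en", {
  onProgress: (p) => console.log(p.status),
});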

Memory usage

Whisper models require:

  • tiny: ~150MB RAM
  • base: ~300MB RAM
  • small: ~600MB RAM
  • large-v3-turbo: ~1.5GB RAM

Microphone Recording (Node.js)

Record audio directly from your system microphone in Node.js using SoX.

Requirements

Requires SoX (Sound eXchange) to be installed:

terminal
# macOS
brew install sox
# Ubuntu/Debian
sudo apt install sox
# Windows
# Download from https://sox.sourceforge.net/

Listen API

One-liner to record and transcribe:

listen.ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();

// Check if microphone is available
const available = await g.isMicrophoneAvailable();

// Record and transcribe in one call
const result = await g.listen(5000, { // 5 seconds
  onProgress: (status) => console.log(status),
});
console.log(result.text);

Low-level Microphone API

microphone.ts
import { Microphone, isSoxAvailable } from "@tryhamster/gerbil";

if (!isSoxAvailable()) {
  console.log("Install SoX first!");
  process.exit(1);
}

const mic = new Microphone({ sampleRate: 16000 });
await mic.start();

// Record for 5 seconds
await new Promise(r => setTimeout(r, 5000));

const { audio, sampleRate, duration } = await mic.stop();
// audio = Float32Array at 16kHz, ready for Whisper
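
The captured audio is already in the format transcribe() accepts, so the two APIs compose directly (continuing the example above):

mic-to-text.ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();

// `audio` is the 16kHz Float32Array returned by mic.stop() above
const result = await g.transcribe(audio);
console.log(result.text);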

Next Steps