How It Works

Technical deep-dive into how Gerbil works under the hood.

System Overview

Key Design Decisions

1. transformers.js as the Foundation

We use Hugging Face's transformers.js, which provides pre-converted ONNX models, exact tokenizers, and a unified API across all backends.
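
For reference, here is a minimal sketch of the underlying transformers.js call that Gerbil wraps; the model repo id shown is illustrative, not necessarily the id Gerbil resolves internally:

pipeline-sketch.ts
import { pipeline } from "@huggingface/transformers";

// Load a pre-converted ONNX model and its matching tokenizer (illustrative repo id)
const generator = await pipeline("text-generation", "onnx-community/Qwen3-0.6B-ONNX", {
  device: "webgpu",
  dtype: "q4f16",
});

// One call handles tokenization, inference, and decoding
const output = await generator("Write a haiku about gerbils.", { max_new_tokens: 64 });
console.log(output);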

2. WebGPU First

WebGPU provides a 5-10x speedup over the CPU backend. In browsers, we use native WebGPU; in Node.js, we use headless Chrome as a WebGPU accelerator (ChromeGPUBackend).
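
A minimal sketch of the browser-side capability check, assuming the standard navigator.gpu API (not Gerbil's exact selection logic):

device-select.ts
// Prefer WebGPU when an adapter is available; otherwise fall back to WASM.
async function pickDevice(): Promise<"webgpu" | "wasm"> {
  const gpu = (navigator as any).gpu; // WebGPU typings may require @webgpu/types
  if (gpu) {
    const adapter = await gpu.requestAdapter();
    if (adapter) return "webgpu";
  }
  return "wasm";
}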

3. Quantization for Speed

All models use quantized weights: q4f16 (4-bit weights, fp16 compute) for WebGPU and q4 for CPU. This reduces model size by ~4x.
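
A sketch of how that choice maps onto the execution device; the helper name is hypothetical, and treating WASM like CPU here is an assumption:

dtype-select.ts
type Device = "webgpu" | "cpu" | "wasm";

// q4f16 relies on fp16 compute, so it is reserved for WebGPU; q4 elsewhere.
function dtypeFor(device: Device): "q4f16" | "q4" {
  return device === "webgpu" ? "q4f16" : "q4";
}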

4. Streaming via Web Workers

Browser inference runs in a Web Worker to keep the UI responsive, stream tokens in real time, and isolate GPU memory from the main thread.

Inference Pipeline

The inference stack consists of three layers:

Execution Backends

Backend | Environment      | Speed          | Notes
--------|------------------|----------------|----------------------------------
WebGPU  | Browser, Chrome  | ~100-150 tok/s | Fastest, requires GPU
CPU     | Node.js          | ~30-60 tok/s   | Uses SIMD, good on Apple Silicon
WASM    | Browser fallback | ~5-10 tok/s    | Works everywhere

Quantization Types

Type  | Weights      | Compute | Size Reduction | Use Case
------|--------------|---------|----------------|------------------
fp32  | 32-bit float | 32-bit  | 1x (baseline)  | Training
fp16  | 16-bit float | 16-bit  | 2x             | GPU inference
q4f16 | 4-bit int    | 16-bit  | ~4x            | WebGPU inference
q4    | 4-bit int    | 32-bit  | ~4x            | CPU inference

WebGPU Acceleration

WebGPU is a modern compute API that provides access to GPU hardware. Gerbil uses it for fast inference in both browsers and Node.js.

Browser Path

Node.js WebGPU Path (ChromeGPUBackend)

Node.js doesn't have native WebGPU, so Gerbil uses headless Chrome as a GPU accelerator:

The fixed port (43724) ensures consistent IndexedDB caching across runs.
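
Gerbil's ChromeGPUBackend source isn't reproduced here, but the general shape looks roughly like the following puppeteer-based sketch; the flag names, file names, and page contents are illustrative:

chrome-gpu-sketch.ts
import http from "node:http";
import puppeteer from "puppeteer";

const PORT = 43724; // fixed port → stable origin → Chrome's IndexedDB cache survives restarts

// Serve a page that bundles transformers.js and runs WebGPU inference in-page.
// (A real implementation would also serve the inference script referenced here.)
http
  .createServer((_req, res) => {
    res.setHeader("content-type", "text/html");
    res.end('<!doctype html><script type="module" src="/inference.js"></script>');
  })
  .listen(PORT);

// Drive headless Chrome over the DevTools protocol; exact flags vary by platform.
const browser = await puppeteer.launch({
  headless: true,
  args: ["--enable-unsafe-webgpu", "--enable-features=Vulkan"],
});
const page = await browser.newPage();
await page.goto(`http://localhost:${PORT}`);

// Generation requests are proxied into the page (e.g. via page.evaluate),
// and tokens are streamed back to Node.js (e.g. via page.exposeFunction).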

Streaming Architecture

LLM inference is computationally intensive. Gerbil runs it in a Web Worker to keep the UI responsive:

Message Protocol

protocol.ts
// Main → Worker
{ type: "load", modelId: "qwen3-0.6b" }
{ type: "generate", messages: [...], options: {...} }
{ type: "interrupt" }
{ type: "reset" }
// Worker → Main
{ status: "loading", message: "Loading model..." }
{ status: "progress", file: "model.onnx", progress: 50 }
{ status: "ready" }
{ status: "token", text: "Hello", state: "answering", tps: 75 }
{ status: "complete", text: "Hello world!", numTokens: 3, tps: 75 }
{ status: "error", error: "Out of memory" }

Thinking State Tracking

For Qwen3 thinking mode, Gerbil tracks whether the model is "thinking" or "answering" by monitoring special tokens:

thinking.ts
const [START_THINKING_TOKEN_ID, END_THINKING_TOKEN_ID] =
  tokenizer.encode("<think></think>", { add_special_tokens: false });

let state = "answering";

const tokenCallback = (tokens) => {
  const tokenId = Number(tokens[0]);
  if (tokenId === START_THINKING_TOKEN_ID) state = "thinking";
  if (tokenId === END_THINKING_TOKEN_ID) state = "answering";
};
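
This callback is typically wired into transformers.js's TextStreamer alongside the text callback. A sketch building on the snippet above (the exact option wiring may differ from Gerbil's worker):

streamer.ts
import { TextStreamer } from "@huggingface/transformers";

const streamer = new TextStreamer(tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
  // Raw token ids drive the thinking/answering state machine above
  token_callback_function: tokenCallback,
  // Decoded text is forwarded to the main thread, tagged with the current state
  callback_function: (text: string) => self.postMessage({ status: "token", text, state }),
});

// Then passed to generation, e.g. model.generate({ ...inputs, streamer })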

Model Caching

Models are large (100-500MB), so Gerbil caches them locally to avoid re-downloading:

Environment      | Cache Location           | Mechanism
-----------------|--------------------------|-------------------------
Browser          | IndexedDB                | transformers.js built-in
Node.js (CPU)    | ~/.cache/huggingface/hub | transformers.js built-in
Node.js (WebGPU) | Chrome's IndexedDB       | Via ChromeGPUBackend

Cache Behavior

1. First load: downloads from the Hugging Face Hub (~15-30s, depending on model size).
2. Subsequent loads: reads from the local cache (~1-2s in the browser, ~0.5s in Node.js).

Clearing Cache

Browser (in DevTools console):

Console
indexedDB.deleteDatabase("transformers-cache");

Node.js CLI:

Terminal
npx @tryhamster/gerbil cache --clean

Memory Management

Gerbil automatically manages memory to prevent leaks while maintaining performance. For WebGPU inference, memory is bounded and monitored.

Automatic KV Cache Reset

The KV cache automatically resets when it exceeds the model's context length (2048 tokens for Qwen3). This prevents unbounded memory growth:

auto-reset.ts
// Memory automatically resets after ~2048 tokens
// No action needed - happens transparently
const gerbil = new Gerbil();
await gerbil.loadModel("qwen3-0.6b");

// Long conversations work fine - auto-reset preserves the context window
for (let i = 0; i < 100; i++) {
  await gerbil.generate("Tell me something interesting");
}

Memory Bounds

Metric           | Limit | Notes
-----------------|-------|---------------------------------
Per-page maximum | ~4GB  | context length × token size
Concurrent pages | 5 max | Multiple Gerbil instances
Typical usage    | < 2GB | Most conversations < 500 tokens

Memory API

memory-api.ts
// Check memory usage (WebGPU only)
const mem = await gerbil.getMemoryUsage();
if (mem) {
  console.log(`Using ${mem.usedGB.toFixed(1)}GB / ${mem.totalGB.toFixed(1)}GB`);
}

// Clear KV cache manually (resets conversation context)
await gerbil.clearCache();

// Auto-cleanup if threshold exceeded
const didCleanup = await gerbil.checkMemoryAndCleanup(8); // 8GB threshold

// Always dispose when done
await gerbil.dispose();

Long-Running Sessions

For background services or persistent processes, monitor and clean up periodically:

long-running.ts
const gerbil = new Gerbil();
await gerbil.loadModel("qwen3-0.6b");

// Periodic memory monitoring
setInterval(async () => {
  const mem = await gerbil.getMemoryUsage();
  if (mem && mem.usedGB > 10) {
    console.warn(`High memory: ${mem.usedGB.toFixed(1)}GB`);
    await gerbil.clearCache();
  }
}, 60000); // Check every minute

Cleanup Best Practices

✓ Good: Dispose when done

good.ts
const gerbil = new Gerbil();
await gerbil.loadModel("qwen3-0.6b");
// ... use gerbil ...
await gerbil.dispose(); // Frees resources

✗ Bad: Creating many without cleanup

bad.ts
for (let i = 0; i < 10; i++) {
  const g = new Gerbil();
  await g.loadModel("qwen3-0.6b");
  // Forgot to dispose - pages accumulate!
}

Performance Tips

Batch UI Updates

Tokens arrive very fast (~100/sec). Consider batching UI updates with requestAnimationFrame.
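
A minimal sketch of frame-batched rendering; the element id and callback name are illustrative:

batch-updates.ts
const outputEl = document.getElementById("output")!;

let pending = "";
let scheduled = false;

// Called for every streamed token; flushes to the DOM at most once per frame
function onToken(text: string) {
  pending += text;
  if (!scheduled) {
    scheduled = true;
    requestAnimationFrame(() => {
      outputEl.textContent += pending;
      pending = "";
      scheduled = false;
    });
  }
}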

Preload Models

Load models during idle time so they're cached for instant use later.
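
For example, in a browser you might warm the cache while the page is idle (a sketch; requestIdleCallback is not available in every browser):

preload.ts
const gerbil = new Gerbil();

// Kick off the download and caching during idle time; later loads hit the cache
requestIdleCallback(() => {
  gerbil.loadModel("qwen3-0.6b").catch(console.error);
});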

Use Smaller Models

smollm2-135m is 4x smaller than Qwen3 and often sufficient for simple tasks.

Always Cleanup

Call gerbil.dispose() or worker.terminate() to free GPU memory.