Models
Gerbil ships with native models that are ready to run out of the box, and it can load almost any compatible model straight from Hugging Face.
Native Models
The native WebGPU engine runs these models entirely on-device. Load them by HuggingFace repo id with WebGPUEngine.create:
| Model | Type | Size | Speed | Think | Vision | Embed | TTS | STT | Browser | Best For |
|---|---|---|---|---|---|---|---|---|---|---|
qwen3.5-0.8b | LLM | ~650 MB | Text + vision, 262K context, reasoning | |||||||
qwen3.5-2b | LLM | ~1.7 GB | Text + vision, higher quality — iPad & desktop | |||||||
lfm2.5-350m | LLM | ~300 MB | Fast, lightweight text generation | |||||||
gemma-4-e2b | LLM | ~3.6 GB (PLE streams, ~1GB resident) | Text + vision, larger desktop-leaning quality, reasoning | |||||||
Embeddings | ||||||||||
embeddinggemma-300m | EMBED | ~173MB | 768-dim semantic search (asymmetric query/document) | |||||||
Text-to-Speech | ||||||||||
kani-tts-2 | TTS | — | On-device speech synthesis (coming — not yet runnable) | |||||||
Speech-to-Text | ||||||||||
moonshine-base | STT | ~190MB | Raw-waveform English transcription (greedy) | |||||||
Choosing a Model
For general use, vision & reasoning
Qwen3.5-0.8B is the default all-rounder: a 262K context window, thinking support, and a built-in vision tower — all in one checkpoint. Roughly ~225 tok/s in-browser on desktop, ~50 tok/s on iPad, and ~38 tok/s on iPhone (Safari 26+).
import { WebGPUEngine } from "@tryhamster/gerbil/gpu";
const engine = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit", // ~404 MB});const result = await engine.generate("What is 17 * 23?", { sampling: { temperature: 0 }, // greedy});console.log(result.text);console.log(result.thinking); // step-by-step reasoning when the model thinksFor speed
LFM2.5-350M (LiquidAI's hybrid conv/attention model) is the faster, smaller text default — ~600 tok/s desktop server-side (2.8× Qwen), ~46 tok/s in-browser on mobile.
const engine = await WebGPUEngine.create({ repo: "LiquidAI/LFM2.5-350M-MLX-4bit", // ~199 MB});await engine.generate("Write a haiku about coding");For larger, desktop-leaning quality
gemma-4-e2b-it-4bit is the smallest Gemma 4 — a stronger text-only decoder. The 4-bit checkpoint is ~2.5 GB on disk, but its per-layer embedding (PLE) table streams from CPU, so resident GPU memory stays ~0.8–1.1 GB (mobile-viable, ~83 tok/s in-browser). It leans desktop next to Qwen3.5 / LFM2.5. Gemma 4 runs here as a text-only model — for images, use Qwen3.5's built-in vision tower.
const engine = await WebGPUEngine.create({ repo: "mlx-community/gemma-4-e2b-it-4bit", // ~2.5 GB, PLE streams from CPU});await engine.generate("Explain WebGPU compute shaders to a web developer");For vision
Qwen3.5-0.8B understands images through its own ViT tower (no separate vision model). Pass enableVision: true and call describeImage:
const engine = await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit", enableVision: true, // builds the ViT tower (+~192 MB)});const { text } = await engine.describeImage( { pixels, width, height }, // RGB pixels from a canvas "Describe this image.");console.log(text);See the Vision documentation for decoding images to pixels and more examples.
For embeddings
EmbeddingGemma-300M produces 768-dim, L2-normalized vectors with asymmetric query/document prefixes. Runs on iPad.
const engine = await WebGPUEngine.create({ repo: "mlx-community/embeddinggemma-300m-4bit", // ~173 MB embedding: true,});const vec = await engine.embed("how do I cache a model?", { taskType: "query" });console.log(vec.length); // 768Using HuggingFace Models
The native engine loads compatible models straight from HuggingFace by repo id. Compact 4-bit MLX-community repos load fastest and fit comfortably on mobile:
// Pass any compatible repo id to create()await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit" });await WebGPUEngine.create({ repo: "LiquidAI/LFM2.5-350M-MLX-4bit" });await WebGPUEngine.create({ repo: "mlx-community/embeddinggemma-300m-4bit", embedding: true,});Note: The native engine runs supported architectures (Qwen3.5, LFM2, Gemma3 encoder, Moonshine). The engine compiles WGSL compute shaders directly for each model and runs them on WebGPU.
Create Options
Customize how the engine loads and runs a model:
await WebGPUEngine.create({ repo: "mlx-community/Qwen3.5-0.8B-4bit",
// Quantization (from the repo / LoadModelOptions) dtype: "q4", // "f32" | "q4"
// Capability flags embedding: false, // load as an embedding model (enables embed()) enableVision: false, // build the ViT tower (enables describeImage())
// Memory / context maxSeqLen: 4096, // capped 4096 desktop, 512-2048 on WebKit/iOS kvMode: "packed-f16", // "f32" | "native-f16" | "packed-f16" (auto if omitted)
// Progress onProgress: (loaded, total, message) => { const pct = total ? Math.round((loaded / total) * 100) : loaded; console.log(`${pct}% — ${message}`); },});Model Caching
In the browser, models are cached in IndexedDB after the first download — subsequent loads skip the network and reuse the compiled weights. There is no CPU/WASM fallback: a browser without WebGPU cannot run the native engine, so feature-detect WebGPU up front with "gpu" in navigator.