The whole agent loop, on the user's GPU
Gerbil runs text, vision, and embeddings with tools, skills, and MCP — the entire agent loop, executing on the user's GPU.
Browser and Node. No cloud, no API keys. Runs anywhere WebGPU runs, including iPad and iPhone.
What runs locally
Every box below runs on-device through the native WebGPU engine — no server round-trips, no API keys. Text, vision, embeddings, speech, and the full agent stack: tools, skills, MCP, and memory.
Text
Qwen3.5 & LFM2.5 generation, streaming, structured output.
Vision
Describe & reason about images via the Qwen3.5 ViT.
Embeddings
EmbeddingGemma semantic search & similarity.
Tools
Function calling — the model invokes your code mid-generation.
Skills
Composable, reusable agent capabilities out of the box.
MCP
Model Context Protocol server & client, locally wired.
Memory & RAG
Persistent on-device memory and retrieval — agents remember across sessions, no server.
Speech
Moonshine speech-to-text, on-device. Native text-to-speech (Kani-TTS-2) coming.
Watch an autonomous agent run
An inference engine returns text. A harness runs the whole loop. Ask a question and the on-device Qwen3.5 agent reads the page you're on, recalls what you've done this session from native memory, then plans and calls a sequence of skills — searching the docs, generating code, even asking you a question — before it answers. Every step is traced.
- Gathers context from the live page: route, headings, what's rendered
- Stores and recalls your session via EmbeddingGemma memory (IndexedDB)
- Plans and calls multiple skills in a loop: docs search, code-gen, recall
- Asks you a clarifying question when it decides it needs one
- Model, embeddings, and memory all run on your GPU; nothing is sent anywhere
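The loop described above can be sketched in a few lines. This is a minimal illustration, not Gerbil's actual API: `runAgent`, `Planner`, `Skill`, `Step`, and `TraceEntry` are all hypothetical names, and the planner function stands in for the on-device model. It shows the general shape only: plan, call a skill, record the step in a trace, repeat until the planner answers or a step limit is hit.

```typescript
// Hypothetical types for illustration only; none of these names are Gerbil's API.
type Skill = (input: string) => Promise<string> | string;

type Step =
  | { kind: "call"; skill: string; input: string }
  | { kind: "answer"; text: string };

interface TraceEntry {
  skill: string;
  input: string;
  output: string;
}

// The planner stands in for the model: given the question and the trace so far,
// it returns either the next skill call or a final answer.
type Planner = (
  question: string,
  trace: readonly TraceEntry[],
) => Promise<Step> | Step;

async function runAgent(
  question: string,
  plan: Planner,
  skills: Map<string, Skill>,
  maxSteps = 8,
): Promise<{ answer: string; trace: TraceEntry[] }> {
  const trace: TraceEntry[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = await plan(question, trace);
    if (step.kind === "answer") {
      return { answer: step.text, trace };
    }
    const skill = skills.get(step.skill);
    // An unknown skill is recorded in the trace instead of thrown,
    // so the planner can see the failure and choose another step.
    const output =
      skill === undefined
        ? `error: unknown skill "${step.skill}"`
        : await skill(step.input);
    trace.push({ skill: step.skill, input: step.input, output });
  }
  // The step limit keeps a confused planner from looping forever.
  return { answer: "Stopped: step limit reached.", trace };
}
```

In this sketch, asking the user a clarifying question would be just another entry in `skills` whose function resolves once the user replies, and the returned `trace` is what a step-by-step view would render.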
More that runs on your GPU
Three more native capabilities, running fully in this browser tab. Attach an image and the Qwen3.5 vision tower describes it. Type two phrases and EmbeddingGemma scores how close they are in meaning. Tap the mic and your speech is transcribed on-device. All of it runs on the WGSL compute engine, and nothing is sent anywhere.
- First run downloads the model, then it's cached
- Images, text, and audio never leave your device
- Needs WebGPU (Chrome/Edge 113+, desktop Safari 18+, iPad/iPhone on iOS/iPadOS 26+)
- Vision, embeddings, and transcription, each on-device
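For the phrase-similarity demo, a common way to compare two embedding vectors is cosine similarity. The sketch below assumes that metric; the page does not say which score Gerbil actually reports. The helper names `cosineSimilarity` and `rankBySimilarity` are hypothetical, and the vectors are plain number arrays standing in for whatever the embedding model returns.

```typescript
// Cosine similarity: 1 means same direction, 0 unrelated, -1 opposite.
// Assumed metric for illustration; not confirmed as what Gerbil computes.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored items against a query vector, most similar first.
// This is the basic operation behind embedding-based recall.
function rankBySimilarity<T>(
  query: readonly number[],
  items: readonly { value: T; vector: readonly number[] }[],
): { value: T; score: number }[] {
  return items
    .map((item) => ({
      value: item.value,
      score: cosineSimilarity(query, item.vector),
    }))
    .sort((x, y) => y.score - x.score);
}
```

The same ranking step is how a memory store can recall the closest earlier entries for a new question.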
Why it matters
Moving the whole agent loop onto the device changes the economics and the guarantees of what you can ship.
Private by default
Prompts, images, and embeddings never leave the device. Ship AI in healthcare, finance, anywhere data can't go to a server.
$0 inference cost
It runs on the user's GPU. No per-token billing, no API keys, no model servers to scale or pay for.
Works offline
Once the model is cached in IndexedDB, the whole harness keeps working with no network at all.
Runs anywhere WebGPU runs
Chrome/Edge 113+, Firefox 141+, desktop Safari 18+, and iPad/iPhone on iOS/iPadOS 26+. Plus Node via node-dawn. One harness, every surface.
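Because support depends on the browser and device, an app may want to check for WebGPU before downloading a model. A minimal sketch: `navigator.gpu` and `requestAdapter()` are the standard WebGPU entry points, while `hasWebGPU` and the `NavigatorLike` and `GpuLike` types are hypothetical helpers written so the check can also be exercised outside a browser.

```typescript
// Minimal structural types so the check does not depend on DOM typings.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

interface NavigatorLike {
  gpu?: GpuLike;
}

// Resolves to true only when the WebGPU entry point exists AND an adapter
// is actually available. requestAdapter() resolves to null when the API
// is present but no suitable GPU can be used.
async function hasWebGPU(nav: NavigatorLike | undefined): Promise<boolean> {
  if (nav === undefined || nav.gpu === undefined) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    // Treat any failure while probing as "not available".
    return false;
  }
}

// In a browser this would be called as:
//   const ok = await hasWebGPU(navigator as NavigatorLike);
```

Checking for an adapter rather than only for `navigator.gpu` avoids starting a large model download on a device that exposes the API but has no usable GPU.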