On-Device
Native WebGPU

The whole agent loop, on the user's GPU

Gerbil runs text, vision, and embeddings with tools, skills, and MCP — the entire agent loop, executing on the user's GPU.

Browser and Node. No cloud, no API keys. Runs anywhere WebGPU runs, including iPad and iPhone.

Capabilities

What runs locally

Every capability below runs on-device through the native WebGPU engine, with no server round-trips and no API keys. That covers text, vision, embeddings, speech, and the full agent stack: tools, skills, MCP, and memory.

Live

Text

Qwen3.5 & LFM2.5 generation, streaming, structured output.

Live

Vision

Describe & reason about images via the Qwen3.5 ViT.

Live

Embeddings

EmbeddingGemma semantic search & similarity.

Live

Tools

Function calling — the model invokes your code mid-generation.

Live

Skills

Composable, reusable agent capabilities out of the box.

Live

MCP

Model Context Protocol server & client, locally wired.

Live

Memory & RAG

Persistent on-device memory and retrieval — agents remember across sessions, no server.

Live

Speech

Moonshine speech-to-text, on-device. Native text-to-speech (Kani-TTS-2) coming.

The Agent Loop

Watch an autonomous agent run

An inference engine returns text. A harness runs the whole loop. Ask a question and the on-device Qwen3.5 agent reads the page you're on, recalls what you've done this session from native memory, then plans and calls a sequence of skills — searching the docs, generating code, even asking you a question — before it answers. Every step is traced.

  • Gathers context from the live page — route, headings, what's rendered
  • Stores & recalls your session via EmbeddingGemma memory (IndexedDB)
  • Plans and calls multiple skills in a loop — docs search, code-gen, recall
  • Asks you a clarifying question when it decides it needs one
  • Model, embeddings, and memory all on your GPU — nothing sent anywhere
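The plan-and-call loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Gerbil's implementation: `runAgent`, `Model`, `Tool`, `ModelTurn`, and `TraceStep` are hypothetical names invented for this example, and the model is represented as a plain async function so the control flow stays visible.

```typescript
// Hypothetical types for illustration only; none of these come from Gerbil's API.
type ToolCall = { tool: string; args: Record<string, unknown> };
type ModelTurn =
  | { kind: "tool_call"; call: ToolCall }
  | { kind: "answer"; text: string };
type Message = { role: "user" | "assistant" | "tool"; content: string };

// A model takes the conversation so far and either requests a tool or answers.
type Model = (history: Message[]) => Promise<ModelTurn>;
type Tool = (args: Record<string, unknown>) => Promise<string>;

interface TraceStep {
  tool: string;
  result: string;
}

// Runs the loop: ask the model, execute any requested tool, feed the result
// back, and repeat until the model answers or the step budget runs out.
async function runAgent(
  model: Model,
  tools: Map<string, Tool>,
  question: string,
  maxSteps = 8,
): Promise<{ answer: string; trace: TraceStep[] }> {
  const history: Message[] = [{ role: "user", content: question }];
  const trace: TraceStep[] = [];

  for (let step = 0; step < maxSteps; step++) {
    const turn = await model(history);
    if (turn.kind === "answer") {
      return { answer: turn.text, trace };
    }

    const tool = tools.get(turn.call.tool);
    // An unknown tool is reported back to the model instead of throwing,
    // so the model gets a chance to recover on the next turn.
    const result =
      tool !== undefined
        ? await tool(turn.call.args)
        : `error: unknown tool "${turn.call.tool}"`;

    trace.push({ tool: turn.call.tool, result });
    history.push({ role: "assistant", content: JSON.stringify(turn.call) });
    history.push({ role: "tool", content: result });
  }

  throw new Error(`no answer after ${maxSteps} steps`);
}
```

The step budget is a design choice in this sketch: it stops a model that keeps requesting tools from looping forever, and the returned `trace` records each tool call in order.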
Live Demo

More that runs on your GPU

Three more native capabilities, running fully in this browser tab. Attach an image and the Qwen3.5 vision tower describes it. Type two phrases and EmbeddingGemma scores how close they are in meaning. Tap the mic and your speech is transcribed on-device. All three run on the WGSL compute engine, with nothing sent anywhere.

  • First run downloads the model, then it's cached
  • Images, text, and audio never leave your device
  • Needs WebGPU (Chrome/Edge 113+, desktop Safari 18+, iPad/iPhone on iOS/iPadOS 26+)
  • Vision, embeddings, and transcription — each on-device
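For a sense of what "scoring how close two phrases are" can mean once each phrase is an embedding vector, here is a minimal sketch. It assumes cosine similarity as the metric, a common choice that this page does not confirm. `cosineSimilarity`, `recall`, and `MemoryEntry` are hypothetical names, and the vectors are plain number arrays rather than real EmbeddingGemma output.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; higher means the vectors point in closer directions.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical shape for a stored memory: the original text plus its embedding.
interface MemoryEntry {
  text: string;
  vector: number[];
}

// Returns the k stored entries most similar to the query vector, best first.
function recall(query: number[], entries: MemoryEntry[], k: number): MemoryEntry[] {
  return entries
    .map((entry) => ({ entry, score: cosineSimilarity(query, entry.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.entry);
}
```

The same scoring function serves both uses in this sketch: comparing two phrases directly, and ranking stored memories against a query for retrieval.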
Why On-Device

Why it matters

Moving the whole agent loop onto the device changes the economics and the guarantees of what you can ship.

01

Private by default

Prompts, images, and embeddings never leave the device. Ship AI in healthcare, finance, anywhere data can't go to a server.

02

$0 inference cost

It runs on the user's GPU. No per-token billing, no API keys, no model servers to scale or pay for.

03

Works offline

Once the model is cached in IndexedDB, the whole harness keeps working with no network at all.

04

Runs anywhere WebGPU runs

Chrome/Edge 113+, Firefox 141+, desktop Safari 18+, and iPad/iPhone on iOS/iPadOS 26+. Plus Node via node-dawn. One harness, every surface.
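Because support depends on the browser and OS version, an app would typically check for WebGPU before downloading a model. The sketch below uses the standard `navigator.gpu.requestAdapter()` call. `hasWebGPU`, `GpuLike`, and `NavigatorLike` are hypothetical names introduced here so the example compiles without extra type packages, and nothing in it is Gerbil's own API.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Resolves to true only when the WebGPU entry point exists AND the browser
// actually hands back an adapter. Some environments expose the API surface
// but have no usable GPU, in which case requestAdapter() resolves to null.
async function hasWebGPU(nav: NavigatorLike): Promise<boolean> {
  if (!nav.gpu) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    // Treat any failure while probing as "not available".
    return false;
  }
}

// In a browser this would be called as:
//   hasWebGPU(navigator as NavigatorLike).then((ok) => { /* load or show a fallback */ });
```

Taking the navigator as a parameter is a choice made for this sketch: it keeps the check testable outside a browser.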

Build the whole agent locally

$ npm install @tryhamster/gerbil