On-Device

Native WebGPU

The whole agent loop, on the user's GPU

Name: Gerbil
Author: Gerbil

Gerbil runs text, vision, and embeddings with tools, skills, and MCP: the entire agent loop, executing on the user's GPU.

Browser and Node. No cloud, no API keys. Runs anywhere WebGPU runs, including iPad and iPhone.


Capabilities

What runs locally

Every capability here runs on-device through the native WebGPU engine. No server round-trips, no API keys. Text, vision, embeddings, speech, and the full agent stack: tools, skills, MCP, and memory.

Live

Text

Qwen3.5 & LFM2.5 generation, streaming, structured output.

Live

Vision

Describe & reason about images via the Qwen3.5 ViT.

Live

Embeddings

EmbeddingGemma semantic search & similarity.

Live

Tools

Function calling: the model invokes your code mid-generation.

Live

Skills

Composable, reusable agent capabilities out of the box.

Live

MCP

Model Context Protocol server & client, locally wired.

Live

Memory & RAG

Persistent on-device memory and retrieval: agents remember across sessions, no server.

Live

Speech

Moonshine speech-to-text, on-device. Native text-to-speech (Kani-TTS-2) is coming.
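The Embeddings and Memory & RAG cards both come down to comparing vectors. Here is a minimal sketch of how on-device recall can work, assuming embeddings are compared by cosine similarity and memories are ranked by that score. `MemoryStore`, `cosine`, and the toy three-dimensional vectors are hypothetical illustrations, not Gerbil's API. A real store would hold EmbeddingGemma vectors and persist them (for example to IndexedDB) instead of keeping them in an in-memory array.

```typescript
// Hypothetical sketch, not Gerbil's API. Vectors would come from an
// embedding model such as EmbeddingGemma; here they are passed in directly.
type Memory = { text: string; vector: number[] };

// Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// In-memory stand-in for a persistent (e.g. IndexedDB-backed) memory.
class MemoryStore {
  private items: Memory[] = [];

  add(text: string, vector: number[]): void {
    this.items.push({ text, vector });
  }

  // Return the k stored memories most similar to the query vector.
  recall(query: number[], k = 3): { text: string; score: number }[] {
    return this.items
      .map((m) => ({ text: m.text, score: cosine(query, m.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k);
  }
}
```

In use, each session event is embedded and added with `store.add(text, vector)`, and the agent embeds its current question and calls `store.recall(queryVector)` to pull the closest memories into its context.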

The Agent Loop

Watch an autonomous agent run

An inference engine returns text; a harness runs the whole loop. Ask a question and the on-device Qwen3.5 agent reads the page you're on, recalls what you've done this session from native memory, then plans and calls a sequence of skills (searching the docs, generating code, even asking you a question) before it answers. Every step is traced.

·Gathers context from the live page: route, headings, what's rendered
·Stores & recalls your session via EmbeddingGemma memory (IndexedDB)
·Plans and calls multiple skills in a loop: docs search, code-gen, recall
·Asks you a clarifying question when it decides it needs one
·Model, embeddings, and memory all on your GPU. Nothing sent anywhere
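Stripped to its skeleton, the loop described above alternates between asking the model for its next step and running the skill it picks. It stops when the model returns an answer or a step budget runs out. The sketch below is a hypothetical harness, not Gerbil's implementation: the `Step` shape, the `skills` registry, `runAgent`, and the scripted `fakeModel` standing in for the on-device model are all assumptions made for illustration.

```typescript
// Hypothetical agent-loop sketch, not Gerbil's API.
// At each step the model either calls a skill or returns a final answer.
type Step =
  | { kind: "call"; skill: string; input: string }
  | { kind: "answer"; text: string };

type Skill = (input: string) => Promise<string>;

// One trace entry per step, so every action can be inspected afterwards.
type TraceEntry = { step: number; action: string; result: string };

async function runAgent(
  question: string,
  nextStep: (history: string[]) => Promise<Step>,
  skills: Record<string, Skill>,
  maxSteps = 8
): Promise<{ answer: string; trace: TraceEntry[] }> {
  const history = [`user: ${question}`];
  const trace: TraceEntry[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = await nextStep(history);
    if (step.kind === "answer") {
      trace.push({ step: i, action: "answer", result: step.text });
      return { answer: step.text, trace };
    }
    const skill = skills[step.skill];
    // Unknown skills are reported back to the model instead of crashing the loop.
    const result = skill
      ? await skill(step.input)
      : `error: unknown skill "${step.skill}"`;
    history.push(`${step.skill}(${step.input}) -> ${result}`);
    trace.push({ step: i, action: step.skill, result });
  }
  throw new Error(`no answer after ${maxSteps} steps`);
}

// Scripted stand-in for the model: search the docs once, then answer.
const fakeModel = async (history: string[]): Promise<Step> =>
  history.length === 1
    ? { kind: "call", skill: "searchDocs", input: "tool calling" }
    : { kind: "answer", text: `Based on: ${history[history.length - 1]}` };

const skills: Record<string, Skill> = {
  searchDocs: async (query) => `docs page about ${query}`,
};
```

With these stand-ins, `runAgent("How do tools work?", fakeModel, skills)` makes one `searchDocs` call and then resolves with the answer and a two-entry trace. The step budget is the design choice that keeps a model that never settles on an answer from looping forever.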

How tool calling works


Multi-Step

A loop that checks its own work

A harness is not one call; it is a loop. Here a classifier grades itself before it answers: it labels the sentiment, verifies that label with a second pass, builds a typed object, validates it against a schema, and only then responds. Five steps, every one on the user's GPU, nothing sent anywhere.

1.Classify the sentiment in one word
2.Verify that label with a second model call
3.Construct the result object
4.Validate it against the schema
5.Respond only if it holds

Structured output & validation


sentiment-pipeline.tsx

import { useEngine } from "@tryhamster/gerbil/hooks";

type Sentiment = "positive" | "negative" | "neutral";

type Analysis = {
  sentence: string;
  sentiment: Sentiment;
  verified: boolean;
};

const LABELS: readonly Sentiment[] = ["positive", "negative", "neutral"];

// The schema every response must satisfy.
// Accepts `unknown` and guards against null/non-objects before reading fields.
function isAnalysis(o: unknown): o is Analysis {
  if (typeof o !== "object" || o === null) return false;
  const r = o as Record<string, unknown>;
  return (
    typeof r.sentence === "string" &&
    LABELS.includes(r.sentiment as Sentiment) &&
    typeof r.verified === "boolean"
  );
}

export function Pipeline() {
  const { complete, isReady } = useEngine();

  // Five steps, every one on the user's GPU.
  async function analyze(sentence: string): Promise<Analysis> {
    // 1. Classify: one word from the on-device model.
    const raw = await complete(
      `Reply with ONLY one word ` +
      `(positive, negative, or neutral) ` +
      `based on the sentiment in this sentence: ${sentence}`,
      { maxTokens: 8, temperature: 0 }
    );
    // Undefined if the reply contains none of the labels; step 4 rejects that.
    const sentiment = raw.toLowerCase().match(
      /positive|negative|neutral/
    )?.[0] as Sentiment | undefined;

    // 2. Verify: a second model call grades the label.
    const verdict = await complete(
      `Does "${sentiment}" correctly describe the sentiment of: ` +
      `"${sentence}"? Reply yes or no.`,
      { maxTokens: 6, temperature: 0 }
    );
    // Only an explicit leading "yes" counts as verified.
    const verified = /^yes\b/i.test(verdict.trim());

    // 3. Construct the object.
    const result = { sentence, sentiment, verified };

    // 4. Validate against the schema before trusting it.
    if (!isAnalysis(result)) {
      throw new Error("model output failed schema validation");
    }

    // 5. Respond.
    return result;
  }

  return (
    <button
      onClick={() =>
        // Handle the promise so a failed validation is reported, not unhandled.
        analyze("Best purchase all year!").then(console.log, console.error)
      }
      disabled={!isReady}
    >
      Analyze
    </button>
  );
}

Live Demo

More that runs on your GPU

Three more native capabilities, running fully in this browser tab. Attach an image and the Qwen3.5 vision tower describes it. Type two phrases and EmbeddingGemma scores how close they are in meaning. Tap the mic and your speech is transcribed on-device. All three run on the WGSL compute engine, and nothing is sent anywhere.

·First run downloads the model, then it's cached
·Images, text, and audio never leave your device
·Needs WebGPU (Chrome/Edge 113+, desktop Safari 18+, iPad/iPhone on iOS/iPadOS 26+)
·Vision, embeddings, and transcription, each on-device
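The first bullet describes a cache-first load: fetch the weights once, then serve them from local storage on every later run. That same pattern is what lets the harness keep working offline. Here is a minimal sketch, assuming a simple key-value store interface. `BlobStore`, `MapStore`, `loadModel`, and the example URL are hypothetical names, not Gerbil's loader, and in a browser the store would be backed by IndexedDB rather than the `Map` used here.

```typescript
// Hypothetical cache-first loader sketch, not Gerbil's implementation.
interface BlobStore {
  get(key: string): Promise<Uint8Array | undefined>;
  set(key: string, value: Uint8Array): Promise<void>;
}

// Map-backed store for illustration; a browser build would persist to IndexedDB.
class MapStore implements BlobStore {
  private data = new Map<string, Uint8Array>();

  async get(key: string): Promise<Uint8Array | undefined> {
    return this.data.get(key);
  }

  async set(key: string, value: Uint8Array): Promise<void> {
    this.data.set(key, value);
  }
}

// Cache-first: only touch the network when the weights are not stored yet.
async function loadModel(
  url: string,
  store: BlobStore,
  download: (url: string) => Promise<Uint8Array>
): Promise<{ bytes: Uint8Array; fromCache: boolean }> {
  const cached = await store.get(url);
  if (cached) return { bytes: cached, fromCache: true };
  const bytes = await download(url);
  await store.set(url, bytes);
  return { bytes, fromCache: false };
}
```

In a browser, `download` could be `async (u) => new Uint8Array(await (await fetch(u)).arrayBuffer())`. Passing it in as a parameter keeps the caching logic testable without a network.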



Why On-Device

Why it matters

Moving the whole agent loop onto the device changes the economics and the guarantees of what you can ship.

Private by default

Prompts, images, and embeddings never leave the device. Ship AI in healthcare, finance, or anywhere data can't go to a server.

$0 inference cost

It runs on the user's GPU. No per-token billing, no API keys, no model servers to scale or pay for.

Works offline

Once the model is cached in IndexedDB, the whole harness keeps working with no network at all.

Runs anywhere WebGPU runs

Chrome/Edge 113+, Firefox 141+, desktop Safari 18+, and iPad/iPhone on iOS/iPadOS 26+. Plus Node via node-dawn. One harness, every surface.
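Because everything depends on WebGPU, an app built this way will typically check for it before downloading any weights. Here is a minimal sketch using the standard WebGPU entry points: `navigator.gpu`, and `requestAdapter()`, which resolves to `null` when no suitable adapter is available. `GpuLike`, `NavigatorLike`, and `detectWebGPU` are hypothetical names. The navigator is passed in as a parameter so the check also compiles without `@webgpu/types` and can be exercised outside a browser.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
type GpuLike = { requestAdapter(): Promise<unknown> };
type NavigatorLike = { gpu?: GpuLike } | undefined;

type WebGPUStatus = "supported" | "no-api" | "no-adapter" | "error";

// Hypothetical helper: separates "API missing" from "API present but no usable GPU".
async function detectWebGPU(nav: NavigatorLike): Promise<WebGPUStatus> {
  if (!nav?.gpu) return "no-api"; // browser or runtime without WebGPU
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "supported" : "no-adapter"; // null means no usable adapter
  } catch {
    return "error";
  }
}
```

In a browser you would call `detectWebGPU(navigator as unknown as NavigatorLike)` and show a fallback message for anything other than `"supported"`. Distinguishing the cases lets the page tell users whether to switch browsers or whether their GPU is unavailable.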

Build the whole agent locally

$ npm install @tryhamster/gerbil

