Vision

Understand and analyze images with Vision Language Models (VLMs) running locally on your device.

Quick Start

Analyze images with just a few lines of code:

vision-basic.ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();
await g.loadModel("ministral-3b"); // Vision-capable model

const result = await g.generate("What's in this image?", {
  images: [{ source: "https://example.com/photo.jpg" }]
});

console.log(result.text);
// → "A golden retriever playing fetch in a sunny park..."

Vision Models

Currently supported vision-capable models:

Model          Parameters           Context       Features             Size
ministral-3b   3B (+ 0.4B vision)   256K tokens   Vision + Reasoning   ~2.5GB
Note: More vision models will be added as they become available in ONNX format.

Image Input Formats

Gerbil accepts images in multiple formats:

image-formats.ts
// URL (recommended for web images)
images: [{ source: "https://example.com/image.jpg" }]
// Data URI (base64 encoded)
images: [{ source: "data:image/png;base64,iVBORw0KGgo..." }]
// Local file path (Node.js only, auto-converted to data URI)
images: [{ source: "/path/to/image.png" }]
// With alt text (optional, provides context to the model)
images: [{ source: "...", alt: "A photo of a sunset over the ocean" }]

Multiple Images

Pass multiple images for comparison or multi-image understanding:

multiple-images.ts
const result = await g.generate("What's the difference between these two images?", {
images: [
{ source: "https://example.com/before.jpg" },
{ source: "https://example.com/after.jpg" }
]
});
console.log(result.text);
// → "The first image shows the room before renovation with beige walls..."

Model Capability Detection

Check if the loaded model supports vision:

capability-detection.ts
await g.loadModel("ministral-3b");

if (g.supportsVision()) {
  // Use vision features
  const result = await g.generate("Describe this", {
    images: [{ source: imageUrl }]
  });
} else {
  // Text-only mode
  const result = await g.generate("Describe what you know about...");
}

Graceful Fallback

If you pass images to a non-vision model, Gerbil handles it gracefully:

graceful-fallback.ts
// This works with ANY model - images are used if supported
await g.loadModel("qwen3-0.6b"); // Non-vision model

const result = await g.generate("Describe this", {
  images: [{ source: imageUrl }]
});
// → Logs warning, ignores images, processes text prompt normally

AI SDK Integration

Use vision models with the Vercel AI SDK (v5+):

ai-sdk-vision.ts
import { generateText } from "ai";
import { gerbil } from "@tryhamster/gerbil/ai";

const { text } = await generateText({
  model: gerbil("ministral-3b"),
  messages: [
    {
      role: "user",
      content: [
        { type: "image", image: new URL("https://example.com/photo.jpg") },
        { type: "text", text: "Describe this image in detail" },
      ],
    },
  ],
});

Supported Image Part Formats

ai-sdk-formats.ts
// URL object
{ type: "image", image: new URL("https://...") }
// URL string
{ type: "image", image: "https://..." }
// Base64 string (data URI)
{ type: "image", image: "data:image/png;base64,..." }
// Uint8Array with mime type
{ type: "image", image: imageBytes, mimeType: "image/png" }

Express & Next.js

Express

express-vision.ts
import express from "express";
import { gerbil } from "@tryhamster/gerbil/express";

const app = express();
app.use("/ai", gerbil({ model: "ministral-3b" })());

// POST /ai/generate
// Body: { prompt: "Describe this", images: [{ source: "https://..." }] }
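
For reference, a client request to the mounted route might look like the sketch below. The port and the JSON response shape are assumptions, not part of the documented API; check your Gerbil version for the exact response format.

client-request.ts
// Sketch: POST to the /ai/generate route mounted above.
// Port 3000 and the response shape are assumptions.
const res = await fetch("http://localhost:3000/ai/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    prompt: "Describe this",
    images: [{ source: "https://example.com/photo.jpg" }]
  }),
});

console.log(await res.json());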

Next.js App Router

nextjs-vision.ts
// app/api/chat/route.ts
import { gerbil } from "@tryhamster/gerbil/next";

export const POST = gerbil.handler({ model: "ministral-3b" });

// Client usage:
// fetch("/api/chat", {
//   method: "POST",
//   body: JSON.stringify({
//     prompt: "What's in this image?",
//     images: [{ source: dataUri }]
//   })
// })

React Hooks

Use the useChat hook with image attachments:

react-vision.tsx
import { useChat } from "@tryhamster/gerbil/browser";

function VisionChat() {
  const {
    messages,
    input,
    setInput,
    handleSubmit,
    attachImage,
    attachedImages,
    clearImages,
    sendWithImages,
  } = useChat({ model: "ministral-3b" });

  const handleFileSelect = (e: React.ChangeEvent<HTMLInputElement>) => {
    const file = e.target.files?.[0];
    if (file) {
      const reader = new FileReader();
      reader.onload = () => attachImage(reader.result as string);
      reader.readAsDataURL(file);
    }
  };

  return (
    <div>
      {/* Messages with image support */}
      {messages.map(m => (
        <div key={m.id}>
          {m.images?.map((img, i) => (
            <img key={i} src={img} alt="" className="max-w-xs rounded" />
          ))}
          <p>{m.content}</p>
        </div>
      ))}

      {/* Image attachment */}
      <input type="file" accept="image/*" onChange={handleFileSelect} />

      {attachedImages.length > 0 && (
        <div className="flex items-center gap-2">
          📎 {attachedImages.length} image(s) attached
          <button onClick={clearImages}>Clear</button>
        </div>
      )}

      {/* Input form */}
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={e => setInput(e.target.value)}
          placeholder="Describe the image..."
        />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}

Built-in Vision Skills

Gerbil includes pre-built skills for common vision tasks:

describe-image

Generate detailed image descriptions.

describe-image.ts
import { describeImage } from "@tryhamster/gerbil/skills";
const description = await describeImage({
image: "https://example.com/photo.jpg",
focus: "details", // "general" | "details" | "text" | "objects" | "scene"
format: "bullets", // "paragraph" | "bullets" | "structured"
});

analyze-screenshot

Analyze UI screenshots for feedback.

analyze-screenshot.ts
import { analyzeScreenshot } from "@tryhamster/gerbil/skills";

const analysis = await analyzeScreenshot({
  image: screenshotDataUri,
  type: "accessibility", // "ui-review" | "accessibility" | "suggestions" | "qa"
});
// → "Missing alt text on hero image. Color contrast ratio is 3.2:1..."

extract-from-image

Extract text, code, or data from images.

extract-from-image.ts
import { extractFromImage } from "@tryhamster/gerbil/skills";

const extracted = await extractFromImage({
  image: documentPhoto,
  extract: "text", // "text" | "data" | "code" | "table" | "diagram"
  outputFormat: "markdown", // "raw" | "json" | "markdown"
});

compare-images

Compare two images and describe differences.

compare-images.ts
import { compareImages } from "@tryhamster/gerbil/skills";

const comparison = await compareImages({
  image1: beforeScreenshot,
  image2: afterScreenshot,
  focus: "differences", // "differences" | "similarities" | "detailed"
});
// → "The header color changed from blue to green. A new banner..."

caption-image

Generate alt text or captions for images.

caption-image.ts
import { captionImage } from "@tryhamster/gerbil/skills";

const caption = await captionImage({
  image: photo,
  style: "descriptive", // "concise" | "descriptive" | "creative" | "funny"
});
// → "A golden sunset paints the sky in warm oranges and purples..."

Performance Tips

WebGPU Acceleration

Vision models benefit significantly from GPU acceleration:

webgpu.ts
// Node.js: Uses Chrome backend for WebGPU
await g.loadModel("ministral-3b"); // Auto-detects WebGPU

// Browser: Native WebGPU
await g.loadModel("ministral-3b", { device: "webgpu" });

// Check current device mode
console.log(g.getDeviceMode()); // "webgpu" | "cpu"

Image Size

  • Larger images take longer to process
  • Consider resizing to between 512×512 and 1024×1024 for optimal performance (see the sketch after this list)
  • Models cache in IndexedDB (browser) after first download
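
A minimal browser-side resize helper, using only standard canvas APIs. The function name is ours, not part of Gerbil:

resize-image.ts
// Downscale an image file so its longest side is at most 1024px,
// then return it as a data URI ready to attach or send.
async function resizeToDataUri(file: File, maxSide = 1024): Promise<string> {
  const bitmap = await createImageBitmap(file);
  const scale = Math.min(1, maxSide / Math.max(bitmap.width, bitmap.height));
  const canvas = document.createElement("canvas");
  canvas.width = Math.round(bitmap.width * scale);
  canvas.height = Math.round(bitmap.height * scale);
  canvas.getContext("2d")!.drawImage(bitmap, 0, 0, canvas.width, canvas.height);
  return canvas.toDataURL("image/jpeg", 0.9);
}

// Usage with the useChat hook shown earlier:
// attachImage(await resizeToDataUri(file));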

Expected Performance

Metric                   Value
Vision model load time   ~2s (cached)
Image processing         ~0.5s
Generation speed         70-100+ tok/s (WebGPU)
Memory usage             ~4GB (model + KV cache)

Troubleshooting

"Model doesn't support vision"

Make sure you're using a vision-capable model like ministral-3b.
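
A quick way to catch this at runtime, using only the API shown above (a sketch):

vision-guard.ts
await g.loadModel("qwen3-0.6b");

if (!g.supportsVision()) {
  // Swap in a vision-capable model before sending images
  await g.loadModel("ministral-3b");
}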

Slow image processing

  • Verify WebGPU is being used: g.getDeviceMode() should return "webgpu"
  • Resize large images before sending (see Image Size above)
  • In Node.js, the Chrome backend provides GPU acceleration

Image not loading

  • Check that the URL is accessible (CORS may block cross-origin URLs in the browser)
  • For local files, ensure the path is absolute (see the sketch after this list)
  • Base64 data URIs must include the mime type prefix: data:image/png;base64,...
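
If a relative path is the problem, resolving it first is a quick fix. A sketch; the file name is hypothetical:

absolute-path.ts
import path from "node:path";

// Gerbil accepts local file paths in Node.js, but they should be absolute.
const absolute = path.resolve("./photos/cat.png"); // hypothetical file

const result = await g.generate("Describe this", {
  images: [{ source: absolute }]
});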

API Reference

ImageInput

ImageInput.ts
interface ImageInput {
  /** Image source: URL, base64 data URI, or local file path */
  source: string;
  /** Optional alt text for context */
  alt?: string;
}

GenerateOptions (with images)

GenerateOptions.ts
interface GenerateOptions {
  // ... standard options ...
  /** Images to include (only used if model supports vision) */
  images?: ImageInput[];
}

supportsVision()

supportsVision.ts
// Returns true if the loaded model supports vision input
g.supportsVision(): boolean

ModelConfig (vision fields)

ModelConfig.ts
interface ModelConfig {
  // ... standard properties ...
  /** Whether model supports vision/image input */
  supportsVision?: boolean;
  /** Size of vision encoder (if applicable) */
  visionEncoderSize?: string;
}