Vision

Understand and analyze images with Vision Language Models (VLMs) running locally on your device.

Quick Start

Analyze images with just a few lines of code:

vision-basic.ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();
await g.loadModel("ministral-3b"); // Vision-capable model

const result = await g.generate("What's in this image?", {
  images: [{ source: "https://example.com/photo.jpg" }]
});

console.log(result.text);
// → "A golden retriever playing fetch in a sunny park..."

Vision Models

Currently supported vision-capable models:

Model          Parameters           Context       Features             Size
ministral-3b   3B (+ 0.4B vision)   256K tokens   Vision + Reasoning   ~2.5GB
Note: More vision models will be added as they become available in ONNX format.

Image Input Formats

Gerbil accepts images in multiple formats:

image-formats.ts
// URL (recommended for web images)
images: [{ source: "https://example.com/image.jpg" }]
// Data URI (base64 encoded)
images: [{ source: "data:image/png;base64,iVBORw0KGgo..." }]
// Local file path (Node.js only, auto-converted to data URI)
images: [{ source: "/path/to/image.png" }]
// With alt text (optional, provides context to the model)
images: [{ source: "...", alt: "A photo of a sunset over the ocean" }]

Multiple Images

Pass multiple images for comparison or multi-image understanding:

multiple-images.ts
const result = await g.generate("What's the difference between these two images?", {
images: [
{ source: "https://example.com/before.jpg" },
{ source: "https://example.com/after.jpg" }
]
});
console.log(result.text);
// → "The first image shows the room before renovation with beige walls..."

Model Capability Detection

Check if the loaded model supports vision:

capability-detection.ts
await g.loadModel("ministral-3b");

if (g.supportsVision()) {
  // Use vision features
  const result = await g.generate("Describe this", {
    images: [{ source: imageUrl }]
  });
} else {
  // Text-only mode
  const result = await g.generate("Describe what you know about...");
}

Graceful Fallback

If you pass images to a non-vision model, Gerbil handles it gracefully:

graceful-fallback.ts
// This works with ANY model - images are used if supported
await g.loadModel("qwen3-0.6b"); // Non-vision model

const result = await g.generate("Describe this", {
  images: [{ source: imageUrl }]
});
// → Logs warning, ignores images, processes text prompt normally

AI SDK Integration

Use vision models with the Vercel AI SDK (v5+):

ai-sdk-vision.ts
import { generateText } from "ai";
import { gerbil } from "@tryhamster/gerbil/ai";

const { text } = await generateText({
  model: gerbil("ministral-3b"),
  messages: [
    {
      role: "user",
      content: [
        { type: "image", image: new URL("https://example.com/photo.jpg") },
        { type: "text", text: "Describe this image in detail" },
      ],
    },
  ],
});

Supported Image Part Formats

ai-sdk-formats.ts
// URL object
{ type: "image", image: new URL("https://...") }
// URL string
{ type: "image", image: "https://..." }
// Base64 string (data URI)
{ type: "image", image: "data:image/png;base64,..." }
// Uint8Array with mime type
{ type: "image", image: imageBytes, mimeType: "image/png" }

Express & Next.js

Express

express-vision.ts
import express from "express";
import { gerbil } from "@tryhamster/gerbil/express";

const app = express();
app.use("/ai", gerbil({ model: "ministral-3b" })());

// POST /ai/generate
// Body: { prompt: "Describe this", images: [{ source: "https://..." }] }
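
For reference, a client request to the mounted route might look like the sketch below. The port and the JSON response shape are assumptions, not part of the documented API; check your Gerbil version for the exact response format.

client-request.ts
// Sketch: POST to the /ai/generate route mounted above.
// Port 3000 and the response shape are assumptions.
const res = await fetch("http://localhost:3000/ai/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    prompt: "Describe this",
    images: [{ source: "https://example.com/photo.jpg" }]
  }),
});

console.log(await res.json());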

Next.js App Router

nextjs-vision.ts
// app/api/chat/route.ts
import { gerbil } from "@tryhamster/gerbil/next";

export const POST = gerbil.handler({ model: "ministral-3b" });

// Client usage:
// fetch("/api/chat", {
//   method: "POST",
//   body: JSON.stringify({
//     prompt: "What's in this image?",
//     images: [{ source: dataUri }]
//   })
// })

React Hooks

Use the useChat hook with image attachments:

react-vision.tsx
import { useChat } from "@tryhamster/gerbil/browser";

function VisionChat() {
  const {
    messages,
    input,
    setInput,
    handleSubmit,
    attachImage,
    attachedImages,
    clearImages,
    sendWithImages,
  } = useChat({ model: "ministral-3b" });

  const handleFileSelect = (e: React.ChangeEvent<HTMLInputElement>) => {
    const file = e.target.files?.[0];
    if (file) {
      const reader = new FileReader();
      reader.onload = () => attachImage(reader.result as string);
      reader.readAsDataURL(file);
    }
  };

  return (
    <div>
      {/* Messages with image support */}
      {messages.map(m => (
        <div key={m.id}>
          {m.images?.map((img, i) => (
            <img key={i} src={img} alt="" className="max-w-xs rounded" />
          ))}
          <p>{m.content}</p>
        </div>
      ))}

      {/* Image attachment */}
      <input type="file" accept="image/*" onChange={handleFileSelect} />

      {attachedImages.length > 0 && (
        <div className="flex items-center gap-2">
          📎 {attachedImages.length} image(s) attached
          <button onClick={clearImages}>Clear</button>
        </div>
      )}

      {/* Input form */}
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={e => setInput(e.target.value)}
          placeholder="Describe the image..."
        />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}

Built-in Vision Skills

Gerbil includes pre-built skills for common vision tasks:

describe-image

Generate detailed image descriptions.

describe-image.ts
import { describeImage } from "@tryhamster/gerbil/skills";
const description = await describeImage({
image: "https://example.com/photo.jpg",
focus: "details", // "general" | "details" | "text" | "objects" | "scene"
format: "bullets", // "paragraph" | "bullets" | "structured"
});

analyze-screenshot

Analyze UI screenshots for feedback.

analyze-screenshot.ts
import { analyzeScreenshot } from "@tryhamster/gerbil/skills";

const analysis = await analyzeScreenshot({
  image: screenshotDataUri,
  type: "accessibility", // "ui-review" | "accessibility" | "suggestions" | "qa"
});
// → "Missing alt text on hero image. Color contrast ratio is 3.2:1..."

extract-from-image

Extract text, code, or data from images.

extract-from-image.ts
import { extractFromImage } from "@tryhamster/gerbil/skills";

const extracted = await extractFromImage({
  image: documentPhoto,
  extract: "text", // "text" | "data" | "code" | "table" | "diagram"
  outputFormat: "markdown", // "raw" | "json" | "markdown"
});

compare-images

Compare two images and describe differences.

compare-images.ts
import { compareImages } from "@tryhamster/gerbil/skills";

const comparison = await compareImages({
  image1: beforeScreenshot,
  image2: afterScreenshot,
  focus: "differences", // "differences" | "similarities" | "detailed"
});
// → "The header color changed from blue to green. A new banner..."

caption-image

Generate alt text or captions for images.

caption-image.ts
import { captionImage } from "@tryhamster/gerbil/skills";

const caption = await captionImage({
  image: photo,
  style: "descriptive", // "concise" | "descriptive" | "creative" | "funny"
});
// → "A golden sunset paints the sky in warm oranges and purples..."

Performance Tips

WebGPU Acceleration

Vision models benefit significantly from GPU acceleration:

webgpu.ts
// Node.js: Uses Chrome backend for WebGPU
await g.loadModel("ministral-3b"); // Auto-detects WebGPU

// Browser: Native WebGPU
await g.loadModel("ministral-3b", { device: "webgpu" });

// Check current device mode
console.log(g.getDeviceMode()); // "webgpu" | "cpu"

Image Size

  • Larger images take longer to process
  • Consider resizing to between 512×512 and 1024×1024 for optimal performance (see the sketch after this list)
  • Models cache in IndexedDB (browser) after first download
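
A minimal browser-side resize helper, using only standard canvas APIs. The function name is ours, not part of Gerbil:

resize-image.ts
// Downscale an image file so its longest side is at most 1024px,
// then return it as a data URI ready to attach or send.
async function resizeToDataUri(file: File, maxSide = 1024): Promise<string> {
  const bitmap = await createImageBitmap(file);
  const scale = Math.min(1, maxSide / Math.max(bitmap.width, bitmap.height));
  const canvas = document.createElement("canvas");
  canvas.width = Math.round(bitmap.width * scale);
  canvas.height = Math.round(bitmap.height * scale);
  canvas.getContext("2d")!.drawImage(bitmap, 0, 0, canvas.width, canvas.height);
  return canvas.toDataURL("image/jpeg", 0.9);
}

// Usage with the useChat hook shown earlier:
// attachImage(await resizeToDataUri(file));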

Expected Performance

Metric                   Value
Vision model load time   ~2s (cached)
Image processing         ~0.5s
Generation speed         70-100+ tok/s (WebGPU)
Memory usage             ~4GB (model + KV cache)

Troubleshooting

"Model doesn't support vision"

Make sure you're using a vision-capable model like ministral-3b.
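
A quick way to catch this at runtime, using only the API shown above (a sketch):

vision-guard.ts
await g.loadModel("qwen3-0.6b");

if (!g.supportsVision()) {
  // Swap in a vision-capable model before sending images
  await g.loadModel("ministral-3b");
}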

Slow image processing

  • Verify WebGPU is being used: g.getDeviceMode() should return "webgpu"
  • Resize large images before sending (see Image Size above)
  • In Node.js, the Chrome backend provides GPU acceleration

Image not loading

  • Check that the URL is accessible (CORS may block cross-origin URLs in the browser)
  • For local files, ensure the path is absolute (see the sketch after this list)
  • Base64 data URIs must include the mime type prefix: data:image/png;base64,...
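
If a relative path is the problem, resolving it first is a quick fix. A sketch; the file name is hypothetical:

absolute-path.ts
import path from "node:path";

// Gerbil accepts local file paths in Node.js, but they should be absolute.
const absolute = path.resolve("./photos/cat.png"); // hypothetical file

const result = await g.generate("Describe this", {
  images: [{ source: absolute }]
});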

API Reference

ImageInput

ImageInput.ts
interface ImageInput {
  /** Image source: URL, base64 data URI, or local file path */
  source: string;
  /** Optional alt text for context */
  alt?: string;
}

GenerateOptions (with images)

GenerateOptions.ts
interface GenerateOptions {
  // ... standard options ...
  /** Images to include (only used if model supports vision) */
  images?: ImageInput[];
}

supportsVision()

supportsVision.ts
// Returns true if the loaded model supports vision input
g.supportsVision(): boolean

ModelConfig (vision fields)

ModelConfig.ts
interface ModelConfig {
  // ... standard properties ...
  /** Whether model supports vision/image input */
  supportsVision?: boolean;
  /** Size of vision encoder (if applicable) */
  visionEncoderSize?: string;
}