# Vision
Understand and analyze images with Vision Language Models (VLMs) running locally on your device.
## Quick Start
Analyze images with just a few lines of code:
```ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();
await g.loadModel("ministral-3b"); // Vision-capable model

const result = await g.generate("What's in this image?", {
  images: [{ source: "https://example.com/photo.jpg" }]
});

console.log(result.text);
// → "A golden retriever playing fetch in a sunny park..."
```

## Vision Models
Currently supported vision-capable models:
| Model | Parameters | Context | Features | Size |
|---|---|---|---|---|
| ministral-3b | 3B (+ 0.4B vision) | 256K tokens | Vision + Reasoning | ~2.5GB |
## Image Input Formats
Gerbil accepts images in multiple formats:
```ts
// URL (recommended for web images)
images: [{ source: "https://example.com/image.jpg" }]

// Data URI (base64 encoded)
images: [{ source: "data:image/png;base64,iVBORw0KGgo..." }]

// Local file path (Node.js only, auto-converted to data URI)
images: [{ source: "/path/to/image.png" }]

// With alt text (optional, provides context to the model)
images: [{ source: "...", alt: "A photo of a sunset over the ocean" }]
```

## Multiple Images
Pass multiple images for comparison or multi-image understanding:
const result = await g.generate("What's the difference between these two images?", { images: [ { source: "https://example.com/before.jpg" }, { source: "https://example.com/after.jpg" } ]});
console.log(result.text);// → "The first image shows the room before renovation with beige walls..."Model Capability Detection
Check if the loaded model supports vision:
await g.loadModel("ministral-3b");
if (g.supportsVision()) { // Use vision features const result = await g.generate("Describe this", { images: [{ source: imageUrl }] });} else { // Text-only mode const result = await g.generate("Describe what you know about...");}Graceful Fallback
If you pass images to a non-vision model, Gerbil handles it gracefully:
```ts
// This works with ANY model - images are used if supported
await g.loadModel("qwen3-0.6b"); // Non-vision model

const result = await g.generate("Describe this", {
  images: [{ source: imageUrl }]
});
// → Logs warning, ignores images, processes text prompt normally
```

## AI SDK Integration
Use vision models with Vercel AI SDK v5+:
```ts
import { generateText } from "ai";
import { gerbil } from "@tryhamster/gerbil/ai";

const { text } = await generateText({
  model: gerbil("ministral-3b"),
  messages: [
    {
      role: "user",
      content: [
        { type: "image", image: new URL("https://example.com/photo.jpg") },
        { type: "text", text: "Describe this image in detail" },
      ],
    },
  ],
});
```

### Supported Image Part Formats
```ts
// URL object
{ type: "image", image: new URL("https://...") }

// URL string
{ type: "image", image: "https://..." }

// Base64 string (data URI)
{ type: "image", image: "data:image/png;base64,..." }

// Uint8Array with mime type
{ type: "image", image: imageBytes, mimeType: "image/png" }
```
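For the `Uint8Array` form, the bytes can come from anywhere, for example a local file read in Node.js. A minimal sketch, assuming a local `./photo.png` (the file path and prompt are placeholders):

```ts
import { readFile } from "node:fs/promises";
import { generateText } from "ai";
import { gerbil } from "@tryhamster/gerbil/ai";

// Read raw image bytes and pass them with an explicit mime type
const imageBytes = new Uint8Array(await readFile("./photo.png"));

const { text } = await generateText({
  model: gerbil("ministral-3b"),
  messages: [
    {
      role: "user",
      content: [
        { type: "image", image: imageBytes, mimeType: "image/png" },
        { type: "text", text: "What does this image show?" },
      ],
    },
  ],
});
```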
## Express & Next.js

### Express
```ts
import express from "express";
import { gerbil } from "@tryhamster/gerbil/express";

const app = express();
app.use("/ai", gerbil({ model: "ministral-3b" })());

// POST /ai/generate
// Body: { prompt: "Describe this", images: [{ source: "https://..." }] }
```
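A client can then call the mounted endpoint with a plain `fetch` using the body shape above; a minimal sketch (the port and the JSON response handling are assumptions, since the response shape isn't documented here):

```ts
// Send a prompt plus an image URL to the Express endpoint mounted above
const res = await fetch("http://localhost:3000/ai/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    prompt: "Describe this",
    images: [{ source: "https://example.com/photo.jpg" }],
  }),
});

console.log(await res.json());
```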
### Next.js App Router

```ts
// app/api/chat/route.ts
import { gerbil } from "@tryhamster/gerbil/next";

export const POST = gerbil.handler({ model: "ministral-3b" });

// Client usage:
// fetch("/api/chat", {
//   method: "POST",
//   body: JSON.stringify({
//     prompt: "What's in this image?",
//     images: [{ source: dataUri }]
//   })
// })
```

## React Hooks
Use the useChat hook with image attachments:
```tsx
import { useChat } from "@tryhamster/gerbil/browser";

function VisionChat() {
  const {
    messages,
    input,
    setInput,
    handleSubmit,
    attachImage,
    attachedImages,
    clearImages,
    sendWithImages,
  } = useChat({ model: "ministral-3b" });

  const handleFileSelect = (e: React.ChangeEvent<HTMLInputElement>) => {
    const file = e.target.files?.[0];
    if (file) {
      const reader = new FileReader();
      reader.onload = () => attachImage(reader.result as string);
      reader.readAsDataURL(file);
    }
  };

  return (
    <div>
      {/* Messages with image support */}
      {messages.map(m => (
        <div key={m.id}>
          {m.images?.map((img, i) => (
            <img key={i} src={img} alt="" className="max-w-xs rounded" />
          ))}
          <p>{m.content}</p>
        </div>
      ))}

      {/* Image attachment */}
      <input type="file" accept="image/*" onChange={handleFileSelect} />

      {attachedImages.length > 0 && (
        <div className="flex items-center gap-2">
          📎 {attachedImages.length} image(s) attached
          <button onClick={clearImages}>Clear</button>
        </div>
      )}

      {/* Input form */}
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={e => setInput(e.target.value)}
          placeholder="Describe the image..."
        />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}
```

## Built-in Vision Skills
Gerbil includes pre-built skills for common vision tasks:
### describe-image
Generate detailed image descriptions.
```ts
import { describeImage } from "@tryhamster/gerbil/skills";

const description = await describeImage({
  image: "https://example.com/photo.jpg",
  focus: "details", // "general" | "details" | "text" | "objects" | "scene"
  format: "bullets", // "paragraph" | "bullets" | "structured"
});
```

### analyze-screenshot
Analyze UI screenshots for feedback.
```ts
import { analyzeScreenshot } from "@tryhamster/gerbil/skills";

const analysis = await analyzeScreenshot({
  image: screenshotDataUri,
  type: "accessibility", // "ui-review" | "accessibility" | "suggestions" | "qa"
});
// → "Missing alt text on hero image. Color contrast ratio is 3.2:1..."
```

### extract-from-image
Extract text, code, or data from images.
```ts
import { extractFromImage } from "@tryhamster/gerbil/skills";

const extracted = await extractFromImage({
  image: documentPhoto,
  extract: "text", // "text" | "data" | "code" | "table" | "diagram"
  outputFormat: "markdown", // "raw" | "json" | "markdown"
});
```

### compare-images
Compare two images and describe differences.
```ts
import { compareImages } from "@tryhamster/gerbil/skills";

const comparison = await compareImages({
  image1: beforeScreenshot,
  image2: afterScreenshot,
  focus: "differences", // "differences" | "similarities" | "detailed"
});
// → "The header color changed from blue to green. A new banner..."
```

### caption-image
Generate alt text or captions for images.
```ts
import { captionImage } from "@tryhamster/gerbil/skills";

const caption = await captionImage({
  image: photo,
  style: "descriptive", // "concise" | "descriptive" | "creative" | "funny"
});
// → "A golden sunset paints the sky in warm oranges and purples..."
```

## Performance Tips
### WebGPU Acceleration
Vision models benefit significantly from GPU acceleration:
```ts
// Node.js: Uses Chrome backend for WebGPU
await g.loadModel("ministral-3b"); // Auto-detects WebGPU

// Browser: Native WebGPU
await g.loadModel("ministral-3b", { device: "webgpu" });

// Check current device mode
console.log(g.getDeviceMode()); // "webgpu" | "cpu"
```

### Image Size
- Larger images take longer to process
- Consider resizing to between 512×512 and 1024×1024 for optimal performance (see the sketch after this list)
- Models cache in IndexedDB (browser) after first download
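In the browser, one way to stay in that range is to downscale before attaching the image; a minimal sketch using a canvas (the 1024 px cap and JPEG quality are assumptions, not Gerbil requirements):

```ts
// Downscale an image file so its longest side is at most maxDim, return a data URI
async function resizeToDataUri(file: File, maxDim = 1024): Promise<string> {
  const bitmap = await createImageBitmap(file);
  const scale = Math.min(1, maxDim / Math.max(bitmap.width, bitmap.height));

  const canvas = document.createElement("canvas");
  canvas.width = Math.round(bitmap.width * scale);
  canvas.height = Math.round(bitmap.height * scale);
  canvas.getContext("2d")!.drawImage(bitmap, 0, 0, canvas.width, canvas.height);

  return canvas.toDataURL("image/jpeg", 0.9);
}

// Usage: resize first, then pass the data URI as the image source
const source = await resizeToDataUri(selectedFile);
const result = await g.generate("Describe this image", { images: [{ source }] });
```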
### Expected Performance
| Metric | Value |
|---|---|
| Vision model load time | ~2s (cached) |
| Image processing | ~0.5s |
| Generation speed | 70-100+ tok/s (WebGPU) |
| Memory usage | ~4GB (model + KV cache) |
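To sanity-check these numbers on your own hardware, you can time the calls directly; a rough sketch that measures wall-clock time only (the image URL is a placeholder):

```ts
const t0 = performance.now();
await g.loadModel("ministral-3b");
console.log(`model load: ${((performance.now() - t0) / 1000).toFixed(1)}s`);

const t1 = performance.now();
const result = await g.generate("Describe this image", {
  images: [{ source: "https://example.com/photo.jpg" }],
});
console.log(`generate: ${((performance.now() - t1) / 1000).toFixed(1)}s`);
```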
## Troubleshooting
"Model doesn't support vision"
Make sure you're using a vision-capable model like ministral-3b.
### Slow image processing
- Ensure WebGPU is being used: `g.getDeviceMode()`
- Resize large images before sending
- In Node.js, the Chrome backend provides GPU acceleration
### Image not loading
- Check the URL is accessible (CORS may block some URLs)
- For local files, ensure the path is absolute
- Base64 data URIs must include the mime type prefix: `data:image/png;base64,...` (see the sketch after this list)
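If you build the data URI yourself in Node.js, keep that prefix intact; a minimal sketch (the file path and `image/png` mime type are placeholders):

```ts
import { readFile } from "node:fs/promises";

// Read the file and prepend the data URI prefix with the correct mime type
const buf = await readFile("/absolute/path/to/image.png");
const dataUri = `data:image/png;base64,${buf.toString("base64")}`;

const result = await g.generate("What's in this image?", {
  images: [{ source: dataUri }],
});
```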
## API Reference
### ImageInput
```ts
interface ImageInput {
  /** Image source: URL, base64 data URI, or local file path */
  source: string;
  /** Optional alt text for context */
  alt?: string;
}
```

### GenerateOptions (with images)
```ts
interface GenerateOptions {
  // ... standard options ...
  /** Images to include (only used if model supports vision) */
  images?: ImageInput[];
}
```

### supportsVision()
```ts
// Returns true if the loaded model supports vision input
g.supportsVision(): boolean
```

### ModelConfig (vision fields)
```ts
interface ModelConfig {
  // ... standard properties ...
  /** Whether model supports vision/image input */
  supportsVision?: boolean;
  /** Size of vision encoder (if applicable) */
  visionEncoderSize?: string;
}
```