Production Observability

Built-in telemetry hooks and request queuing for production deployments.

Integrates with Sentry · DataDog · Any metrics system

Telemetry Hooks

Configure telemetry hooks to integrate with Sentry, DataDog, or any monitoring system:

telemetry.ts
import { Gerbil } from "@tryhamster/gerbil";
import * as Sentry from "@sentry/node";

const g = new Gerbil({
  telemetry: {
    // Called on any error (model load, generation, etc.)
    onError: (error, context) => {
      Sentry.captureException(error, {
        extra: context,
        tags: { operation: context.operation },
      });
    },

    // Called after successful generation
    onGenerate: (event) => {
      console.log(`Generated ${event.result.tokensGenerated} tokens`);
      // Track in your metrics system (`metrics` here stands in for your own client)
      metrics.histogram("gerbil.tokens_generated", event.result.tokensGenerated);
      metrics.histogram("gerbil.tokens_per_second", event.result.tokensPerSecond);
    },

    // Called after model loading (success or failure)
    onModelLoad: (event) => {
      if (event.success) {
        console.log(`Loaded ${event.modelId} in ${event.loadTimeMs}ms on ${event.device}`);
      } else {
        console.error(`Failed to load ${event.modelId}: ${event.error}`);
      }
    },

    // Called when requests wait in queue (>100ms)
    onQueueWait: (waitTimeMs) => {
      metrics.histogram("gerbil.queue_wait_ms", waitTimeMs);
    },
  },
});

onError(error, context)

Called whenever an error occurs during Gerbil operations.

types.ts
type ErrorContext = {
  operation: "generate" | "load" | "embed" | "speak" | "transcribe" | "json";
  modelId?: string;
  extra?: Record<string, unknown>;
};
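
For example, you might treat model-load failures as more severe than one-off generation errors. A minimal sketch; the severity mapping is an illustrative policy, not something Gerbil prescribes:

error-severity.ts
import { Gerbil } from "@tryhamster/gerbil";
import * as Sentry from "@sentry/node";

const g = new Gerbil({
  telemetry: {
    onError: (error, context) => {
      Sentry.captureException(error, {
        // A failed model load usually means the instance is unusable; a single
        // failed generation usually isn't. Adjust to your own policy.
        level: context.operation === "load" ? "fatal" : "error",
        tags: { operation: context.operation, modelId: context.modelId ?? "unknown" },
        extra: context.extra,
      });
    },
  },
});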

onGenerate(event)

Called after successful text generation.

types.ts
type GenerateEvent = {
  modelId: string;
  result: GenerateResult;
  cached: boolean;
  queueTimeMs?: number; // Only present if the request waited >100ms
};
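
The cached flag makes it easy to watch how often responses are served from cache. A minimal sketch with in-process counters; swap them for your own metrics client:

cache-hit-rate.ts
import { Gerbil } from "@tryhamster/gerbil";

let hits = 0;
let total = 0;

const g = new Gerbil({
  telemetry: {
    onGenerate: (event) => {
      total += 1;
      if (event.cached) hits += 1;
      // Log the running hit rate every 100 generations.
      if (total % 100 === 0) {
        console.log(`Cache hit rate: ${((hits / total) * 100).toFixed(1)}%`);
      }
    },
  },
});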

onModelLoad(event)

Called after model loading completes (success or failure).

types.ts
type ModelLoadEvent = {
  modelId: string;
  loadTimeMs: number;
  fromCache: boolean;
  device: "webgpu" | "cpu" | "wasm";
  success: boolean;
  error?: string;
};
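
The fromCache field is useful for separating cold starts (download and compile) from warm loads, which usually differ by orders of magnitude. A sketch that reports them under different metric names; the names and the console.log stand-in are illustrative:

model-load-metrics.ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil({
  telemetry: {
    onModelLoad: (event) => {
      if (!event.success) return;
      // Report cold starts and cached loads separately so one doesn't hide the other.
      const metric = event.fromCache
        ? "gerbil.model_load_ms.cached"
        : "gerbil.model_load_ms.cold";
      console.log(`${metric}=${event.loadTimeMs} device=${event.device}`);
    },
  },
});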

onQueueWait(waitTimeMs)

Called when a request waits in the queue for more than 100ms. Useful for detecting congestion.
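
A single wait above 100ms is normal under bursty traffic; sustained waits are the better alert signal. A sketch that warns only after several consecutive slow waits; the thresholds are arbitrary:

queue-congestion.ts
import { Gerbil } from "@tryhamster/gerbil";

let consecutiveSlowWaits = 0;

const g = new Gerbil({
  telemetry: {
    onQueueWait: (waitTimeMs) => {
      // Count back-to-back waits over 2 seconds; reset on any fast request.
      consecutiveSlowWaits = waitTimeMs > 2_000 ? consecutiveSlowWaits + 1 : 0;
      if (consecutiveSlowWaits >= 5) {
        console.warn(`Queue congested: ${consecutiveSlowWaits} requests in a row waited >2s`);
      }
    },
  },
});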

Request Queue

Gerbil uses a request queue to prevent GPU OOM errors under concurrent load. LLM inference can only run one request at a time on the GPU.

Default Behavior

  • Concurrency: 1 (single request at a time)
  • Timeout: 5 minutes (300,000ms)
  • Requests are processed in FIFO order
  • A timeout error is thrown if a request exceeds the timeout (see the sketch after this list)
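
In practice you can fire requests concurrently and let the queue serialize them; only the timeout needs handling. A sketch, assuming a g.generate() method (not shown in this section) and that a timed-out request rejects like any other failed promise:

queued-requests.ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil({
  concurrency: { maxConcurrent: 1, timeout: 30_000 },
});

// Three requests fired at once; the queue runs them one at a time, in FIFO order.
const prompts = ["Summarize report A", "Summarize report B", "Summarize report C"];
const results = await Promise.allSettled(prompts.map((p) => g.generate(p)));

results.forEach((r, i) => {
  if (r.status === "rejected") {
    // Requests that wait or run past the 30s timeout land here.
    console.error(`Request ${i} failed (possibly timed out):`, r.reason);
  }
});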

Custom Configuration

config.ts
const g = new Gerbil({
  concurrency: {
    maxConcurrent: 1, // Max parallel requests (default: 1)
    timeout: 300_000, // Request timeout in ms (default: 5 min)
  },
});

Why Queue?

LLM inference on GPU is:

  1. Memory-bound: Models consume most of GPU VRAM
  2. Non-concurrent: Running multiple inferences simultaneously causes OOM
  3. Variable duration: Generation time depends on output length

The queue ensures predictable memory usage, no OOM crashes under load, and fair request ordering.

Rate Limiting

Gerbil does not include rate limiting. This is intentional—rate limiting is best handled at the application layer using middleware specific to your framework:

rate-limiting.ts
// Express
import rateLimit from "express-rate-limit";
import { gerbil } from "@tryhamster/gerbil/express";

app.use("/ai", rateLimit({ windowMs: 60000, max: 10 }));
app.use("/ai", gerbil());

// Next.js
import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, "60s"),
});

export async function POST(req: Request) {
  const ip = req.headers.get("x-forwarded-for") ?? "anonymous";
  const { success } = await ratelimit.limit(ip);
  if (!success) return Response.json({ error: "Rate limited" }, { status: 429 });

  // Continue with Gerbil...
}

Full Production Setup

production.ts
import { Gerbil } from "@tryhamster/gerbil";
import * as Sentry from "@sentry/node";

Sentry.init({ dsn: process.env.SENTRY_DSN });

const g = new Gerbil({
  model: "qwen3-0.6b",

  telemetry: {
    onError: (error, context) => {
      Sentry.captureException(error, { extra: context });
    },

    onGenerate: ({ result, queueTimeMs }) => {
      // Log slow generations
      if (result.totalTime > 10000) {
        console.warn(`Slow generation: ${result.totalTime}ms`);
      }

      // Track queue congestion
      if (queueTimeMs && queueTimeMs > 5000) {
        Sentry.captureMessage("High queue wait time", {
          level: "warning",
          extra: { queueTimeMs },
        });
      }
    },

    onModelLoad: (event) => {
      if (!event.success) {
        Sentry.captureMessage(`Model load failed: ${event.error}`, {
          level: "error",
          extra: event,
        });
      }
    },
  },

  concurrency: {
    maxConcurrent: 1,
    timeout: 120_000, // 2 minute timeout
  },
});

// Preload model on startup
await g.loadModel();
console.log("Gerbil ready for production");

Health Checks

For production deployments, implement a health check endpoint:

health.ts
// Express
app.get("/health", async (req, res) => {
  try {
    const info = g.getInfo();
    res.json({
      status: "ok",
      model: info.model?.id,
      device: info.device.backend,
      ready: info.device.status === "ready",
    });
  } catch (error) {
    res.status(503).json({ status: "error", message: String(error) });
  }
});
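
If you deploy on Next.js instead of Express, the same check works as a route handler. A sketch reusing the getInfo() fields from the Express example; the import path for the shared Gerbil instance is illustrative:

app/health/route.ts
import { g } from "@/lib/gerbil"; // your shared Gerbil instance (path is illustrative)

export async function GET() {
  try {
    const info = g.getInfo();
    return Response.json({
      status: "ok",
      model: info.model?.id,
      device: info.device.backend,
      ready: info.device.status === "ready",
    });
  } catch (error) {
    return Response.json(
      { status: "error", message: String(error) },
      { status: 503 },
    );
  }
}

Point your load balancer or orchestrator's readiness probe at this endpoint so traffic only reaches instances whose model has finished loading.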
