Response Caching
Cache inference responses for repeated prompts to achieve instant results on subsequent calls.
Note: Response caching is different from KV cache (attention state cache). Response caching stores complete generation outputs while KV cache stores internal model states for conversation context.
Enable Response Caching
Pass `cache: true` to enable caching for a generation call:

```ts
import { Gerbil } from "@tryhamster/gerbil";

const g = new Gerbil();
await g.loadModel("qwen3-0.6b");

// First call: ~150ms (runs inference)
const result = await g.generate("What is 2+2?", { cache: true });
console.log(result.text);   // "4"
console.log(result.cached); // false

// Second call: ~0ms (returns from cache!)
const cached = await g.generate("What is 2+2?", { cache: true });
console.log(cached.text);   // "4"
console.log(cached.cached); // true
```

Custom TTL
By default, cached responses expire after 5 minutes. Customize the TTL with cacheTtl:
```ts
// Cache for 10 minutes
await g.generate("Explain quantum computing", {
  cache: true,
  cacheTtl: 10 * 60 * 1000, // milliseconds
});

// Cache for 1 hour
await g.generate("What's the capital of France?", {
  cache: true,
  cacheTtl: 60 * 60 * 1000,
});

// Cache for 30 seconds (useful for dynamic content)
await g.generate("Generate a random fact", {
  cache: true,
  cacheTtl: 30 * 1000,
});
```

Cache Statistics
Monitor cache performance with getResponseCacheStats():
```ts
const stats = g.getResponseCacheStats();
console.log(stats);
// {
//   hits: 5,      // Number of cache hits
//   misses: 3,    // Number of cache misses
//   size: 3,      // Number of cached entries
//   hitRate: 62.5 // Hit rate percentage
// }
```

Clear Response Cache
Clear all cached responses when needed:
```ts
// Clear all cached responses
g.clearResponseCache();

// Verify it's cleared
const stats = g.getResponseCacheStats();
console.log(stats.size); // 0
```

How Cache Keys Work
The cache key is a hash of the following parameters. Different values produce different cache entries:

- Prompt text
- Model ID
- `maxTokens`
- `temperature`
- `topP` and `topK`
- System prompt
- Thinking mode
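Conceptually, the key can be thought of as a stable hash over those parameters in a fixed order, so identical inputs always map to the same cache entry. The sketch below is illustrative only; the `cacheKey` helper and `KeyParams` shape are assumptions, not Gerbil's actual internals:

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of the options that participate in the key.
interface KeyParams {
  prompt: string;
  modelId?: string;
  maxTokens?: number;
  temperature?: number;
  topP?: number;
  topK?: number;
  systemPrompt?: string;
  thinking?: boolean;
}

// Serialize the parameters in a canonical order, then hash, so that
// the same inputs always produce the same key.
function cacheKey(p: KeyParams): string {
  const canonical = JSON.stringify([
    p.prompt,
    p.modelId ?? "",
    p.maxTokens ?? null,
    p.temperature ?? null,
    p.topP ?? null,
    p.topK ?? null,
    p.systemPrompt ?? "",
    p.thinking ?? false,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}
```

Because every listed parameter is part of the hash input, changing any one of them (for example, `temperature`) yields a different key and therefore a separate cache entry.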
```ts
// These are cached separately (different temperature)
await g.generate("Hello", { cache: true, temperature: 0.7 });
await g.generate("Hello", { cache: true, temperature: 0.3 });

// These share the same cache entry
await g.generate("Hello", { cache: true });
await g.generate("Hello", { cache: true }); // Cache hit!
```

Limitations
Response caching is not supported for:
- Streaming calls: when using the `onToken` callback or `stream()`
- Vision/image calls: when passing the `images` option
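If generation options pass through a shared helper, you can strip the cache flag whenever an unsupported mode is requested, so those calls don't silently behave differently. The `withSafeCaching` helper and `GenOptions` shape below are hypothetical, a sketch of one way to handle this:

```typescript
// Hypothetical subset of the generate options.
interface GenOptions {
  cache?: boolean;
  onToken?: (token: string) => void; // streaming callback
  images?: string[];                 // vision input
  [key: string]: unknown;
}

// Drop `cache` when the call uses streaming or images, since
// response caching is not supported for those modes.
function withSafeCaching(opts: GenOptions): GenOptions {
  if (opts.cache && (opts.onToken || opts.images?.length)) {
    const { cache: _dropped, ...rest } = opts;
    return rest;
  }
  return opts;
}
```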
KV Cache vs Response Cache
Gerbil has two caching mechanisms that serve different purposes:
| Feature | KV Cache | Response Cache |
|---|---|---|
| What's cached | Attention states | Full responses |
| Purpose | Conversation context | Repeated prompts |
| Clear method | `clearCache()` | `clearResponseCache()` |
| Default | Always on | Off (enable with `cache: true`) |
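To make the response-cache side of the comparison concrete, a cache of full responses with per-entry TTL can be pictured as a small map with expiry timestamps. The `ResponseCache` class below is a simplified sketch for intuition, not Gerbil's internals:

```typescript
interface Entry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

// A tiny key -> value cache where each entry expires after ttlMs.
class ResponseCache<T> {
  private entries = new Map<string, Entry<T>>();

  set(key: string, value: T, ttlMs: number): void {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlMs });
  }

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key); // lazily evict expired entries
      return undefined;
    }
    return entry.value;
  }

  clear(): void {
    this.entries.clear();
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Expired entries are evicted lazily on read here; a production cache might also sweep periodically to bound memory.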
Best Practices
- Use caching for deterministic prompts — factual questions, data extraction, and classification tasks benefit most
- Set a low TTL for creative content — if you want variation, use short TTLs or skip caching
- Monitor cache stats — track hit rates to see whether caching is effective for your use case
- Clear the cache when updating prompts — if you change system prompts or parameters, clear the cache to avoid stale responses
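The stats returned by `getResponseCacheStats()` can drive a simple effectiveness check. The `isCachingEffective` helper and its 30% threshold below are assumptions for illustration; tune the threshold per workload:

```typescript
// Shape matching the stats object shown above.
interface CacheStats {
  hits: number;
  misses: number;
  size: number;
  hitRate: number; // percentage
}

// Heuristic: caching pays off once a meaningful share of calls hit.
function isCachingEffective(stats: CacheStats, minHitRate = 30): boolean {
  const total = stats.hits + stats.misses;
  if (total === 0) return false; // no data yet
  return stats.hitRate >= minHitRate;
}
```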
Testing
Run the response caching test to verify everything works:

```bash
npx tsx examples/test-cache.ts
```