Caching
Optimize your Neurons’ performance and lower your AI costs with caching settings to store and reuse responses for similar requests.
Prompteus caching can significantly improve your Neurons’ performance and lower your AI provider API costs by storing and reusing responses. This guide explains how to configure and use caching effectively.
By caching responses, you can:
- Reduce costs by avoiding unnecessary AI provider API calls
- Improve response times by returning cached results instantly
- Maintain consistency for identical or similar queries
- Scale your application more efficiently by reducing API usage
Caching is available for all Neurons and is free to use. However, the cache is scoped to the current deployment of the Neuron: if you deploy a new version of the Neuron with the same slug, the cache is reset.
Caching Settings
The Caching section of the Neuron Settings page allows you to control how responses are cached and reused.
Exact Caching
Exact Caching settings in the Neuron Settings page
Exact caching compares the input of a request with previously processed and cached inputs, and returns the cached response on an exact match. This is useful for scenarios where you expect identical inputs to produce identical outputs.
When enabled, if a request is made with exactly the same input as a previous request, the cached response will be returned immediately without making any API calls to your AI providers. This means you won’t be charged for repeated identical requests to your AI providers.
Exact caching is particularly useful for deterministic operations or when you want to ensure consistent responses for identical inputs. It’s also the fastest and most efficient caching strategy as it completely eliminates API calls for repeated requests.
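Conceptually, exact caching behaves like a hash-keyed lookup: the input is serialized, hashed, and used as a cache key. The sketch below is illustrative only, not Prompteus’s actual implementation; the class and method names are assumptions:

```python
import hashlib
import json

class ExactCache:
    """Illustrative exact cache: a response is reused only when the
    serialized input matches a previously cached input exactly."""

    def __init__(self):
        self._store = {}

    def _key(self, input_data: dict) -> str:
        # Canonical JSON serialization so field order doesn't matter
        canonical = json.dumps(input_data, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, input_data: dict):
        # Returns the cached response, or None on a cache miss
        return self._store.get(self._key(input_data))

    def put(self, input_data: dict, response: str):
        self._store[self._key(input_data)] = response

cache = ExactCache()
cache.put({"question": "What is caching?"}, "Caching stores responses...")

# Identical input -> cache hit, no provider call needed
print(cache.get({"question": "What is caching?"}))
# Even a one-character difference -> miss, a fresh provider call is made
print(cache.get({"question": "What is caching"}))  # None
```

Note that under exact matching, any variation in the input, however small, produces a cache miss; that is what semantic caching (below) addresses.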
Semantic Caching
Semantic Caching settings in the Neuron Settings page
Semantic caching takes caching a step further by using AI embeddings to identify similar inputs. Prompteus generates embeddings from your inputs and compares them to previously cached embeddings. If the best match meets the similarity threshold you define, Prompteus returns that cached response.
This feature uses Prompteus-hosted models to generate embeddings and will not consume your provider API credentials. This means you can benefit from semantic matching without incurring additional costs from your AI provider, as the embedding generation and comparison are handled by Prompteus.
Semantic Caching Similarity Threshold
The similarity threshold determines how similar an input must be to a cached input to trigger a cache hit. The threshold is expressed as a percentage, where:
- A higher percentage (e.g., 99%) requires inputs to be nearly identical for a cache hit
- A lower percentage (e.g., 50%) allows for more variation between inputs while still returning cached results
We recommend starting with a high similarity threshold and gradually lowering it based on your specific use case and the observed quality of cached responses.
While caching can significantly improve performance and reduce costs, it’s important to carefully consider the similarity threshold when using semantic caching. Too low a threshold might return responses that don’t accurately match the user’s intent.
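To make the threshold concrete, here is a minimal sketch of threshold-based semantic matching using cosine similarity over embedding vectors. The three-dimensional “embeddings” below are toy values for illustration; in practice Prompteus generates real embeddings with its hosted models, and the function names here are assumptions:

```python
import math

def cosine_similarity(a, b):
    # Similarity of two vectors, ranging from -1 to 1 (1 = identical direction)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def semantic_lookup(query_emb, cache, threshold):
    """Return the cached response most similar to the query embedding,
    but only if that similarity meets the threshold (0.0-1.0)."""
    best_score, best_response = 0.0, None
    for cached_emb, response in cache:
        score = cosine_similarity(query_emb, cached_emb)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None

# Toy cache: embedding vector -> previously generated response
cache = [([1.0, 0.0, 0.0], "response A"),
         ([0.0, 1.0, 0.0], "response B")]

# A query close to the first cached embedding hits at a 90% threshold...
print(semantic_lookup([0.98, 0.1, 0.0], cache, 0.90))  # response A
# ...but the same query misses at a 99.9% threshold
print(semantic_lookup([0.98, 0.1, 0.0], cache, 0.999))  # None
```

As the example shows, raising the threshold trades cache hit rate for match precision, which is why starting high and lowering gradually is the safer direction.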
Cache Duration
Cached responses are stored for a limited time to ensure freshness of data. The exact duration may vary based on your subscription plan and usage patterns.
You can bypass the cache for individual requests using the bypassCache parameter when calling the API or SDK. This is useful for testing or when you need to force a fresh response:
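As a sketch, a cache-bypassing request might be built like this. The endpoint URL and payload shape below are placeholders, not the real Prompteus API; consult the API reference for the actual URL, authentication, and request format:

```python
import json
import urllib.request

# Hypothetical endpoint and payload shape -- only the bypassCache
# parameter name comes from the documentation above.
def build_neuron_request(slug: str, input_data: dict,
                         bypass_cache: bool = False) -> urllib.request.Request:
    payload = {
        "input": input_data,
        "bypassCache": bypass_cache,  # True forces a fresh provider call
    }
    return urllib.request.Request(
        f"https://api.example.com/neurons/{slug}",  # placeholder URL
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Build (without sending) a cache-bypassing request for testing
req = build_neuron_request("my-neuron", {"question": "What is caching?"},
                           bypass_cache=True)
print(json.loads(req.data)["bypassCache"])  # True
```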
Bypassing the cache does not purge it for the Neuron; it only skips the cache for that specific request. If you need to purge the cache for the Neuron, deploy a new version of the Neuron.
Best Practices
Here are some recommendations for effective use of caching:
- Enable exact caching for deterministic operations where identical inputs should always produce identical outputs
- Use semantic caching when slight variations in input should still return the same or similar responses
- Adjust the similarity threshold based on your specific use case:
  - Higher thresholds for tasks requiring precise matches
  - Lower thresholds for more general queries where approximate matches are acceptable
- Monitor cache hit rates and response quality through execution logs to fine-tune your caching settings
- For cost optimization, analyze your most frequent queries and adjust caching settings to maximize cache hits for these high-volume requests
Upcoming Features
We are working on adding the ability for the cache to be namespaced not only by Neuron version, but also by user. This will allow you to cache different responses for different users of your application. Contact us if you are interested in this feature.