Caching

Prompteus caching features can significantly improve your Neurons’ performance and lower your AI provider API costs by storing and reusing responses. This guide explains how to configure and use caching effectively.

By caching responses, you can:

- Reduce costs by avoiding unnecessary AI provider API calls
- Improve response times by returning cached results instantly
- Maintain consistency for identical or similar queries
- Scale your application more efficiently by reducing API usage

Caching is available for all Neurons and is free to use. However, the cache is scoped to the current deployment of a Neuron: if you deploy a new version of the Neuron with the same slug, the cache is reset.
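The deployment scoping described above can be pictured with a small sketch. It is illustrative only: the class name, the integer version numbers, and the key layout are assumptions, not a description of how Prompteus stores cached responses internally.

```python
# Illustrative only: Prompteus' internal cache layout is not documented here.
class DeploymentScopedCache:
    """Toy cache whose entries are tied to one deployment of a Neuron."""

    def __init__(self):
        # (slug, version, input_text) -> response
        self._store = {}

    def put(self, slug, version, input_text, response):
        self._store[(slug, version, input_text)] = response

    def get(self, slug, version, input_text):
        # Returns None when nothing was cached for this exact deployment.
        return self._store.get((slug, version, input_text))


scoped_cache = DeploymentScopedCache()
scoped_cache.put("summarize", 1, "hello", "cached answer")

same_deployment = scoped_cache.get("summarize", 1, "hello")  # hit
new_deployment = scoped_cache.get("summarize", 2, "hello")   # miss: same slug, new version
```

Because the version is part of the key, entries written by an earlier deployment are simply never found by a later one, which matches the "cache is reset" behavior from the caller's point of view.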

Caching Settings

The Caching section of the Neuron Settings page allows you to control how responses are cached and reused.

Exact Caching

[Figure: Exact Caching settings in the Neuron Settings page]

Exact caching allows Prompteus to compare the input data with previously processed and cached input data, and return a cache hit in case of an exact match. This is useful for scenarios where you expect identical inputs to produce identical outputs.

When enabled, if a request is made with exactly the same input as a previous request, the cached response will be returned immediately without making any API calls to your AI providers. This means you won’t be charged for repeated identical requests to your AI providers.

Exact caching is particularly useful for deterministic operations or when you want to ensure consistent responses for identical inputs. It’s also the fastest and most efficient caching strategy as it completely eliminates API calls for repeated requests.
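As a rough sketch of the exact-match idea, the example below hashes a canonical form of the input and only calls the provider on a miss. The helper names (`call_provider`, `run_with_exact_cache`) are hypothetical, and normalizing JSON key order before hashing is an assumption made for this sketch; how Prompteus decides that two inputs are "exactly the same" is not specified here.

```python
import hashlib
import json

_exact_cache = {}
provider_calls = 0


def call_provider(payload):
    """Stand-in for a real AI provider call (hypothetical)."""
    global provider_calls
    provider_calls += 1
    return f"response to {payload['input']}"


def cache_key(payload):
    # Assumption: canonical JSON, so key order does not affect the match.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_with_exact_cache(payload):
    """Return (response, was_cache_hit)."""
    key = cache_key(payload)
    if key in _exact_cache:
        return _exact_cache[key], True
    response = call_provider(payload)
    _exact_cache[key] = response
    return response, False


first, first_hit = run_with_exact_cache({"input": "What is caching?"})
second, second_hit = run_with_exact_cache({"input": "What is caching?"})
# A trailing space makes the input different, so this is a miss.
third, third_hit = run_with_exact_cache({"input": "What is caching? "})
```

Only the first and third requests reach the provider in this sketch. The third request shows the main limitation of exact matching: any difference in the input, however small, produces a miss.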

Semantic Caching

[Figure: Semantic Caching settings in the Neuron Settings page]

Semantic caching takes caching a step further by using AI embeddings to identify similar inputs. Prompteus generates embeddings from your inputs and compares them to previously cached embeddings. If the best match meets the similarity threshold you define, Prompteus returns its cached response.

This feature uses Prompteus-hosted models to generate embeddings and does not use your provider API credentials. This means you can benefit from semantic matching without incurring additional costs from your AI provider, as embedding generation and comparison are handled by Prompteus.
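The lookup described above can be sketched as follows. This is a toy under stated assumptions: the `embed` function is a simple word-count vector standing in for a learned embedding model, cosine similarity is assumed as the comparison metric, and none of the function names come from Prompteus.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would use a learned model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(query, cached, threshold):
    """Return (response, score) for the best match at or above threshold,
    or (None, best_score) when nothing is similar enough."""
    query_vec = embed(query)
    best_response, best_score = None, 0.0
    for cached_input, response in cached.items():
        score = cosine_similarity(query_vec, embed(cached_input))
        if score > best_score:
            best_response, best_score = response, score
    if best_score >= threshold:
        return best_response, best_score
    return None, best_score


cached_responses = {"how do i reset my password": "Use the reset link."}

hit_response, hit_score = semantic_lookup(
    "how do i reset my password please", cached_responses, threshold=0.9
)
miss_response, miss_score = semantic_lookup(
    "what is the weather", cached_responses, threshold=0.9
)
```

In this sketch the first query differs from the cached input by one word and scores about 0.93, so it is served from the cache; the second shares no words with it and falls through to a miss.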

Semantic Caching Similarity Threshold

The similarity threshold determines how similar an input must be to a cached input to trigger a cache hit. The threshold is expressed as a percentage, where:

- A higher percentage (e.g., 99%) requires inputs to be nearly identical for a cache hit
- A lower percentage (e.g., 50%) allows for more variation between inputs while still returning cached results

We recommend starting with a high similarity threshold and gradually lowering it based on your specific use case and the observed quality of cached responses.

While caching can significantly improve performance and reduce costs, it’s important to carefully consider the similarity threshold when using semantic caching. Too low a threshold might return responses that don’t accurately match the user’s intent.
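To make the trade-off concrete, the sketch below applies three thresholds to a handful of made-up similarity scores. The scores, the labels, and the inclusive comparison are all assumptions chosen for illustration; the way Prompteus maps the percentage onto its internal similarity metric is not documented here.

```python
def is_cache_hit(similarity, threshold_percent):
    """similarity is in [0, 1]; threshold_percent is in [0, 100]."""
    return similarity >= threshold_percent / 100.0


# Hypothetical similarity between a new input and its best cached match.
example_scores = {
    "near-duplicate": 0.995,
    "paraphrase": 0.87,
    "related topic": 0.62,
    "unrelated": 0.20,
}

hits_by_threshold = {
    threshold: [
        label
        for label, score in example_scores.items()
        if is_cache_hit(score, threshold)
    ]
    for threshold in (99, 80, 50)
}
```

At 99% only the near-duplicate is served from the cache. At 50% the "related topic" input would also receive a cached response, which is the kind of mismatch with user intent that a low threshold risks.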

Cache Duration

Cached responses are stored for a limited time to ensure freshness of data. The exact duration may vary based on your subscription plan and usage patterns.
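Time-limited storage of this kind is commonly modeled as a time-to-live (TTL) cache. The sketch below is a generic illustration, not the Prompteus implementation; the 60-second lifetime is an arbitrary value chosen for the example, since the actual duration is not stated.

```python
import time


class TTLCache:
    """Minimal cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}     # key -> (expires_at, value)

    def put(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value


fake_now = [0.0]
ttl_cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
ttl_cache.put("question", "answer")

fresh = ttl_cache.get("question")    # within the lifetime: hit
fake_now[0] = 61.0
expired = ttl_cache.get("question")  # past the lifetime: miss
```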

You can bypass the cache for individual requests using the bypassCache parameter when calling the API or SDK. This is useful for testing or when you need to force a fresh response:

https://run.prompteus.com/<org>/<neuron>?bypassCache=true

Bypassing the cache does not purge the Neuron’s cache; it only skips the cache for that specific request. To purge the cache for the Neuron, deploy a new version of the Neuron.
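A small helper for building the request URL with and without the bypassCache parameter might look like the sketch below. The function name and the example organization and Neuron slugs are hypothetical; only the URL shape and the `bypassCache=true` query parameter come from this guide. The sketch builds the URL only and does not send a request, since the HTTP method and request body are not covered here.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://run.prompteus.com"


def neuron_url(org, neuron, bypass_cache=False):
    """Build the run URL for a Neuron, optionally bypassing the cache."""
    # Percent-encode the path segments in case a slug contains special characters.
    url = f"{BASE_URL}/{quote(org, safe='')}/{quote(neuron, safe='')}"
    if bypass_cache:
        url += "?" + urlencode({"bypassCache": "true"})
    return url


# "my-org" and "my-neuron" are placeholder slugs.
cached_url = neuron_url("my-org", "my-neuron")
fresh_url = neuron_url("my-org", "my-neuron", bypass_cache=True)
```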

Best Practices

Here are some recommendations for effective use of caching:

- Enable exact caching for deterministic operations where identical inputs should always produce identical outputs
- Use semantic caching when slight variations in input should still return the same or similar responses
- Adjust the similarity threshold based on your specific use case:
  - Higher thresholds for tasks requiring precise matches
  - Lower thresholds for more general queries where approximate matches are acceptable
- Monitor cache hit rates and response quality through execution logs to fine-tune your caching settings
- For cost optimization, analyze your most frequent queries and adjust caching settings to maximize cache hits for these high-volume requests
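The monitoring and cost-optimization recommendations above can be sketched with a small analysis over exported log records. The record shape used here (an `input` field and a `cache` field with the values `exact`, `semantic`, or `miss`) is an assumption made for illustration; the actual format of Prompteus execution logs may differ.

```python
from collections import Counter

# Hypothetical log records; the field names and values are assumptions.
sample_logs = [
    {"input": "reset password", "cache": "exact"},
    {"input": "reset password", "cache": "exact"},
    {"input": "how to reset my password", "cache": "semantic"},
    {"input": "pricing", "cache": "miss"},
    {"input": "pricing", "cache": "miss"},
    {"input": "cancel plan", "cache": "miss"},
]


def hit_rate(records):
    """Fraction of requests served from the cache (exact or semantic)."""
    if not records:
        return 0.0
    hits = sum(1 for record in records if record["cache"] != "miss")
    return hits / len(records)


def top_missed_inputs(records, n=3):
    """Most frequent inputs that were not served from the cache."""
    misses = Counter(r["input"] for r in records if r["cache"] == "miss")
    return misses.most_common(n)


overall_rate = hit_rate(sample_logs)
worst_offenders = top_missed_inputs(sample_logs, n=1)
```

Inputs that appear repeatedly among the misses are the first candidates to investigate when tuning caching settings, because each repeat is a provider call that a cache hit could have avoided.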

Upcoming Features

We are working on adding the ability for the cache to be namespaced not only by Neuron version, but also by user. This will allow you to cache different responses for different users of your application. Contact us if you are interested in this feature.
