Optimizing LLM Context with Semantic Cache and Dynamic RAG Re-ranking

Published: May 21, 2026, 10:08 PM. Read time: 5 min.
Tags: LLM Engineering, RAG, Vector Databases, Semantic Cache, Inference Optimization
Reduce LLM latency and token costs by implementing a semantic caching layer and a cross-encoder re-ranking pipeline for high-precision RAG applications.

As of May 2026, the bottleneck in production AI applications has shifted from model availability to context efficiency and inference cost. While context windows have expanded to millions of tokens, the "lost in the middle" phenomenon persists, and the financial cost of processing massive prompts remains a significant hurdle for scaling. To build production-grade Retrieval-Augmented Generation (RAG) systems, engineers must move beyond simple cosine similarity searches and implement a multi-layered retrieval architecture.

This post explores two critical patterns for optimizing LLM context: Semantic Caching to eliminate redundant computations and Cross-Encoder Re-ranking to ensure only the most relevant data enters the context window.

The Problem: The High Cost of Redundant and Noisy Context

Standard RAG implementations often suffer from two primary inefficiencies:

  1. Redundant Inference: Users frequently ask semantically similar questions. Without a cache, every request triggers a full embedding search and a costly LLM generation.
  2. Low Precision Retrieval: Vector search over Bi-Encoder embeddings (such as OpenAI's text-embedding-3-small) is fast but lacks the nuance to distinguish between documents that share keywords but differ in intent.

Feeding 20 retrieved chunks into an LLM when only 3 are relevant increases latency, raises costs, and degrades the quality of the response.

Layer 1: Implementing a Semantic Cache

A semantic cache stores previous LLM responses and retrieves them based on the semantic similarity of new queries rather than exact string matches. On a cache hit, this can reduce latency from seconds to milliseconds because both the retrieval step and the LLM call are skipped.

Choosing a Similarity Threshold

The core challenge is the threshold. With a similarity threshold, setting it too high misses valid cache hits, while setting it too low serves irrelevant answers. With a distance metric the direction is reversed, so a smaller maximum distance is the stricter setting. In production, we typically use a metric such as Euclidean distance or cosine similarity within a vector store like Redis or Milvus. Redis is particularly effective here because it combines low-latency key-value access with vector indexing.

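import { createClient } from 'redis';

const redis = createClient();
await redis.connect();

// Returns the cached response for the nearest stored query, or null on a miss.
async function getSemanticCache(queryEmbedding: number[]) {
  // KNN 1 fetches the single nearest neighbor; its distance is exposed as `score`.
  const result = await redis.ft.search('idx:cache', '*=>[KNN 1 @vector $BLOB AS score]', {
    // Vector parameters are passed as raw float32 bytes
    PARAMS: { BLOB: Buffer.from(new Float32Array(queryEmbedding).buffer) },
    RETURN: ['response', 'score'],
    SORTBY: 'score',
    DIALECT: 2 // KNN vector queries require query dialect 2 or later
  });

  // `score` is a distance (lower is closer); accept only matches under 0.1 for high precision
  if (result.total > 0 && Number(result.documents[0].value.score) < 0.1) {
    return result.documents[0].value.response;
  }
  return null;
}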
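
To see how the distance threshold behaves without a running Redis instance, here is a minimal in-memory sketch of the same lookup. It assumes cosine distance as the metric. The `CacheEntry` type and the `cosineDistance` and `lookup` helpers are hypothetical names used only for illustration, not part of any library.

```typescript
// Hypothetical in-memory stand-in for the Redis index, for illustration only.
interface CacheEntry {
  embedding: number[];
  response: string;
}

// Cosine distance = 1 - cosine similarity; 0 means the vectors point the same way.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Embeddings must be non-empty and have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine distance is undefined for a zero vector');
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Linear scan for the nearest entry; serve it only if it is within maxDistance.
function lookup(entries: CacheEntry[], query: number[], maxDistance = 0.1): string | null {
  let best: CacheEntry | null = null;
  let bestDistance = Infinity;
  for (const entry of entries) {
    const d = cosineDistance(entry.embedding, query);
    if (d < bestDistance) {
      best = entry;
      bestDistance = d;
    }
  }
  return best !== null && bestDistance < maxDistance ? best.response : null;
}
```

Lowering `maxDistance` makes the cache stricter (fewer hits, fewer wrong answers), and raising it does the opposite. The right value depends on the embedding model and should be tuned against real query logs.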

Layer 2: Precision Retrieval with Cross-Encoder Re-ranking

When a cache miss occurs, we fall back to RAG. However, retrieving the top k documents from a vector database is often insufficient. Bi-encoders (used for indexing) map queries and documents independently into a vector space, which loses the interaction between the two.

To solve this, we introduce a Re-ranker (Cross-Encoder). A Cross-Encoder processes the query and a retrieved document simultaneously, producing a much more accurate relevance score. Since Cross-Encoders are computationally expensive, we only apply them to the top 25-50 results returned by the initial vector search.

The Re-ranking Pipeline

  1. Initial Retrieval: Fetch 50 candidates with a fast Bi-Encoder embedding search in a vector database (e.g., Pinecone or Weaviate).
  2. Re-ranking: Pass the query and the 50 candidates through a model like BGE-Reranker or Cohere Re-rank.
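  3. Context Selection: Take the top 5 re-ranked results for the LLM prompt.

async function rerankDocuments(query: string, documents: string[]) {
  // Using a hypothetical local inference endpoint or Cohere API.
  // Assumes the API key is in the COHERE_API_KEY environment variable (illustrative name).
  const response = await fetch('https://api.cohere.ai/v1/rerank', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.COHERE_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'rerank-english-v3.0',
      query: query,
      documents: documents,
      top_n: 5
    })
  });
  // Surface HTTP errors instead of parsing an error payload as results
  if (!response.ok) {
    throw new Error(`Rerank request failed with status ${response.status}`);
  }
  return await response.json();
}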
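
The context selection step can be sketched as a small pure function that maps re-ranker output back to document text. This assumes each result carries the document's position in the request array (`index`) and a `relevance_score` where higher means more relevant. Check that shape against your re-ranker's response format before relying on it. `RerankResult`, `selectContext`, and the `minScore` cutoff are hypothetical names introduced for this example.

```typescript
// Assumed shape of one re-ranker result (verify against your provider's docs).
interface RerankResult {
  index: number;           // position of the document in the request array
  relevance_score: number; // higher means more relevant
}

// Pick the topN highest-scoring documents, dropping anything below minScore
// and any result whose index does not point at a real document.
function selectContext(
  documents: string[],
  results: RerankResult[],
  topN = 5,
  minScore = 0
): string[] {
  return [...results]
    .filter((r) => r.index >= 0 && r.index < documents.length && r.relevance_score >= minScore)
    .sort((a, b) => b.relevance_score - a.relevance_score)
    .slice(0, topN)
    .map((r) => documents[r.index]);
}
```

A score floor like `minScore` is optional, but it lets the prompt carry fewer than `topN` chunks when the candidates are weak, which is often preferable to padding the context with noise.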

Architectural Tradeoffs

Implementing this architecture introduces complexity that must be balanced against performance gains.

Latency vs. Accuracy

Adding a re-ranking step adds roughly 100-300ms to the retrieval pipeline. However, by reducing the number of tokens sent to the LLM, you often recover this time during the generation phase. Smaller, more relevant contexts allow for faster Time-To-First-Token (TTFT) and shorter overall generation times.
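
A rough way to reason about this tradeoff is a back-of-the-envelope estimate. The sketch below assumes a deliberately simple model in which prompt processing (prefill) time grows linearly with prompt tokens. The function name `netTtftChangeMs` and every input value are hypothetical and chosen only to illustrate the arithmetic. They are not measurements.

```typescript
// Net change in time-to-first-token from adding a re-ranker, under a simple
// linear model: prefill time is proportional to the number of prompt tokens.
function netTtftChangeMs(
  rerankMs: number,
  tokensRemoved: number,
  prefillTokensPerSecond: number
): number {
  if (prefillTokensPerSecond <= 0) {
    throw new Error('prefillTokensPerSecond must be positive');
  }
  const prefillSavedMs = (tokensRemoved * 1000) / prefillTokensPerSecond;
  // A negative result means the pipeline got faster overall
  return rerankMs - prefillSavedMs;
}

// Illustrative inputs only: a 200 ms re-rank step, 15 chunks of 400 tokens
// dropped from the prompt (20 chunks down to 5), 10,000 prompt tokens/s prefill.
const change = netTtftChangeMs(200, 15 * 400, 10_000);
console.log(change); // -400 with these inputs
```

Real prefill throughput varies with model, hardware, and batching, so measure it for your own deployment before drawing conclusions from an estimate like this.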

Cache Invalidation

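Semantic caches are notoriously difficult to invalidate. In a traditional key-based cache, a change in the underlying data can evict the affected entry by key, or a TTL eventually expires it. In a semantic cache, one stored answer may be served for many differently worded queries, and nothing links it back to the source documents it was generated from, so a change in the underlying data doesn't automatically invalidate it. Engineers should implement a versioning system in the cache metadata or use a hybrid approach where the cache is bypassed if the underlying knowledge base has been updated within a certain window.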
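
One way to sketch both ideas together is a usability check that runs before a cached answer is served. This assumes each cache entry records the knowledge-base version it was generated from, and that the application tracks the current version and last update time. `VersionedEntry`, `KnowledgeBaseState`, and `isEntryUsable` are hypothetical names for this example.

```typescript
interface VersionedEntry {
  response: string;
  kbVersion: number; // knowledge-base version the answer was generated from
}

interface KnowledgeBaseState {
  version: number;       // current knowledge-base version
  lastUpdatedMs: number; // epoch milliseconds of the most recent update
}

// Decide whether a cached answer may be served right now.
function isEntryUsable(
  entry: VersionedEntry,
  kb: KnowledgeBaseState,
  nowMs: number,
  bypassWindowMs: number
): boolean {
  // Hybrid rule: bypass the cache while the knowledge base was updated recently
  if (nowMs - kb.lastUpdatedMs < bypassWindowMs) {
    return false;
  }
  // Versioning rule: only serve answers generated from the current version
  return entry.kbVersion === kb.version;
}
```

A single global version is the bluntest option, since any update invalidates every entry. Finer-grained schemes, such as per-collection versions, trade more bookkeeping for a higher hit rate.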

Evaluation: Measuring Success

You cannot optimize what you don't measure. For this architecture, track three specific metrics:

  1. Cache Hit Rate: The percentage of queries resolved by the semantic cache.
  2. Context Precision: The ratio of relevant chunks to total chunks in the prompt (measured via LLM-as-a-judge or RAGAS).
  3. Cost per Query: The total spend on embeddings, re-ranking, and LLM tokens.
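
The three metrics above can be computed from per-query logs with a few lines of code. This is a minimal sketch that assumes you already log, for each query, whether the cache was hit, how many prompt chunks a judge marked relevant, and the total spend. The `QueryLog` shape and `summarize` function are hypothetical.

```typescript
interface QueryLog {
  cacheHit: boolean;
  relevantChunks: number; // chunks judged relevant, e.g. by an LLM judge
  totalChunks: number;    // chunks placed in the prompt (0 on a cache hit)
  costUsd: number;        // embeddings + re-ranking + LLM tokens for this query
}

interface Summary {
  cacheHitRate: number;
  contextPrecision: number | null; // null when no chunks were sent at all
  costPerQuery: number;
}

function summarize(logs: QueryLog[]): Summary {
  if (logs.length === 0) {
    throw new Error('No queries logged');
  }
  let hits = 0;
  let relevant = 0;
  let total = 0;
  let cost = 0;
  for (const log of logs) {
    if (log.cacheHit) hits += 1;
    relevant += log.relevantChunks;
    total += log.totalChunks;
    cost += log.costUsd;
  }
  return {
    cacheHitRate: hits / logs.length,
    // Pooled across all queries rather than averaged per query
    contextPrecision: total === 0 ? null : relevant / total,
    costPerQuery: cost / logs.length,
  };
}
```

Context precision here is pooled over all chunks, so queries with larger prompts weigh more. Averaging per-query precision instead is equally valid, as long as the choice stays consistent across experiments.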

Conclusion

As we move into mid-2026, the "brute force" approach to RAG—shoving as much data as possible into a large context window—is being replaced by sophisticated retrieval pipelines. By implementing a semantic cache with Redis and a re-ranking layer with specialized Cross-Encoders, you can build AI systems that are not only more accurate but significantly more cost-effective and responsive. Focus on the quality of the context, and the LLM performance will follow.