
Implementing Dynamic Context Pruning for Long-Context LLM Agents

Published Mar 26, 2026 · 5 min read
LLM Engineering · Context Management · Inference Optimization · TypeScript · AI Agents
Learn how to optimize LLM inference costs and latency by implementing dynamic context pruning using semantic relevance and token-importance scoring.

As of March 2026, the industry has largely moved past the 'context window wars.' With models like Gemini 2.0 and GPT-5 variants supporting millions of tokens, the bottleneck is no longer capacity, but the economic and performance cost of processing massive prompts. For engineers building autonomous agents, the challenge is managing 'context drift' and the quadratic cost scaling of attention mechanisms.

This post explores a production-ready pattern for dynamic context pruning: a technique to programmatically strip irrelevant information from an agent's context window before it hits the inference endpoint, maintaining high reasoning quality while slashing latency.

The Problem: 'Lost in the Middle' and Token Bloat

Even with 1M+ token windows, LLMs still suffer from the 'lost in the middle' phenomenon: retrieval accuracy drops for information buried deep inside a long prompt. Furthermore, sending 100k tokens on every turn of a multi-turn conversation is financially unsustainable for most B2B applications.

Standard FIFO (First-In, First-Out) buffer clearing is too blunt. It often deletes the very system instructions or early-stage user constraints that define the agent's behavior. We need a more surgical approach.

Architectural Pattern: The Semantic Pruning Pipeline

Instead of passing a raw array of messages to your LLM provider, implement a middleware layer that scores and filters context based on three signals: Recency, Importance, and Relevance (RIR).

1. Recency Scoring

Recency is the simplest metric. The last 3-5 messages are almost always critical for immediate coherence. We assign a decaying weight to older messages.
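
A minimal sketch of that decay, assuming each message carries a turn index; the half-life of 4 turns is an illustrative tuning knob, not a canonical value.

// Exponential decay: the latest message scores 1.0, older messages decay toward 0.
function recencyScore(turnIndex: number, latestTurn: number, halfLife = 4): number {
  const age = latestTurn - turnIndex;
  return Math.pow(0.5, age / halfLife);
}

// Example: with halfLife = 4, a message 4 turns old scores 0.5, and 8 turns old scores 0.25.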

2. Importance Scoring (The 'Anchor' Tokens)

Certain messages contain 'anchors'—system prompts, schema definitions, or explicit user constraints (e.g., "Always output JSON"). These should be flagged with a PROTECTED status in your database or state manager.
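
As a sketch, anchors can be flagged at write time so the pruner never considers them; asProtected is a hypothetical helper that mirrors the Message shape used in the implementation below.

// Mark anchor messages (system prompt, schema, hard constraints) as protected at write time.
function asProtected(role: 'system' | 'user' | 'assistant', content: string) {
  return {
    role,
    content,
    timestamp: Date.now(),
    metadata: { isProtected: true },
  };
}

const systemPrompt = asProtected('system', 'You are a billing agent. Always output JSON.');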

3. Semantic Relevance (The Pruning Engine)

This is where we use a small, fast embedding model (run locally via Hugging Face Transformers.js, or hosted, such as OpenAI's text-embedding-3-small) to compare the current user query against the historical context. If a historical message has a cosine similarity below a specific threshold (e.g., 0.65), it is a candidate for pruning.
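
The similarity check itself is small enough to inline. Here is one possible cosineSimilarity, matching the './math-utils' import used in the implementation below; the 0.65 threshold is a starting point to tune, not a universal constant.

// Cosine similarity between two embedding vectors of equal length.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A historical message scoring below ~0.65 against the current query becomes a pruning candidate.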

Implementation in TypeScript

Below is a simplified implementation of a context manager that prunes based on semantic relevance while protecting system instructions.

import { cosineSimilarity } from './math-utils';
import { getEmbeddings } from './embedding-service';

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
  timestamp: number; // epoch ms; used to restore chronological order after pruning
  metadata: {
    isProtected: boolean;
    embedding?: number[];
  };
}

// Rough heuristic (~4 characters per token); swap in a real tokenizer for production use.
function estimateTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

async function pruneContext(
  history: Message[],
  currentQuery: string,
  maxTokens: number = 4000
): Promise<Message[]> {
  const queryEmbedding = await getEmbeddings(currentQuery);
  
  // 1. Always keep protected messages (System prompts, etc.)
  const protectedMessages = history.filter(m => m.metadata.isProtected);
  
  // 2. Score unprotected messages by semantic similarity to the current query
  const scoredHistory = await Promise.all(
    history
      .filter(m => !m.metadata.isProtected)
      .map(async (msg) => {
        const msgEmbedding = msg.metadata.embedding ?? await getEmbeddings(msg.content);
        const score = cosineSimilarity(queryEmbedding, msgEmbedding);
        return { ...msg, score };
      })
  );

  // 3. Sort by score and fill remaining token budget
  const sortedHistory = scoredHistory.sort((a, b) => b.score - a.score);
  
  let currentTokenCount = estimateTokens(protectedMessages);
  const finalContext: Message[] = [...protectedMessages];

  for (const msg of sortedHistory) {
    const msgTokens = estimateTokens([msg]);
    if (currentTokenCount + msgTokens <= maxTokens) {
      finalContext.push(msg);
      currentTokenCount += msgTokens;
    }
  }

  // 4. Re-sort by original chronological order to maintain flow
  return finalContext.sort((a, b) => a.timestamp - b.timestamp);
}
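
Wiring this into a request path might look like the following; conversationHistory, userQuery, and llmClient are placeholders for your own state store and provider SDK.

// Before each model call, prune the stored history against the incoming query.
const prunedHistory = await pruneContext(conversationHistory, userQuery, 8000);

const reply = await llmClient.chat({
  messages: [...prunedHistory, { role: 'user', content: userQuery }],
});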

Advanced Strategy: Summarization Compression

Pruning isn't just about deleting; it's about condensing. For messages that fall into the 'medium relevance' bucket (similarity 0.4 - 0.6), instead of deleting them, we can pass them through a 'Summarizer Agent.'

This agent uses a cheaper model (like Llama 3.1 8B) to turn a 500-word transcript into a 50-word bulleted summary. This preserves the 'essence' of the conversation without the token overhead. This is particularly effective for RAG (Retrieval-Augmented Generation) pipelines where the retrieved chunks are verbose.
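
A sketch of that middle bucket, reusing the Message type from the implementation above; summarize is a placeholder for a call to the cheaper model, and the 0.4-0.6 band is illustrative.

// `summarize` stands in for a call to a cheap model (e.g. an 8B-class endpoint).
declare function summarize(prompt: string): Promise<string>;

// Messages in the medium-relevance band are compressed rather than dropped.
async function compressIfMediumRelevance(msg: Message, score: number): Promise<Message> {
  if (score >= 0.4 && score <= 0.6) {
    const summary = await summarize(
      `Summarize in 50 words or fewer, keeping facts and constraints:\n${msg.content}`
    );
    return { ...msg, content: summary };
  }
  return msg; // high relevance stays verbatim; low relevance is handled by pruning
}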

Tradeoffs and Considerations

The Cold Start Problem

If you prune too aggressively, the model might lose the 'vibe' or persona established in earlier turns. To mitigate this, always keep a 'rolling window' of the last 2 assistant responses regardless of their semantic score. This ensures the model maintains its immediate conversational style.
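
One way to enforce that rolling window is to pin the last two assistant turns before scoring, as in this sketch (again reusing the Message type from the implementation above).

// Pin the most recent assistant turns so they bypass semantic scoring entirely.
function pinRecentAssistantTurns(history: Message[], count = 2): Message[] {
  const recent = new Set(history.filter(m => m.role === 'assistant').slice(-count));
  return history.map(m =>
    recent.has(m) ? { ...m, metadata: { ...m.metadata, isProtected: true } } : m
  );
}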

Latency Overhead

Running an embedding call and a similarity search adds latency. However, in 2026, local embedding models running at the edge (via ONNX Runtime) can process a message in <10ms. This is a negligible trade-off compared to the 2-3 seconds saved by reducing the prompt size by 50k tokens.

Evaluation (LLM-as-a-Judge)

How do you know if your pruning is too aggressive? Use an evaluation framework like DeepEval or Ragas. Run your agent through a 'Golden Dataset' of complex queries with and without pruning. Compare the 'Faithfulness' and 'Answer Relevancy' scores. If the scores diverge by more than 5%, your similarity threshold is likely too high and you are pruning context the model actually needs.
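
As a sketch, the regression gate can be a simple score comparison across two runs of the golden dataset; evaluateAgent is a placeholder for whatever DeepEval or Ragas harness you already run, with scores normalized to 0-1.

// `evaluateAgent` stands in for your existing evaluation harness.
declare function evaluateAgent(opts: { pruning: boolean }): Promise<{ faithfulness: number; answerRelevancy: number }>;

async function pruningRegressionCheck(): Promise<boolean> {
  const baseline = await evaluateAgent({ pruning: false });
  const pruned = await evaluateAgent({ pruning: true });
  const worstDrop = Math.max(
    baseline.faithfulness - pruned.faithfulness,
    baseline.answerRelevancy - pruned.answerRelevancy
  );
  return worstDrop <= 0.05; // a drop beyond 0.05 (5%) suggests the threshold is cutting needed context
}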

Conclusion

Context management is the new 'memory management' for the AI era. As we build more complex agents, we cannot rely on the brute force of large context windows. By implementing a semantic pruning layer, you ensure your agents remain fast, cost-effective, and focused on the task at hand. The goal is not to give the model all the data, but the right data at the right time.