
Optimizing RAG with Hybrid Search and Reciprocal Rank Fusion in 2026

Published Apr 16, 2026 · 5 min read
Tags: RAG, Vector Databases, Hybrid Search, LLM Engineering, PostgreSQL
Learn how to implement Hybrid Search and Reciprocal Rank Fusion (RRF) to solve semantic drift and keyword mismatch in production RAG systems using TypeScript and PostgreSQL.


By mid-2026, the initial excitement around Retrieval-Augmented Generation (RAG) has matured into a rigorous engineering discipline focused on retrieval precision. While dense vector embeddings (semantic search) are excellent at capturing conceptual relationships, they frequently fail on exact matches, acronyms, and specific product IDs—areas where traditional BM25 lexical search excels.

To build production-grade RAG systems today, relying on a single retrieval strategy is no longer sufficient. This post explores the implementation of Hybrid Search using Reciprocal Rank Fusion (RRF) to combine the strengths of vector and keyword search.

The Problem: Semantic Drift and Keyword Loss

Pure vector search often suffers from 'semantic drift.' For example, a query for 'AWS SDK v3' might return results for 'AWS SDK v2' because they are semantically similar, even though the version number is a critical filter. Conversely, keyword search fails when a user asks for 'cloud storage' but the document only mentions 'S3 buckets.'

Hybrid search solves this by running two parallel queries:

  1. Dense Retrieval: Using embeddings (e.g., OpenAI text-embedding-3-small or local HuggingFace models) to find conceptual matches.
  2. Sparse Retrieval: Using BM25 or Full-Text Search (FTS) to find exact token matches.
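
Since the two retrievals are independent, application code can issue them concurrently and fuse afterwards. A minimal sketch of that shape, with hypothetical stub retrievers standing in for real dense and sparse backends:

```typescript
type RankedHit = { id: string; rank: number };

// Stubs standing in for a real embedding search and a real BM25/FTS search.
async function denseRetrieve(query: string): Promise<RankedHit[]> {
  return [{ id: 'doc-a', rank: 1 }, { id: 'doc-b', rank: 2 }];
}
async function sparseRetrieve(query: string): Promise<RankedHit[]> {
  return [{ id: 'doc-b', rank: 1 }, { id: 'doc-c', rank: 2 }];
}

// Both searches run in parallel; fusion (RRF) happens on the combined results.
async function retrieveBoth(query: string): Promise<[RankedHit[], RankedHit[]]> {
  return Promise.all([denseRetrieve(query), sparseRetrieve(query)]);
}
```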

Architecture: PostgreSQL as a Unified Store

In 2026, many teams have consolidated their stack by using pgvector for vector storage alongside PostgreSQL's native Full-Text Search. This eliminates the operational overhead of syncing data between a primary database and a standalone vector store.

Schema Design

To support hybrid search, your table needs both a vector column and a tsvector column for lexical indexing.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  content text NOT NULL,
  embedding vector(1536), -- For OpenAI embeddings
  fts_tokens tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
);

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON documents USING gin (fts_tokens);
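
When inserting rows from TypeScript, the embedding must be serialized into pgvector's text literal form (e.g. '[0.1,0.2,0.3]') before casting with ::vector. A hypothetical helper; note that JSON.stringify on a plain numeric array happens to produce the same shape, which is why it is used as the cast parameter in the query later in this post:

```typescript
// Serializes a JS number array into pgvector's input literal, e.g. "[0.1,0.2,0.3]".
function toPgVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(',')}]`;
}
```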

Implementing Reciprocal Rank Fusion (RRF)

Once you have two sets of results, you cannot simply add their scores together. Vector scores are typically cosine similarities (0 to 1), while BM25 scores are unbounded. Reciprocal Rank Fusion (RRF) is the industry-standard algorithm for merging these disparate rankings without needing to normalize scores.

The RRF Formula

The score for a document $d$ is calculated as:
$RRFscore(d) = \sum_{r \in R} \frac{1}{k + rank(d, r)}$

Where $R$ is the set of rankers (Vector and FTS), $rank(d, r)$ is the 1-based position of document $d$ in ranker $r$, and $k$ is a smoothing constant (usually 60) that dampens the influence of the top-ranked items in any single list, so no one ranker can dominate the fused results.
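
The formula translates directly into a small fusion function over ranked ID lists. A sketch in plain TypeScript (list order encodes rank, 1-based; names are illustrative):

```typescript
// Fuses multiple ranked lists of document IDs with Reciprocal Rank Fusion.
// Each input list is ordered best-first; rank(d, r) is the 1-based position.
function rrfFuse(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      const rank = i + 1; // 1-based rank
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}

// 'b' appears near the top of both lists, so it accumulates the largest score,
// beating 'a', which is rank 1 in only one list.
const fused = rrfFuse([['a', 'b', 'c'], ['b', 'c', 'd']]);
const top = [...fused.entries()].sort((x, y) => y[1] - x[1])[0][0];
```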

TypeScript Implementation

Using Drizzle ORM or Prisma, you can execute a raw SQL query that performs both searches and merges them using RRF in a single database round-trip.

import { sql } from 'drizzle-orm';
import { db } from './db'; // your Drizzle database instance; adjust the path

async function hybridSearch(query: string, embedding: number[], limit = 5) {
  const k = 60;
  
  // We use a Common Table Expression (CTE) to get both rankings
  const results = await db.execute(sql`
    WITH vector_search AS (
      SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> ${JSON.stringify(embedding)}::vector) as rank
      FROM documents
      ORDER BY embedding <=> ${JSON.stringify(embedding)}::vector
      LIMIT 50
    ),
    fts_search AS (
      SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank(fts_tokens, plainto_tsquery('english', ${query})) DESC) as rank
      FROM documents
      WHERE fts_tokens @@ plainto_tsquery('english', ${query})
      ORDER BY ts_rank(fts_tokens, plainto_tsquery('english', ${query})) DESC
      LIMIT 50
    )
    SELECT 
      COALESCE(v.id, f.id) as id,
      (COALESCE(1.0 / (${k} + v.rank), 0.0) + 
       COALESCE(1.0 / (${k} + f.rank), 0.0)) as rrf_score
    FROM vector_search v
    FULL OUTER JOIN fts_search f ON v.id = f.id
    ORDER BY rrf_score DESC
    LIMIT ${limit};
  `);

  return results;
}

Tradeoffs and Tuning

1. The Constant K

Setting $k=60$ is the standard recommendation from the original RRF paper. However, if you find that your keyword search is significantly more reliable than your embeddings (common in highly technical domains), decreasing $k$ can make the top-ranked items more influential.
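
The effect of $k$ is easy to check numerically: with a smaller $k$, the gap between the contributions of rank 1 and rank 10 within a single list widens. A quick illustration (helper name is ours):

```typescript
// Contribution of one ranker to a document's RRF score at a given 1-based rank.
const rrfTerm = (rank: number, k: number): number => 1 / (k + rank);

// Ratio of the rank-1 contribution to the rank-10 contribution:
const ratioK60 = rrfTerm(1, 60) / rrfTerm(10, 60); // 70/61, a modest edge
const ratioK10 = rrfTerm(1, 10) / rrfTerm(10, 10); // 20/11, a much sharper edge
```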

2. Performance Latency

Running two searches is inherently more expensive than one. To mitigate this:

  • Use HNSW indexes for the vector component to ensure sub-millisecond retrieval.
  • Limit the depth of the initial rankings (e.g., take the top 50 from each) before applying RRF.
  • Use pg_trgm for fuzzy keyword matching if your users frequently make typos, though this is slower than standard FTS.

3. Re-ranking (The Second Stage)

For high-stakes RAG applications, RRF is often the first stage of a two-stage retrieval process. After retrieving the top 10-20 documents via RRF, you can pass them to a Cross-Encoder model (like BGE-Reranker) which performs a much more computationally expensive but accurate comparison between the query and the document text.
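
Structurally, the second stage reduces to re-sorting the RRF shortlist by a cross-encoder's relevance score. A sketch with the scoring function left pluggable; `scoreFn` below is a stand-in for a real model call (e.g. a wrapper around BGE-Reranker), not an actual API:

```typescript
type Candidate = { id: string; content: string };

// Re-ranks an RRF shortlist using a cross-encoder-style scorer that sees the
// query and the document text together. Higher score = more relevant.
async function rerank(
  query: string,
  candidates: Candidate[],
  scoreFn: (query: string, doc: string) => Promise<number>,
): Promise<Candidate[]> {
  const scored = await Promise.all(
    candidates.map(async (c) => ({ c, score: await scoreFn(query, c.content) })),
  );
  return scored.sort((a, b) => b.score - a.score).map((s) => s.c);
}
```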

Evaluation: How to Know It's Working

You cannot optimize what you don't measure. Use a framework like Ragas or Arize Phoenix to track metrics such as:

  • Faithfulness: Does the answer only use the retrieved context?
  • Answer Relevance: Does the answer actually address the user's prompt?
  • Context Precision: Is the ground-truth information ranked highly in your hybrid results?
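
Context precision can also be spot-checked without a framework: compare the top-k retrieved IDs against a hand-labeled relevant set. A minimal sketch (function name and inputs are illustrative):

```typescript
// Fraction of the top-k retrieved documents that appear in the ground-truth set.
function contextPrecisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = retrieved.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter((id) => relevant.has(id)).length;
  return hits / topK.length;
}
```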

In our testing, switching from pure vector search to Hybrid + RRF typically improves Context Recall by 15-20% in technical documentation use cases.

Conclusion

Hybrid search with RRF is the current gold standard for robust RAG retrieval. By combining the semantic power of embeddings with the precision of keyword search—and managing it all within a single PostgreSQL instance—engineers can build AI systems that are both conceptually intelligent and factually accurate. As we move further into 2026, the focus remains on this 'retrieval hygiene' as the primary lever for improving LLM performance.