Retrieval-augmented generation (RAG) has become the standard way of grounding large language models (LLMs) in real-world knowledge. By pulling in relevant documents at query time, RAG reduces hallucinations and makes AI systems more useful across domains.
But adapting retrieval systems to new domains remains a costly, inefficient process. Fully fine-tuning a large model is compute-heavy and often infeasible on resource-constrained edge devices. Standard similarity search, usually implemented with dot-product lookups in a vector database, is static and suboptimal, struggling to capture the nuances of specialized corpora.
As AI shifts closer to customer data, running on devices at the edge and inside regulated environments, the challenge becomes clear: we need retrieval architectures that are lighter, faster, and privacy-preserving.
That’s the problem our new research tackles. In our newest paper, published on arXiv, we introduce a novel retrieval architecture that combines adapters for soft embeddings with a classifier-as-retriever approach. Together, these innovations dramatically improve accuracy while reducing training costs, and they integrate naturally with federated and privacy-preserving training strategies.
1. Adapters for soft embeddings
Instead of fully fine-tuning a large encoder, we start with a frozen small language model (SLM). Between the tokenizer and the transformer blocks of the SLM, we insert a lightweight adapter: a simple, trainable transformation matrix.
This adapter reshapes the token embeddings into what we call soft embeddings: domain-tuned representations of the corpus. Unlike full fine-tuning, this approach leaves the base model untouched while still adapting the embedding space to capture the terminology and structure of the target domain. The result is a far more memory-efficient way of steering a model toward specialized knowledge.
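To make this concrete, here is a minimal PyTorch sketch of what such an adapter might look like. The class name, near-identity initialization, and dimensions below are illustrative assumptions on our part, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class SoftEmbeddingAdapter(nn.Module):
    """Trainable transformation inserted between the tokenizer's
    embedding lookup and the frozen transformer blocks.
    Hypothetical sketch; shapes and names are illustrative."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # A simple square transformation matrix, initialized near the
        # identity so training starts from the base embedding space.
        self.transform = nn.Linear(embed_dim, embed_dim, bias=False)
        nn.init.eye_(self.transform.weight)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, embed_dim) from the frozen SLM
        return self.transform(token_embeddings)
```

Because only the adapter's weight matrix receives gradients, the trainable footprint is a tiny fraction of the full model's parameter count.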
2. Classifier-as-Retriever (CaR)
Traditional RAG systems rely on maximum inner product search (MIPS) to measure similarity between query embeddings and document embeddings. While effective at scale, MIPS is a fixed heuristic: it cannot learn or adapt.
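For reference, a static MIPS lookup amounts to a single dot product and a top-k selection. The sketch below is a toy illustration; the tensor names and shapes are our assumptions, not code from the paper:

```python
import torch

def mips_retrieve(query_emb: torch.Tensor, doc_embs: torch.Tensor, k: int = 1):
    """Static maximum inner product search.

    query_emb: (dim,) embedding of the query.
    doc_embs:  (num_docs, dim) precomputed document embeddings.
    Scores every document by its dot product with the query and returns
    the indices of the top-k documents. Nothing here is learned.
    """
    scores = doc_embs @ query_emb  # (num_docs,)
    return torch.topk(scores, k).indices
```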
We propose an alternative: attaching a classifier head to the frozen SLM. Trained on query–document pairs, the classifier learns to map queries directly to their corresponding documents. This transforms retrieval from a static lookup into a trainable similarity function that improves with exposure to domain-specific data.
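Conceptually, each document in the corpus becomes a class, and retrieval becomes classification. Here is a minimal PyTorch sketch of the idea; the pooling strategy, head shape, and the interface of the frozen model are illustrative assumptions on our part:

```python
import torch
import torch.nn as nn

class ClassifierRetriever(nn.Module):
    """Classifier head on top of a frozen SLM: every document in the
    corpus is a class, so retrieval becomes classification.
    Illustrative sketch; pooling and head shape are assumed."""

    def __init__(self, frozen_slm: nn.Module, hidden_dim: int, num_docs: int):
        super().__init__()
        self.slm = frozen_slm.eval()
        for p in self.slm.parameters():
            p.requires_grad = False          # base model stays untouched
        self.head = nn.Linear(hidden_dim, num_docs)  # trainable retriever

    def forward(self, query_tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.slm(query_tokens)      # (batch, seq_len, hidden_dim)
        pooled = hidden.mean(dim=1)          # simple mean pooling (assumed)
        return self.head(pooled)             # logits over document IDs
```

Training then reduces to standard cross-entropy between the predicted logits and the index of the ground-truth document for each query.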
In experiments, this shift made a dramatic difference. On frozen off-the-shelf SLMs, MIPS achieved only ~12% top-1 accuracy on downstream retrieval tasks. By contrast, our classifier-as-retriever approach boosted accuracy to 96–99%, a step change in retrieval quality.
Why these matter together
Both innovations — adapters for soft embeddings and classifier-as-retriever — can be used independently. The adapter improves how embeddings are shaped; the classifier makes retrieval adaptive rather than static. Combined, they provide a lightweight but powerful way to customize retrieval for any domain, without the overhead of full model fine-tuning.
Our approach offers two different training paths, each with its own balance of speed and accuracy.
Option A: Classifier-only training. Only the classifier head is trained; the base SLM and its embeddings stay frozen. This is the faster, cheaper path.
Option B: Classifier + adapters. The classifier head and the soft-embedding adapter are trained together, trading extra compute for the highest accuracy.
The choice depends on application needs. For some scenarios, “fast enough” accuracy combined with speed is ideal. For others, squeezing out the last percentage points of accuracy justifies the additional cost.
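In code, the difference between the two options comes down to which parameter groups the optimizer sees. A hypothetical sketch, reusing the adapter and classifier classes above (the optimizer choice and learning rate are placeholders):

```python
import torch

# Option A: train only the classifier head; everything else stays frozen.
opt_a = torch.optim.AdamW(retriever.head.parameters(), lr=1e-3)

# Option B: jointly train the classifier head and the soft-embedding
# adapter, trading extra compute for higher retrieval accuracy.
opt_b = torch.optim.AdamW(
    list(retriever.head.parameters()) + list(adapter.parameters()),
    lr=1e-3,
)
```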
Retrieval accuracy is only one part of the challenge. Training must also be fast and privacy-preserving, especially when data lives on the edge. That is why our architecture is designed to integrate with federated training and differential privacy, which keep raw data on-device while the classifier and adapters learn from it.
Key insight: These benefits are orthogonal. Even if you don’t adopt our retrieval architecture, federated training and differential privacy still make distributed learning faster and more secure.
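To illustrate how the pieces compose, here is a deliberately simplified sketch of one federated round with differential-privacy-style clipping and noise. The function name, clipping threshold, and noise scale are all illustrative assumptions; the paper's actual protocol may differ:

```python
import torch

def federated_round(global_params: torch.Tensor,
                    client_updates: list[torch.Tensor],
                    clip: float = 1.0,
                    noise_std: float = 0.01) -> torch.Tensor:
    """One simplified round of federated averaging with DP-style noise.

    Parameters are treated as flat tensors. Each client sends a parameter
    delta computed on its local data; the server clips each delta's norm,
    adds Gaussian noise, and averages the results into the global model.
    Raw client data never leaves the device.
    """
    noisy = []
    for delta in client_updates:
        norm = torch.linalg.vector_norm(delta)
        delta = delta * min(1.0, clip / (norm.item() + 1e-12))  # clip update
        noisy.append(delta + noise_std * torch.randn_like(delta))
    return global_params + torch.stack(noisy).mean(dim=0)
```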
Enterprises increasingly need retrieval systems that adapt to their own knowledge bases while meeting strict performance and privacy constraints. Our approach delivers on several fronts: retrieval accuracy, training cost, and data privacy.
Traditional approaches rely on large models, full fine-tuning, and static similarity search. Our approach flips the equation: a frozen small language model, lightweight trainable adapters, and a classifier that learns the similarity function rather than relying on a fixed lookup.
The leap in accuracy, from ~12% with MIPS on a frozen SLM to 96–99% with the classifier-as-retriever, shows how transformative this shift can be.
This research demonstrates a path to making retrieval smarter, faster, and safer for enterprise AI.
For the proofs, math, and experimental benchmarks, see the full paper on arXiv.
But the real story is what lies ahead: scaling these techniques across larger datasets, extending them into new domains, and weaving them into the broader fabric of distributed intelligence.
This is one milestone of many on our journey. And we’re just getting started.