webAI-ColVec1 and the Case for Smarter Retrieval Models

April 14, 2026

By retrieving directly from document pages instead of relying on OCR-first pipelines, webAI-ColVec1 reached #1 on ViDoRe V3 and makes the case for smarter retrieval models.

For the last few years, much of the AI conversation has centered on scale: bigger models, bigger training runs, bigger infrastructure, bigger claims.

But in practice, some of the most important advances are coming from somewhere else: models that are more efficient, more specialized, and better matched to the task in front of them.

That is especially true in document retrieval.

Most document retrieval systems still rely on OCR (optical character recognition), which turns document pages into extracted text before retrieval begins. That can work for clean text, but it often breaks down on the documents that matter most in real-world enterprise environments: tables, charts, scanned pages, dense layouts, and visually complex PDFs.

We set out to solve that problem by building the most accurate end-to-end retrieval model we could, designed from the start to skip OCR entirely. webAI-ColVec1 retrieves directly from rendered page images instead of relying on text extraction as an intermediate step, preserving more of the document’s original structure and meaning. And we’re open sourcing it to the community.

As of today, that approach has earned a major validation point: ColVec1 is ranked #1 on the ViDoRe V3 leaderboard, the gold-standard benchmark for multimodal enterprise document retrieval. That result is not just a milestone for this model family. It is a signal that the future of AI will not be won by scale alone. It will be won by smarter, more efficient models built for real-world tasks.

Why is document retrieval so important?

As large language models and their context windows grow, it becomes more tempting to pass entire documents or databases directly into the prompt. In practice, that rarely works as well as people hope. Large inputs are expensive, difficult to manage, and often less reliable than they appear.

Just as importantly, they make it harder to trace where an answer actually came from. If a model is given an enormous mass of content all at once, it becomes less clear which page, chart, image, or section it relied on to answer the question. That is a real problem in enterprise environments, where answers often need to be reviewed, referenced, and audited.

Retrieval helps narrow that problem. Instead of asking a model to reason over everything at once, it helps surface the most relevant source material first. That makes the overall system more efficient, more relevant, and easier to trust.

The problem with OCR-first document retrieval

OCR-first pipelines introduce a structural mismatch between documents and retrieval systems. A document page is inherently visual. It encodes meaning not only in text, but in layout, hierarchy, spacing, and graphical elements. When that page is reduced to extracted text, much of that structure is either lost or approximated.

This loss becomes critical in documents where meaning depends on layout. Tables lose alignment, charts lose context, and multi-column documents collapse into ambiguous sequences of text. Errors introduced during OCR propagate forward into indexing and retrieval, making downstream systems less reliable. At the same time, the preprocessing step adds latency and complexity to production pipelines.

In production environments, the documents that matter most are often the hardest ones to handle with brittle OCR-first systems: technical manuals, financial filings, healthcare documents, scientific papers, government reports, and pages with dense tables or mixed visual structure. We believe retrieval systems should be trained on that reality, not abstracted away from it.

Image from: ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios

Key takeaway: Most document retrieval systems still start by turning PDFs into text. ColVec1 starts with the page itself. That shift matters.

What ColVec1 Is

ColVec1 is part of our ColVec family of vision-language retrieval models, designed specifically for visual document retrieval. We trained multiple variants across both 4B and 9B backbones, with retrieval embedding sizes that reflect different deployment tradeoffs between quality and efficiency.

Across those variants, the core idea stays the same. We feed document pages as images and queries as text, then train the model to place matching query-page pairs closer together in a retrieval embedding space. The result is a model family that can search directly over full document pages without requiring an OCR pipeline.

This approach was popularized by models like ColPali. ColVec1 extends that direction with stronger training and scaling across model sizes.
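Retrievers in the ColPali lineage typically score a query against a page with late interaction (MaxSim): every query token picks its best-matching page patch, and those per-token maxima are summed. The sketch below illustrates that scoring pattern in NumPy; it is an assumption that ColVec1 scores this way, and the toy embeddings are purely illustrative.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim) score, ColBERT/ColPali style: each query
    token takes its best cosine match among the page's patch embeddings,
    and the per-token maxima are summed."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = page_vecs / np.linalg.norm(page_vecs, axis=1, keepdims=True)
    sims = q @ p.T  # (num_query_tokens, num_page_patches) cosine similarities
    return float(sims.max(axis=1).sum())

# Toy example: a page containing near-matches for the query tokens should
# outscore a page of unrelated patches.
rng = np.random.default_rng(0)
query = rng.normal(size=(3, 8))                       # 3 query-token embeddings
page_a = rng.normal(size=(4, 8))                      # unrelated page
page_b = np.vstack([query + 0.01 * rng.normal(size=query.shape),
                    rng.normal(size=(1, 8))])         # page with near-matches
assert maxsim_score(query, page_b) > maxsim_score(query, page_a)
```

Because every query token is matched independently, this kind of scorer can credit a page for a table cell or chart label that a single pooled embedding would wash out.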

ColVec1 at a glance

Task: Visual document retrieval
Approach: Direct image retrieval, not OCR-first
Model family: Qwen 3.5-based vision-language retrieval models
Variants: 4B and 9B backbones
Embedding sizes: 640 (4B), 2560 (9B)
Benchmark: ViDoRe V3
Result: #1 on the ViDoRe V3 leaderboard

What it took to build ColVec1

Getting strong visual retrieval performance was not the result of a single trick. It came from a training pipeline designed to expose the model to a wide range of real-world document types and reinforce the kinds of supervision that matter most for retrieval.

The training mixture combined internally collected document corpora, public document datasets adapted into retrieval format, and existing public retrieval datasets that were already closer to the target task. Together, these sources produced a deliberately heterogeneous mixture of scientific papers, enterprise-style reports, multilingual documents, financial materials, visually rich tables, and synthetic retrieval supervision.

That breadth was important. We did not want to optimize for one narrow document type and then generalize poorly outside it. We wanted the model to learn the reality of production retrieval, where page structure, content density, and document quality vary dramatically from one corpus to the next.

To build domain-specific synthetic document data, we collected documents from the open web, rendered each page into an image, and generated retrieval-style queries using a vision-language model. The goal was not to produce generic descriptions of pages. It was to produce the kind of question a user would plausibly ask when trying to find that page inside a large multimodal document collection.
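The synthetic-data step can be sketched roughly as below. The `ask_vlm` callable stands in for a vision-language model prompted to write a user-style question about a rendered page; the names and the deduplication policy are illustrative assumptions, not the actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class QueryPagePair:
    query: str
    page_image: str  # path to the rendered page image

def build_synthetic_pairs(
    page_images: Iterable[str],
    ask_vlm: Callable[[str], str],
) -> list[QueryPagePair]:
    """Generate retrieval supervision from rendered pages.

    `ask_vlm` is a placeholder for a vision-language model prompted to
    produce the question a user would plausibly ask when looking for this
    page, rather than a generic description of it.
    """
    pairs, seen = [], set()
    for image in page_images:
        query = ask_vlm(image).strip()
        if query and query not in seen:  # drop empty or duplicate queries
            seen.add(query)
            pairs.append(QueryPagePair(query=query, page_image=image))
    return pairs
```

In practice the generator would be driven by a real VLM call per page; the skeleton just shows how pages become (query, page) training pairs in one unified format.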

We also adapted public document datasets into retrieval supervision. In cases where a dataset provided page images but not retrieval-ready queries, we sampled pages and generated query-page pairs to normalize them into a unified retrieval format. That allowed us to train across scientific, business, multilingual, financial, and visually complex document collections through one consistent pipeline.

The result was a multimodal dataset of roughly 2 million question-image pairs, built not around benchmark-only distributions, but around a wider range of document types that more closely reflect the environments in which retrieval systems actually operate.

Training recipe

Backbone: Qwen 3.5 vision-language model
Adaptation method: LoRA + retrieval projection layer
LoRA settings: Rank 32, alpha 32, dropout 0.1
Optimizer: paged_adamw_8bit
Learning rate: 5e-5
Scheduler: Linear
Hardware: 8 A100 GPUs
Effective batch size: 512
Loss function: Proprietary
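A recipe like the one above might be expressed with Hugging Face peft and transformers roughly as follows. This is a sketch under assumptions: the target modules and the per-device/accumulation split of the 512 effective batch are guesses, not the actual webAI training code.

```python
# Illustrative LoRA + trainer configuration matching the recipe above.
# Target modules and batch-size breakdown are assumptions.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=32,             # LoRA rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)

training_args = TrainingArguments(
    output_dir="colvec1-lora",
    optim="paged_adamw_8bit",        # 8-bit paged AdamW (bitsandbytes)
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    per_device_train_batch_size=8,   # assumed: 8 GPUs x 8 x 8 accum = 512 effective
    gradient_accumulation_steps=8,
)
```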

This scale matters because ColVec1 uses in-batch negatives. For every query in a batch, its matched page is the positive example, while the other pages in that batch serve as negatives. A larger effective batch size gives each query a larger pool of competing pages, which strengthens the contrastive learning signal.
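The in-batch-negatives idea is easiest to see in a standard InfoNCE-style contrastive loss. The actual ColVec1 objective is proprietary; the NumPy sketch below is only the generic form commonly used for this setup, where the similarity matrix's diagonal holds the positives and everything else in the row acts as a negative.

```python
import numpy as np

def in_batch_contrastive_loss(query_embs: np.ndarray, page_embs: np.ndarray,
                              temperature: float = 0.05) -> float:
    """Generic InfoNCE with in-batch negatives (illustrative only; the
    actual ColVec1 loss is proprietary). Row i's positive is page i;
    every other page in the batch serves as a negative."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # cross-entropy on the diagonal
```

With a batch of 512, each query's positive must outrank 511 competing pages, which is exactly why the larger effective batch strengthens the training signal.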

How the benchmark works

ViDoRe V3, short for Visual Document Retrieval Version 3, is quickly becoming the gold standard for evaluating multimodal document retrieval in enterprise settings. It tests models against roughly 26,000 document page images and more than 3,000 human-verified queries across 10 professional domains, including finance, pharma, energy, and industrial documents with dense tables, mixed layouts, and visual complexity.

That is what makes it a meaningful benchmark for this launch. ViDoRe V3 is not built around simplified text retrieval tasks. It is built to measure how well a model retrieves information from the kinds of documents real-world teams actually work with.

Models are ranked using NDCG@10, which measures how well a system surfaces the most relevant pages within its top results. A top result here says something specific: the model can compete on a harder, more realistic retrieval problem than benchmarks built around cleaner, more text-centric corpora.
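NDCG@10 has a standard definition: discount each retrieved page's relevance by its rank, then normalize by the best achievable ordering. The sketch below follows that textbook formula (it is not ViDoRe's evaluation code).

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k over graded relevances listed in the order the system
    returned its results: DCG of the ranking divided by the DCG of the
    ideal (descending-relevance) ordering."""
    def dcg(rels: list[float]) -> float:
        # Rank r (0-based) is discounted by log2(r + 2), so rank 0 counts fully.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A system that surfaces the most relevant pages first scores higher than
# one that buries them deep in the result list.
assert ndcg_at_k([3, 2, 0, 1]) > ndcg_at_k([0, 1, 2, 3])
```

The metric rewards putting the right pages at the very top, which matches how retrieval results are actually consumed downstream.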

Why this result matters

The leaderboard result matters, of course. But the larger point is why it matters.

First, it validates direct image retrieval as a serious path forward. Strong retrieval performance does not require an OCR-first intermediate. ColVec1 shows that a model trained directly on rendered document pages can compete at the highest level on a benchmark built around visually complex enterprise documents.

Second, it validates specialized training over brute-force scale. This result did not come from throwing a larger general-purpose model at the problem. It came from a deliberate retrieval-specific recipe: a heterogeneous training mixture, targeted reweighting, efficient adaptation, and an objective built for ranking quality.

Third, it reinforces a broader point about where AI is going. For years, the default assumption has been that stronger performance comes mainly from scaling up: bigger models, bigger infrastructure, bigger budgets. But many practical systems do not need the largest possible model. They need the right model for the job: one that is efficient to run, trained for the actual structure of the task, and capable of delivering reliable performance in production settings.

ColVec1 is a proof point for that shift. In visual document retrieval, smarter model design is not a compromise. It is the advantage.

What comes next

We are open sourcing ColVec1 because we think this shift should be visible, reproducible, and useful.

Over time, we expect retrieval to become an even more important layer in production AI systems, especially as enterprises work with increasingly multimodal, visually complex, and domain-specific knowledge bases. Better generation gets a lot of attention. But better retrieval is often the thing that determines whether a system is genuinely useful.

ColVec is our contribution to that direction: a retrieval model family built for real documents, trained for real retrieval behavior, and now publicly available for developers and researchers to explore.

This is not the end of the story. It is one proof point in a larger shift already underway.

The next chapter of AI will not be defined only by the biggest models in the world.

It will also be defined by the smartest ones.

Both models are available on Hugging Face:

webAI-ColVec1-9B

webAI-ColVec1-4B