What webAI's 94% Accuracy on the RobustQA Benchmark Means for Real World AI

Key Takeaways

  • Industry-leading accuracy: webAI's Knowledge Graph RAG achieved 94%+ accuracy on RobustQA, significantly outperforming current state-of-the-art systems.
  • Advanced retrieval approach: By eliminating traditional chunking methods, webAI’s system maintains complete semantic coherence, greatly enhancing accuracy and reducing storage complexity.
  • Real world multimodal advantage: Unlike text-only systems, webAI’s Knowledge Graph RAG excels at processing complex visual documents (diagrams, charts, tables), addressing realistic enterprise challenges effectively.
  • Direct business impact: Superior accuracy translates directly into improved decision-making, reduced operational errors, and increased productivity in complex industries like aviation, healthcare, and manufacturing.
  • RobustQA as (valuable!) interim validation: The benchmark is text-only and can't exercise our multimodal strengths, but it provides rigorous, standardized evidence of our retrieval accuracy while we prepare more comprehensive real world demonstrations.

    This is the second post in our Knowledge Graph RAG series.

    Post 1 introduced how traditional text-only RAG fails on messy enterprise documents, and showcased webAI’s proprietary vision-plus-language knowledge graph solution.

    Coming next, we take our KG RAG out of the lab and onto the factory floor, publishing industry-specific comparisons (manufacturing, aviation, healthcare) against leading LLMs and enterprise AI solutions.

    Here in post 2, we detail our 94% accuracy win on the RobustQA benchmark, confirming webAI’s performance advantage even on pure text retrieval tasks.

    RobustQA represents one of the industry's most rigorous benchmarks for evaluating text-based retrieval augmented generation (RAG) systems. Used across the AI research community to assess how effectively systems can retrieve and utilize relevant information to answer complex questions, RobustQA provides a standardized framework for measuring retrieval accuracy and generation quality.

    Our preliminary results demonstrate 94%+ accuracy on RobustQA, significantly surpassing current state-of-the-art solutions. These initial findings represent a conservative validation of our Knowledge Graph (KG) RAG approach and strongly suggest even greater performance potential as we scale our testing.

    What This Means for Customers

    This benchmark performance translates directly into tangible business value. Organizations using our KG RAG solution experience dramatically improved information retrieval accuracy, leading to faster decision-making, reduced operational errors, and enhanced productivity across knowledge-intensive workflows.

    Consider a real world scenario: An aviation maintenance team needs to quickly locate specific procedures across thousands of technical manuals. Traditional RAG systems often fragment critical information during processing, leading to incomplete or inaccurate responses. Our approach preserves document integrity, ensuring maintenance personnel receive complete, contextually accurate guidance—potentially preventing costly delays or safety issues.

    Industries experiencing the most significant impact include Aviation, Manufacturing, Healthcare, Legal Services, and Financial Services—sectors where document complexity and accuracy requirements are paramount. Early customer feedback consistently highlights our system's ability to handle intricate technical documentation while maintaining precision that traditional solutions simply cannot match.

    Understanding Our Initial Testing Approach

    Our RobustQA evaluation methodology differs fundamentally from conventional approaches, addressing a critical limitation that has plagued traditional RAG systems.

    The RobustQA dataset consists of documents that are already quite small—typically paragraph-sized excerpts or smaller passages extracted from real world PDFs. Each document contains a "text" field representing an unstructured block of natural language. However, the standard RobustQA approach takes these already-small documents and further chunks them into even smaller segments of maximum 100 tokens each, fragmenting context and adding unnecessary complexity. The traditional HIT@5 metric then measures whether the correct answer can be generated from any of the top five retrieved chunks.
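
For illustration, here is a minimal sketch of the conventional chunking step described above. It is not the benchmark's exact implementation: the whitespace tokenization and the field name are simplifications.

```python
def chunk_passage(text: str, max_tokens: int = 100) -> list[str]:
    """Split an already paragraph-sized passage into chunks of at most
    max_tokens, mirroring the conventional RobustQA preprocessing step.
    Whitespace tokenization stands in for a real tokenizer here."""
    tokens = text.split()
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

# A ~250-token passage from a document's "text" field becomes three
# separately indexed chunks, each retrieved and scored independently.
```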

    Here's where our approach diverges significantly: We treat each original document (passage) as a complete retrieval unit, eliminating the need for further chunking entirely. Our KG RAG pipeline recognizes that these documents are already appropriately sized and doesn't fragment them further.

    From an engineering perspective, our approach searches within the top 5 complete documents/passages to find answers, rather than searching through artificially fragmented 100-token chunks. This dramatically reduces storage requirements while eliminating noise introduced by over-segmentation of content that's already at an optimal size. Our HIT@5 metric refers to whether the complete answer exists within any of the top five entire documents—preserving semantic coherence that chunking destroys.
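
Below is a hedged sketch of how a passage-level HIT@5 score can be computed under this framing; the retrieve callable and the simple string-containment check are illustrative placeholders, not webAI's pipeline.

```python
from typing import Callable

def hit_at_5(
    questions: list[dict],
    retrieve: Callable[[str, int], list[str]],
    contains_answer: Callable[[str, str], bool],
) -> float:
    """Fraction of questions whose answer appears in any of the top-5
    retrieved passages (whole documents, not 100-token chunks)."""
    hits = 0
    for q in questions:
        top_passages = retrieve(q["question"], 5)
        if any(contains_answer(p, q["answer"]) for p in top_passages):
            hits += 1
    return hits / len(questions) if questions else 0.0

# Illustrative containment check; a real evaluation would use a stricter
# answer-matching or generation-based criterion.
def simple_contains(passage: str, answer: str) -> bool:
    return answer.lower() in passage.lower()
```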

    We have tested our approach across the complete RobustQA dataset, with results showing consistently stable, high accuracy performance. Our comprehensive evaluation demonstrates the robustness of our method across diverse document types and query complexities.

    This approach delivers significant efficiency gains. By avoiding unnecessary fragmentation of already-appropriate document sizes, we reduce storage overhead while maintaining superior retrieval accuracy. Traditional systems create artificial boundaries within coherent passages, often splitting critical context that our method preserves intact.
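
To make the storage point concrete, here is a back-of-the-envelope comparison using assumed, purely illustrative corpus numbers rather than measurements from RobustQA.

```python
# Assumed, illustrative numbers -- not measurements from the RobustQA corpus.
num_passages = 100_000          # passages in the corpus
avg_tokens_per_passage = 250    # already paragraph-sized
chunk_size = 100                # conventional max chunk length

chunks_per_passage = -(-avg_tokens_per_passage // chunk_size)  # ceiling -> 3
vectors_chunked = num_passages * chunks_per_passage            # 300,000 embeddings
vectors_whole = num_passages                                   # 100,000 embeddings

print(f"Chunked index stores {vectors_chunked / vectors_whole:.0f}x more vectors")
```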

    RobustQA's Limitations vs. Our Real World Strengths

    While RobustQA provides valuable validation, we must acknowledge its inherent limitations. The benchmark evaluates only text-based retrieval, which doesn't capture the full scope of our KG RAG capabilities.

    RobustQA cannot evaluate multimodal retrieval—our system's core differentiator. Real world documents contain complex visual elements: technical diagrams, data tables, charts, and interconnected multimedia content. Traditional text-only approaches fragment these relationships, losing critical contextual information.

    The technical difference is profound: Text-only retrieval systems process documents linearly, often missing the spatial and visual relationships that convey meaning in technical documentation. Our multimodal approach maintains these relationships, enabling more accurate interpretation of complex materials where text, images, and structured data work together to convey complete information.

    This limitation makes our upcoming real world tests particularly important—they will demonstrate our full multimodal advantage in scenarios that more accurately reflect actual enterprise document environments.

    Ongoing and Planned RobustQA Validation

    Having demonstrated strong performance across the complete RobustQA dataset, we're now implementing additional validation layers to reinforce and extend these findings.

    Our systematic approach includes cross-validation with alternative question sets, stress testing under various retrieval scenarios, and comparative analysis against other benchmark datasets. Each validation layer is designed to confirm the consistency of our high-accuracy performance across different evaluation frameworks.

    From a technical process standpoint, we're conducting iterative testing cycles that extend beyond RobustQA to include domain-specific benchmarks while maintaining rigorous evaluation standards. This methodical approach ensures that our 94%+ accuracy represents sustainable performance across diverse evaluation contexts.

    These expanded validation efforts will provide deeper insights into our system's performance characteristics while establishing comprehensive evidence of our approach's superiority over traditional chunking-based RAG architectures.

    What's Coming Next: Real World, Multimodal Demonstrations

    While RobustQA validation is crucial, our upcoming content will showcase where our KG RAG solution truly excels: real world, multimodal document processing.

    Future posts will feature direct head-to-head comparisons against leading industry solutions, including ChatGPT o3, Claude, and specialized enterprise RAG platforms. These comparisons will demonstrate our performance advantages in realistic scenarios.

    For example, in a recent test using F-18 technical flight manuals, documents rich with diagrams, tables, and complex technical specifications, our webAI KG RAG achieved 95% accuracy compared to ChatGPT o3's 80%. This 15-point advantage reflects our system's ability to process and interpret multimodal content that traditional text-based approaches cannot effectively handle.

    Visualized context connections in a technical manual

    We're preparing additional industry-specific demonstrations across aviation maintenance, manufacturing protocols, healthcare documentation, and legal case analysis. Each will illustrate how our unique multimodal capabilities translate into measurable performance improvements for real enterprise workflows.

    Visualizing Our Technical Approach

    To provide clearer insight into our methodology and enhance technical credibility, upcoming posts will include comprehensive visual documentation of our testing processes.

    These assets will feature:

    • Screenshots from our Navigator and Companion products demonstrating retrieval setup and execution
    • Screen recordings of actual testing procedures and results visualization
    • Visual comparisons showing our approach versus traditional chunking methods
    • Diagrams illustrating our page-based encoding strategy and document processing workflow

    The retrieval and execution setup in Navigator

    These visuals will clearly demonstrate our products performing technical retrieval tasks in real world scenarios, providing transparency into our methodology while showcasing the practical implementation of our KG RAG approach.

    Building Confidence Through Clarity

    Our initial 94%+ accuracy on RobustQA strongly validates the effectiveness of our KG RAG solution and our fundamental approach of eliminating chunking-related noise. This benchmark success, achieved through our innovative page-based encoding strategy, demonstrates measurable advantages over traditional RAG architectures.

    RobustQA provides legitimate interim validation while we prepare more comprehensive real world demonstrations. Our ongoing validation efforts will further reinforce these results, building toward conclusive evidence of our system's superior performance across diverse document types and enterprise scenarios.

    The combination of strong benchmark performance and our unique multimodal capabilities positions us to deliver transformative value for organizations dealing with complex, information-rich documentation.

    Want to learn more? Check out the video about our RobustQA benchmarking efforts.

    Sign Up Now for Upcoming KG RAG Webinar

    Ready to see how our KG RAG solution can transform your organization's document processing capabilities?

    Sign up for our upcoming webinar!

    How Leading Manufacturers Are Using Private AI and Knowledge Graph RAG to Power SOPs, QA, and Inspections

    August 27th, 2025 at 2 PM ET

    The webinar will feature live demonstrations, explore real world enterprise applications, and show off detailed head-to-head comparisons with leading AI solutions. Experience firsthand how our KG RAG approach transforms complex document processing across industries.
