November 2025
YBA MAGAR achieves 59.08% answer correctness on the MuSiQue benchmark, closely matching Microsoft's fine-tuned PIKE-RAG and outperforming systems from Salesforce, Google, Huawei, and Peking University. With the highest Exact Match (53.2%) and F1 (69.5%) scores in the comparison, MAGAR sets a new standard in retrieval-augmented reasoning for enterprise AI.
We introduce MAGAR (Multi-Agent Graph-Augmented RAG), a retrieval-augmented generation framework that combines graph-based retrieval with multi-agent orchestration to support robust, multi-step in-context reasoning over company knowledge. To assess MAGAR’s generality for multi-hop reasoning, we benchmarked it on MuSiQue, a public multi-document multi-hop question-answering dataset. This report presents the evaluation results drawn from our experiment materials, explains the evaluation protocol, and provides an appendix with reproducibility notes. All numeric results in this report are taken from the supplied evaluation materials and have not been altered. These results confirm MAGAR’s effectiveness in retrieval-augmented reasoning and position YBA among the leaders in multi-hop question answering performance.

Figure: Comparative performance of YBA RAG (MAGAR) against leading in-context systems
YBA.ai builds in-context agents that automate knowledge work for go-to-market teams. Our MAGAR (Multi-Agent Graph-Augmented RAG) technology combines graph-based retrieval with multi-agent orchestration to deliver robust, multi-step reasoning and evidence-backed answers from a company’s data and knowledge bases.
Introduction
Enterprise GTM teams increasingly rely on accurate, evidence-backed answers drawn from internal documentation (handbooks, playbooks, product docs, CRM notes). Multi-hop questions — those that require linking facts across several documents and performing intermediate reasoning — remain a major challenge for standard retrieval-plus-generation pipelines.
MAGAR was developed to address this: it augments vector retrieval with a graph representation of knowledge and coordinates multiple specialized agents to produce grounded answers with provenance. MuSiQue is a relevant public benchmark for multi-hop QA; we used it to validate MAGAR’s ability to chain evidence and produce correct answers across documents.
This dataset is well suited to validating MAGAR because it rigorously probes complex reasoning abilities. Unlike simple Q&A, a MuSiQue question requires the system to:
- decompose the question into two or more dependent sub-questions (hops);
- retrieve and connect supporting evidence spread across multiple documents;
- carry out the intermediate reasoning steps in the correct order to reach the final answer.
This need to integrate evidence and maintain sequence directly aligns with MAGAR's core strengths: modeling relationships between information chunks and preserving coherent task sequences via its graph-based retrieval.
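To make the idea of graph-based, multi-hop retrieval more concrete, the sketch below shows one generic way evidence can be chained across documents: seed chunks are selected by vector similarity to the question, and later hops expand along graph edges that connect chunks sharing an entity. This is a minimal, illustrative sketch only, not MAGAR's actual implementation; the networkx graph, the toy embed function, the entity-overlap edges, and the hops/top_k parameters are all assumptions introduced for the example.

```python
import networkx as nx  # any graph library would do; used here only for illustration

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy stand-in for a real embedding model: hashed bag-of-words counts."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def build_chunk_graph(chunks: dict[str, dict]) -> nx.Graph:
    """Nodes are document chunks; an edge links two chunks that mention a shared entity."""
    g = nx.Graph()
    g.add_nodes_from(chunks)
    ids = list(chunks)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if set(chunks[a]["entities"]) & set(chunks[b]["entities"]):
                g.add_edge(a, b)
    return g

def multi_hop_retrieve(graph: nx.Graph, chunks: dict[str, dict],
                       question: str, hops: int = 2, top_k: int = 1) -> list[str]:
    """Seed by vector similarity, then expand along entity edges for later hops."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda cid: cosine(q_vec, embed(chunks[cid]["text"])),
                    reverse=True)
    selected = set(ranked[:top_k])
    frontier = set(selected)
    for _ in range(hops - 1):
        frontier = {n for cid in frontier for n in graph.neighbors(cid)} - selected
        selected |= frontier
    return [chunks[cid]["text"] for cid in selected]

# Tiny usage example with a two-hop question.
chunks = {
    "c1": {"text": "Widget X is made by Acme Corp.", "entities": ["Widget X", "Acme Corp"]},
    "c2": {"text": "Acme Corp was founded by Jane Doe.", "entities": ["Acme Corp", "Jane Doe"]},
    "c3": {"text": "Unrelated note about pricing.", "entities": []},
}
graph = build_chunk_graph(chunks)
print(multi_hop_retrieve(graph, chunks, "Who founded the company that makes Widget X?"))
```

In a production system the toy embedding would be replaced by a real embedding model, and graph edges could encode richer relations (citations, co-references, document structure) rather than simple entity overlap.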
MuSiQue dataset paper: https://arxiv.org/abs/2108.00573
To ensure an objective and comprehensive assessment of MAGAR’s performance, we evaluated the system using standard metrics widely adopted in Retrieval-Augmented Generation (RAG) research.
Evaluation Metrics:
Precision = matched tokens / tokens in the generated answer
Recall = matched tokens / tokens in the gold (reference) answer
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
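For reference, Exact Match and token-level F1 are typically computed with the SQuAD-style procedure sketched below (lower-casing, punctuation and article removal, whitespace tokenization). This is a minimal sketch of the common formulation and may differ in details from the exact normalization used in our evaluation harness.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lower-case, strip punctuation and articles, and collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a generated answer and a gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    matched = sum(common.values())
    if matched == 0:
        return 0.0
    precision = matched / len(pred_tokens)   # matched tokens / generated tokens
    recall = matched / len(gold_tokens)      # matched tokens / gold-answer tokens
    return 2 * precision * recall / (precision + recall)

# "the Eiffel Tower" and "Eiffel Tower" normalize to the same string, so EM = 1.0.
print(exact_match("the Eiffel Tower", "Eiffel Tower"))
# A partially overlapping answer gets a non-zero F1 even when EM would be 0.
print(token_f1("Gustave Eiffel's tower in Paris", "Eiffel Tower"))
```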
We tested MAGAR against the MuSiQue development set using two scenarios to ensure comprehensive validation and confidence in the results:
1. Full evaluation: all 1,127 questions in the evaluation set.
2. Subset evaluation: a random sample of 500 questions drawn from the same set.
The benchmark results show that our technology achieved an Answer Correctness of 46.50% on the full 1,127-question evaluation, with an Exact Match of 36.29% and an F1 Score of 53.30%. On the random 500-question subset, performance was higher: 59.08% Answer Correctness, 53.20% Exact Match, and 69.50% F1 Score.
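For reproducibility, the two evaluation scenarios above can be realized as in the sketch below: load the development questions, evaluate the full set of 1,127, and draw the 500-question subset with a fixed random seed. The file name, JSON Lines format, and seed value are illustrative assumptions; only the set sizes come from this report.

```python
import json
import random

def load_dev_set(path: str) -> list[dict]:
    """Load one question record per line (JSON Lines); format assumed for illustration."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def sample_subset(examples: list[dict], k: int = 500, seed: int = 42) -> list[dict]:
    """Draw a reproducible random subset using a fixed seed (seed value is a placeholder)."""
    return random.Random(seed).sample(examples, k)

dev_questions = load_dev_set("musique_dev.jsonl")  # placeholder file name
subset_500 = sample_subset(dev_questions, k=500)
print(f"full set: {len(dev_questions)} questions, subset: {len(subset_500)} questions")
```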
For comparison, we drew on the published evaluation results of the following systems:
- Microsoft: PIKE-RAG
- Google: Speculative RAG
- Salesforce: GPT-4o RAG + HyDE
- Peking University: HopRAG
- Huawei: GeAR
The following table compares YBA RAG (MAGAR) with the other systems:

System | Answer Correctness | Exact Match | F1
YBA RAG (MAGAR), full 1,127-question set | 46.50% | 36.29% | 53.30%
YBA RAG (MAGAR), 500-question subset | 59.08% | 53.20% | 69.50%
Microsoft PIKE-RAG (fine-tuned) | 59.60% | n/a | n/a
Salesforce GPT-4o RAG + HyDE | 52.20% | n/a | n/a
Google Speculative RAG | 31.57% | n/a | n/a
Peking University HopRAG | n/a | 42.2% | 54.9%
Huawei GeAR | n/a | 19.0% | 35.6%
(n/a = not reported in this comparison)
By comparing YBA RAG (MAGAR) with other retrieval-augmented generation systems, we observe that it achieves one of the highest overall performances on the MuSiQue benchmark. With an answer correctness of 59.08%, YBA’s model performs nearly on par with Microsoft’s fine-tuned PIKE-RAG (59.60%), while surpassing Salesforce’s GPT-4o RAG + HyDE (52.20%), Google’s Speculative RAG (31.57%), Peking University’s HopRAG (42.2% EM, 54.9% F1), and Huawei’s GeAR (19% EM, 35.6% F1). Notably, YBA MAGAR achieves the best Exact Match (53.2%) and F1 score (69.5%) among all compared models, demonstrating superior consistency between retrieved context and generated answers. This indicates that MAGAR’s multi-agent retrieval mechanism effectively enhances answer accuracy and contextual alignment.
Please refer to the results comparison chart shown above.
Note: All benchmarking results are derived from the MuSiQue development dataset and verified using standard RAG evaluation metrics.