Curious about what keeps experts, CEOs and other decision-makers in the Intelligent Document Processing (IDP) space on their toes? Get food for thought on IDP-related topics from the industry’s leading minds.
In this opinion piece, Souvik Mandal, Deep Learning Engineer at IDP vendor Nanonets, shares his insights on the recently published benchmark study, the Intelligent Document Processing (IDP) Leaderboard.
In the past year, the pace of progress in Vision-Language Models (VLMs) has been extraordinary, from increasingly accurate multimodal data extraction to significant advances in reasoning across complex document formats. Yet, despite this rapid innovation, there has been a persistent blind spot in our evaluation ecosystem: document understanding. To address this gap, the Intelligent Document Processing (IDP) Leaderboard was introduced, a unified multi-task benchmark developed in collaboration with the Indian Institute of Technology Indore and sponsored by Nanonets. The IDP Leaderboard is now publicly available at idp-leaderboard.org.
What Sets the Evaluation Apart
The Intelligent Document Processing (IDP) Leaderboard stands out as the most comprehensive benchmark for Vision-Language Models in the IDP domain. It evaluates model performance across six fundamental tasks: Optical Character Recognition (OCR), Key Information Extraction (KIE), Document Classification, Visual Question Answering (VQA), Table Extraction, and Long Document Processing. The leaderboard will soon include confidence score calibration as an additional evaluation task.
With coverage across 16 diverse datasets and more than 9,000 documents, the benchmark is designed to reflect the full spectrum of real-world challenges in document understanding. Each task is evaluated using multiple datasets—including public sources, synthetic collections, and newly annotated samples—ensuring both depth and breadth. Scores are then aggregated to provide a holistic view of a model’s performance across various aspects of each task.
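For illustration, the sketch below shows one simple way such per-dataset scores could be rolled up into per-task and overall numbers. This is only a minimal sketch with hypothetical dataset names and values, not the leaderboard's actual scoring code, which may weight datasets and tasks differently.

```python
# Minimal sketch of score aggregation: average each task's datasets, then
# average tasks into one overall score. All names and values are hypothetical.
from statistics import mean

# per-dataset accuracy (%) grouped by task -- illustrative values only
results = {
    "OCR": {"handwritten_set": 91.2, "scanned_receipts": 87.5},
    "KIE": {"invoices": 78.4, "forms": 81.0},
    "Table Extraction": {"sparse_tables": 62.3, "long_tables": 58.9},
}

# unweighted mean over each task's datasets
task_scores = {task: mean(scores.values()) for task, scores in results.items()}

# unweighted mean over tasks for a single headline number
overall = mean(task_scores.values())

print(task_scores)
print(f"Overall: {overall:.2f}%")
```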
Some Surprises From the Results
- Gemini 2.5 Flash leads overall, but surprisingly underperforms its predecessor on OCR.
- All models struggled with long document understanding – the top score was just 69.08%.
- Table extraction remains a bottleneck, especially for long, sparse, or unstructured tables.
- GPT-4o’s performance decreased in the latest version (gpt-4o-2024-11-20) compared to its earlier release (gpt-4o-2024-08-06).
- Token usage (and thus cost) varies dramatically across models — GPT-4o-mini was the most expensive per request due to high token usage.
The Road Ahead: Evolving Benchmarks for a Smarter Document AI Future
The IDP Leaderboard is not a final destination but a launchpad for continuous innovation in document AI. As part of its next phase, the leaderboard will introduce a confidence score calibration task, enhancing the ability to assess model reliability and uncertainty. In addition, new models will be added to the leaderboard, further expanding the range of solutions evaluated and pushing the boundaries of what’s possible in document AI. This evolution ensures that the benchmarks remain aligned with the growing complexity of real-world tasks and the need for more precise, adaptable models.
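To make the upcoming calibration task more concrete: a common way to measure how well a model's confidence scores match its actual accuracy is Expected Calibration Error (ECE). The sketch below is a generic ECE computation offered purely as an assumption of what such a metric can look like; it is not the leaderboard's official specification.

```python
# Generic Expected Calibration Error (ECE) sketch -- one standard calibration
# measure; not the leaderboard's official metric.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of predictions falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            bin_conf = confidences[mask].mean()   # mean predicted confidence
            bin_acc = correct[mask].mean()        # empirical accuracy in bin
            ece += mask.mean() * abs(bin_acc - bin_conf)
    return ece

# toy example: an overconfident model yields an ECE well above zero
print(expected_calibration_error([0.9, 0.95, 0.8, 0.99], [1, 0, 1, 0]))
```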
If you’re interested in exploring the benchmark itself, the datasets, or the code, everything is available at idp-leaderboard.org. For a detailed overview, you can also read the release blog.

About the Author
Souvik Mandal is a Deep Learning Engineer at Nanonets, specializing in large language models (LLMs), vision-language models (VLMs), and information extraction. His current interest is training small multimodal models on large datasets.