Curious about what keeps experts, CEOs and other decision-makers in the Intelligent Document Processing (IDP) space on their toes? Get food for thought on IDP-related topics from the industry’s leading minds.
In this opinion piece, Christopher Helm, CEO of Helm & Nagel GmbH with its Intelligent Document Processing (IDP) solution Konfuzio, takes an in-depth look at what the validation gap in document automation software is and how language models are setting new standards.
Walk into a real estate financing department that deployed automation software two years ago: they’re still manually reviewing every property valuation, every income verification, every title search. Visit an insurance underwriting team: extracted policy terms still require human verification before binding. Watch a procurement team handling industrial material specifications: extraction is fast, but every tensile strength value, every compliance certificate, every chemical composition gets checked manually. The automation paradox persists across every complex document domain: we extract everything, yet verify everything.
Why? Because confidence ≠ correctness. A system can report 99% confidence while being 100% wrong. And because the cost of a single false positive (financing a property with a fabricated appraisal, accepting non-compliant materials, issuing a policy with misunderstood exclusions) dwarfs the cost of manual review.
The breakthrough isn’t eliminating human involvement. It’s eliminating the verification bottleneck, so humans can focus on what AI cannot: judgment calls that require business context, risk appetite, and strategic decisions. When documents are validated as they’re read, every employee becomes a decision-maker instead of a data checker. Time from document receipt to action collapses.
But this shift requires solving a problem the industry has been ignoring.
The Real Problem Isn’t Extraction
A model can confidently extract “tensile strength: 470 MPa” from a material certificate (tensile strength measures how much stress a material can withstand before breaking). Perfect bounding box, clean OCR, 0.97 confidence score. One problem: this steel grade’s specification requires a range of 470-630 MPa, and the document states “minimum 470 MPa” but doesn’t state the actual tested value. The model extracted the specification limit, not the test result. It extracted accurately but failed completely.
This is the validation gap. OCR software gives you only the text. Extraction tells you what’s written. Validation tells you whether it makes sense in context.
Documents don’t exist in a vacuum. A property appraisal connects to comparable sales data, tax assessments, previous appraisals, neighborhood trends, and lending guidelines. A material certificate links to purchase specifications, quality standards, previous shipments, and application requirements. An insurance application references medical records, claims history, policy terms, and underwriting rules.
The question isn’t “Did we find all the fields?” but “Is this business-logically sound?”
Three Dead Ends
Hard-coded rules: “If material grade X, then tensile strength must be between Y and Z.” But material properties vary by manufacturing process, heat treatment, and application context. One hundred material types, each with conditional specifications based on intended use, require thousands of rules. The deeper problem: rules only encode what you explicitly anticipate. They fail on the long tail, which is 80% of volume.
Confidence thresholds: “If confidence < 85%, human review is needed.” Confidence scores measure model certainty, not data validity. A model can be highly confident that an appraisal says “property value: €450,000” while missing that this is the assessed tax value, not the market appraisal value. High-confidence errors pass through. Low-confidence correct extractions clog review queues.
Post-extraction validation: “Extract first, validate later.” Context gets lost at every handoff. The model that reads the property appraisal doesn’t know that comparable sales in the document are from 18 months ago, and that neighborhood has appreciated 22% since then. The system that knows market trends can’t see which comparables the appraiser actually used.
All three treat validation as an afterthought to extraction, not part of understanding.
The Semantic Shift
Language models changed the game by making validation native to extraction. The key difference: IDP systems can now perform the reasoning a domain expert would perform.
Old approach: Extract “property value: €450,000” → done. A human reviews it later against comparables they manually look up.
New approach: While reading the document, the system sees the three comparables the appraiser cited, knows current market appreciation rates for that postal code, performs the same adjustment calculations a reviewing appraiser would do, and determines whether the methodology is sound, all during extraction, not after.
This is possible because language models can:
Perform domain reasoning: “Comparable 1 sold for €420,000 twelve months ago and is 15% smaller. This postal code appreciated 8% in that period. Size-adjusted value: €420k ÷ 0.85 = €494k. Time-adjusted: €494k × 1.08 = €533k. Wait, that’s higher than the appraised value, not lower. The appraiser may have under-adjusted. Flag this comparable for review.”
Aggregate context from multiple sources: “The certificate shows tensile strength test result of 515 MPa. Specification requires minimum 470 MPa, passes. But I also have access to this supplier’s last 10 shipments: they averaged 580 MPa with σ=15 MPa. This result is more than 4 standard deviations below their norm. Still within spec, but represents unusual performance degradation. This is worth noting.”
Detect temporal patterns: “Application submitted Tuesday. Medical records show diagnosis code entered in the system the previous Friday. Questionnaire asks ‘any diagnosis in past 6 months’, applicant answered ‘no.’ Three-day gap between diagnosis and application suggests either the applicant doesn’t know yet (medical results pending) or is deliberately omitting recent information. Either way, requires underwriting review before proceeding.”
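The first capability above is essentially arithmetic a reviewing appraiser would do by hand. A minimal sketch of that adjustment logic, using the illustrative figures from the example (the function name and parameters are hypothetical, not any vendor’s API):

```python
# Sketch of the size/time adjustment a reviewing appraiser would perform.
# Values mirror the worked example above; all names are illustrative.

def adjust_comparable(sale_price: float, size_ratio: float,
                      annual_appreciation: float, months_ago: int) -> float:
    """Adjust a comparable sale to the subject property's size and today's market."""
    size_adjusted = sale_price / size_ratio  # comparable is smaller -> scale up
    time_adjusted = size_adjusted * (1 + annual_appreciation) ** (months_ago / 12)
    return time_adjusted

# Comparable 1: sold for €420,000 twelve months ago, 15% smaller,
# postal code appreciated 8% over that period.
indicated_value = adjust_comparable(420_000, size_ratio=0.85,
                                    annual_appreciation=0.08, months_ago=12)

appraised_value = 450_000
if indicated_value > appraised_value:
    print(f"Flag: comparable indicates ~€{indicated_value:,.0f}, "
          f"above the appraised €{appraised_value:,.0f}")
```

The point is not that this arithmetic is hard, but that a language model can perform it inline, while reading the document, rather than in a separate review step.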
The architectural difference: validation logic isn’t hard-coded rules or post-processing checks. It’s reasoning that happens while reading the document, using the same contextual knowledge a human expert would use.
This is why it’s called semantic intelligence: the system understands what the numbers mean, not just where they appear on the page.
The Complexity Frontier
What’s changed in 2026 isn’t just that we can validate better; it’s that we can validate more dimensions simultaneously.
Traditional: “Does the extracted material grade match the purchase order?”
Contemporary: “Does the extracted material grade match the purchase order, and are the stated material properties consistent with that grade’s specification, and are test results within expected variance for this manufacturer, and do the testing standards cited match the application requirements, and is the certifying laboratory accredited, and does the certificate date align with shipment date, and are there any deviations noted that affect fitness-for-purpose?”
This requires simultaneous access to the document, material specifications, supplier quality history, standards databases, laboratory accreditation lists, application requirements, and historical test data.
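Run together, the chained checks above amount to a list of findings per certificate. A minimal sketch of that one-pass validation, in which every structure (field names, spec table, supplier history) is a hypothetical placeholder for the grounded data sources just listed:

```python
# Illustrative one-pass certificate validation; data shapes are assumptions.
from dataclasses import dataclass

@dataclass
class Finding:
    check: str
    passed: bool
    note: str = ""

def validate_certificate(cert, spec, supplier_history, accredited_labs):
    findings = []
    # 1. Grade on the certificate matches the purchase order
    findings.append(Finding("grade_match", cert["grade"] == spec["grade"]))
    # 2. Tested value lies inside the grade's specification range
    lo, hi = spec["tensile_range_mpa"]
    findings.append(Finding("within_spec", lo <= cert["tensile_mpa"] <= hi))
    # 3. Result within expected variance for this supplier (population sigma)
    mean = sum(supplier_history) / len(supplier_history)
    sigma = (sum((x - mean) ** 2 for x in supplier_history)
             / len(supplier_history)) ** 0.5
    z = (cert["tensile_mpa"] - mean) / sigma if sigma else 0.0
    findings.append(Finding("supplier_variance", abs(z) <= 3.0,
                            note=f"z = {z:+.1f} vs supplier norm"))
    # 4. Certifying laboratory is accredited
    findings.append(Finding("lab_accredited", cert["lab"] in accredited_labs))
    return findings

cert = {"grade": "S355J2", "tensile_mpa": 515, "lab": "Lab A"}
spec = {"grade": "S355J2", "tensile_range_mpa": (470, 630)}
history = [575, 590, 580, 565, 585, 600, 570, 580, 595, 560]
results = validate_certificate(cert, spec, history, {"Lab A", "Lab B"})
```

With these sample numbers the certificate passes the specification check but fails the supplier-variance check, exactly the “within spec, but unusual” pattern described earlier.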
This isn’t validation as quality control. This is validation as intelligence, detecting counterfeit certificates, catching specification drift, identifying supplier quality issues before they cause failures.
Real impact: A manufacturer discovers that a supplier’s material certificates show test results clustering at exactly the minimum specification values across 40 shipments, a statistical impossibility for real testing and a clear indicator of fabricated certificates. A real estate lender identifies that an appraiser consistently uses comparables from a specific date range that maximizes property values, avoiding more recent (lower) sales. An insurer detects that applications from a particular broker consistently omit specific pre-existing conditions that appear in medical records.
These aren’t single-document anomalies. They’re behavioral patterns that emerge when you validate documents in the context of everything you know.
The Data Grounding Problem
But here’s what the prompt engineering narrative misses: the best prompt is useless without grounded data.
You can write: “Check if this material’s tensile strength is within specification for the intended application.” But if your system doesn’t have the material specifications database, or doesn’t know the intended application, or can’t access historical test data for comparison, the prompt fails silently.
This is the uncomfortable truth: it’s still a data problem. Just a different data problem.
We’ve moved from an annotation problem to a grounding problem.
Consider what “validate this material certificate” requires:
- Standards data: What are the specification ranges for this material grade? Which testing standards apply?
- Application data: What is this material being used for? What properties are critical for that application?
- Supplier data: What’s this supplier’s quality history? What’s their typical test result variance?
- Accreditation data: Is the testing laboratory certified? For which standards?
- Historical data: How do these test results compare to previous shipments?
A document agent can’t validate anything meaningful without this data accessible, queryable, and structured.
Grounded data means:
- Accessible: The agent can query it during extraction, not just after
- Consistent: Material “S355J2” isn’t sometimes “S355 J2” or “S355-J2” with different specifications
- Current: Standards evolve; ISO 9001:2015 isn’t the same as ISO 9001:2008
- Relational: The agent can traverse from certificate → material grade → specification → application → acceptance criteria
- Contextual: Not just “tensile strength minimum 470 MPa” but “tensile strength minimum 470 MPa per EN 10025-2 for structural applications, minimum 510 MPa for pressure vessel applications per EN 10028-2”
Most enterprise data fails at least three of these criteria.
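The “consistent” criterion above can be as mundane as canonicalizing grade spellings before any specification lookup. A minimal sketch, assuming a spec table keyed by one canonical form per grade (the table contents are illustrative):

```python
import re

def canonical_grade(raw: str) -> str:
    """Collapse spaces and hyphens and uppercase: 'S355 J2' -> 'S355J2'."""
    return re.sub(r"[\s\-]", "", raw).upper()

# One canonical key per grade, so every spelling resolves to the same spec.
specs = {"S355J2": {"tensile_range_mpa": (470, 630)}}

variants = ["S355J2", "S355 J2", "s355-j2"]
resolved = [canonical_grade(v) for v in variants]
```

Trivial in isolation; the unglamorous work is applying this kind of normalization consistently across every system a document agent has to query.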
The competitive advantage in 2026 isn’t better prompts. It’s better data infrastructure. Organizations achieving 80%+ straight-through processing on complex documents aren’t the ones with clever prompts; they’re the ones whose specifications databases, quality management systems, and supplier data are clean, integrated, and accessible to document agents.
The prompt is the interface. The data is the intelligence.
You can have a mediocre prompt and excellent grounded data, and catch most errors. You cannot have an excellent prompt and poor data grounding, and expect anything meaningful.
Semantic document processing doesn’t eliminate the data problem. It exposes it. When your validation prompt says “check if this appraisal methodology is appropriate” and discovers that comparable sales data is in broker emails, market trend data is in analyst PDFs, and lending guidelines are in policy documents with no structured link to property types, you haven’t solved document processing. You’ve discovered your data infrastructure isn’t ready for intelligent automation.
The winners in document automation aren’t the ones with the best AI. They’re the ones who’ve done the unglamorous work of structuring specifications, integrating quality systems, standardizing supplier data, and building the grounding layer that makes intelligent validation possible.
What Actually Changes
Ten years ago, encoding “check if this material’s test results align with historical supplier performance and flag unusual degradation patterns” required custom code: database queries, statistical calculations, business rule engines. You needed a developer to translate domain expertise into software.
Now you describe it: “Compare this shipment’s test results to the last 10 shipments from this supplier. Calculate mean and standard deviation. If this result exceeds 3 standard deviations below the mean, flag for review with context showing the historical range and the magnitude of deviation.”
The validation logic a procurement engineer carries in their head can be written as instructions, not code. Given the right tool to access grounded data (supplier history, specifications, standards), the system can reason over it the way the engineer would.
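For comparison, here is roughly what that natural-language instruction compiles down to. A sketch of the underlying check, with hypothetical shipment values (function name and thresholds are illustrative, not a specific product’s API):

```python
import statistics

def flag_degradation(history_mpa: list[float], new_result_mpa: float,
                     threshold_sigmas: float = 3.0):
    """Flag a result more than `threshold_sigmas` below the supplier's mean."""
    mean = statistics.mean(history_mpa)
    sigma = statistics.stdev(history_mpa)  # sample standard deviation
    sigmas_below = (mean - new_result_mpa) / sigma
    if sigmas_below > threshold_sigmas:
        return {"mean": mean,
                "sigmas_below": round(sigmas_below, 1),
                "historical_range": (min(history_mpa), max(history_mpa))}
    return None  # within expected variance

# Last 10 shipments from this supplier (hypothetical values, mean 580 MPa)
history = [575, 590, 580, 565, 585, 600, 570, 580, 595, 560]
flag = flag_degradation(history, new_result_mpa=515)
```

Ten years ago, someone had to write, deploy, and maintain this. Now the instruction in the previous paragraph is the whole interface, provided the shipment history behind `history` actually exists somewhere queryable.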
This is why the data problem is the real problem. The reasoning capability exists. The ability to express complex validation logic in natural language exists. What doesn’t exist in most organizations is clean, accessible, structured data for that reasoning to operate on. For the first time, domain experts can work independently, if, and only if, someone has built the data infrastructure to support them.
The question that matters: Is this document business-logically correct?
That question can’t be answered by layout analysis. It requires knowing what the numbers mean in their domain context, how they relate to specifications and requirements, whether they cohere with everything else the business knows, and whether patterns across documents reveal risks invisible in single transactions.
The gap between extraction and trust closes not when we read documents better, but when we understand what they mean, and what they mean in the full context of domain knowledge, standards, history, and relationships.
The validation problem hasn’t gotten simpler. But for organizations willing to structure their data, something remarkable becomes possible: translating internal process guidelines directly into document intelligence.
A procurement engineer who knows “we flag any shipment where test results are more than 3 standard deviations below supplier norms” can now write exactly that. A real estate underwriter who knows “we require comparable sales within 6 months and adjusted for size differences over 10%” can describe that logic in a paragraph. An insurance underwriter who knows “applications submitted within one week of a new diagnosis code require additional medical review” can encode that institutional knowledge as instructions.
The document agent that emerges isn’t generic AI. It’s your organization’s validation logic, the expertise that lives in procedure manuals, training documents, and experienced employees’ heads, made executable. Not because the AI is smarter, but because you finally have the data infrastructure and the interface (natural language) to make that expertise operational.
Prompts are cheap. Grounded data is expensive. But grounded data plus institutional knowledge expressed as instructions: that’s what separates working systems from impressive demos. And for the first time, the people who understand the domain can build those systems themselves.

About the Author
Christopher Helm is CEO of Helm & Nagel GmbH, the company behind Konfuzio. His work focuses on cognitive automation in complex domains, moving from extracting data to understanding whether it’s valid.
