
AI Extraction Accuracy: Why “99%” Is the Wrong Question

By Robin Maier, June 16, 2026

Anyone shopping for automated document capture software runs into the same number everywhere: “up to 99% accuracy.” It’s on the websites of practically every vendor, it sounds reassuring — and it answers a question that is almost meaningless for actual business operations.

Because the question that determines whether you can trust an extraction system is not: How often is the system right? It is: How does the system know when it’s wrong? This article explains why the 99% rhetoric deceives in three ways — and what quality target belongs in its place.

Deception 1: field accuracy is not document accuracy

The advertised accuracy almost always refers to individual fields: order number correct, date correct, line 3 quantity correct. But a business document consists of many fields: a purchase order with eight line items quickly adds up to 30 to 50 extracted values. And if field errors are roughly independent of one another, the per-field probabilities multiply:

At 99% field accuracy and 30 fields, a document is only about 74% likely to be completely error-free (0.99³⁰).
At 97% field accuracy, a perfectly respectable figure on real documents, that drops to around 40% (0.97³⁰).
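The two figures are easy to check. A minimal sketch in Python, assuming every field has the same accuracy and that field errors are independent (a simplification; the function name is purely illustrative):

```python
def document_accuracy(field_accuracy: float, n_fields: int) -> float:
    """Probability that all fields of a document are correct.

    Assumes identical per-field accuracy and independent field errors.
    """
    return field_accuracy ** n_fields


for accuracy in (0.99, 0.97):
    share = document_accuracy(accuracy, 30)
    print(f"{accuracy:.0%} per field, 30 fields: {share:.0%} error-free documents")
# 99% per field, 30 fields: 74% error-free documents
# 97% per field, 30 fields: 40% error-free documents
```

On real documents errors tend to cluster (a bad scan damages many fields at once), so the independence assumption only gives a rough estimate. The direction of the effect stays the same.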

Put differently: a system advertising “99% accuracy” can be measured entirely honestly and still deliver roughly one document in four with at least one error. For accounting or order intake, what counts is the document, not the field: a purchase order with 29 correct values and one wrong one isn’t a 97%-correct order. It’s a wrong one.

Deception 2: benchmark documents are not your documents

Accuracy figures come from measurements on evaluation datasets — often clean, well-structured standard layouts. Real inbound documents look different: multi-line item descriptions, tables running across page breaks with carry-over subtotals, two “total” lines (with and without VAT), discounts in footnotes, Swiss number formats with apostrophes as thousands separators, the occasional crooked scan.

The drop in accuracy from benchmark to live operation is regularly in the double-digit percentage range. That is not because the models are bad, but because the measurement conditions have little to do with your own document mix. The consequence for any evaluation: the only accuracy that counts is the one measured on your own documents, against a gold set of real documents with manually defined target results, including senders the system never saw during setup. Everything else is brochure poetry.
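A gold-set measurement of this kind could be sketched as follows. This is an illustration under stated assumptions, not a finished evaluation harness: each document is represented as a flat mapping from field name to string value, and the names `evaluate`, `gold`, and `predicted` are hypothetical. It reports field accuracy and document accuracy side by side, so the gap between the two becomes visible.

```python
from typing import Mapping, Sequence

Fields = Mapping[str, str]  # assumed representation: field name -> extracted value


def evaluate(gold: Sequence[Fields], predicted: Sequence[Fields]) -> tuple[float, float]:
    """Return (field_accuracy, document_accuracy) measured against a gold set."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")

    total_fields = correct_fields = correct_docs = 0
    for expected, actual in zip(gold, predicted):
        hits = sum(1 for name, value in expected.items() if actual.get(name) == value)
        total_fields += len(expected)
        correct_fields += hits
        # A document counts as correct only if every expected field matches.
        correct_docs += hits == len(expected)

    return correct_fields / total_fields, correct_docs / len(gold)


gold = [
    {"order_no": "4711", "total": "12.80"},
    {"order_no": "4712", "total": "99.00"},
]
predicted = [
    {"order_no": "4711", "total": "12.08"},  # plausibly wrong value
    {"order_no": "4712", "total": "99.00"},
]
print(evaluate(gold, predicted))  # (0.75, 0.5)
```

The sketch compares values as exact strings and ignores extra fields the system returns beyond the gold set. A real evaluation would need a rule for both.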

Deception 3: the most dangerous error looks correct

Classic OCR errors were often visible — a mangled character, an empty field. Modern language models enable a different, more insidious class of error: plausibly wrong values. 12.80 becomes 12.08; an item number gets “corrected” by one digit; a missing quantity is filled in with a likely one. Nothing about it looks wrong. That’s precisely why such errors are practically impossible to find by eyeballing — and a system that offers only the model’s confidence score as protection has no protection: hallucinations often arrive with high confidence.

The right question: how does the system detect its own errors?

From these three observations follows the quality target that belongs in production. It is not “100% raw accuracy” — that is structurally unattainable on free-form layouts, and as a promise it’s a red flag. It is:

Zero undetected errors among the documents that are transferred automatically.

Errors are allowed to happen. They are just not allowed to happen unnoticed. This is achievable not through a better model, but through architecture — three mechanisms that work independently of the model:

Deterministic arithmetic checks. Business documents carry their own checksum: quantity × price less discount must equal the line total, and the line totals must match the document total. Transposed digits violate these invariants almost every time — math finds them more reliably than any review. The precondition: the model only transcribes what’s in the document; all calculation happens outside the model, in ordinary, testable code.
Grounding against the source. Every extracted value must be traceable, verbatim, to the raw text of the document. Whatever can’t be substantiated counts as not extracted. This catches hallucinations even in fields that can’t be recalculated — addresses, descriptions, dates.
Routing instead of hope. Documents that pass all checks flow through automatically. All others land in a review queue — pre-filled, checked in seconds. The honest operating metric is therefore not “accuracy” but the straight-through processing rate at zero undetected errors: typically 70 to 90 percent at the start, rising with every correction that flows back.
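The three mechanisms can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Line` schema, the function names, discounts as absolute amounts, rounding to cents with half-up, a document total that equals the sum of the line totals (so no VAT), and grounding as a whitespace-normalized substring match.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


@dataclass(frozen=True)
class Line:
    """One transcribed line item (hypothetical schema)."""
    quantity: Decimal
    unit_price: Decimal
    discount: Decimal    # absolute amount as printed, assumed for simplicity
    line_total: Decimal


def arithmetic_ok(lines: list[Line], document_total: Decimal) -> bool:
    """Recompute the document's invariants in plain code, outside the model."""
    for line in lines:
        expected = (line.quantity * line.unit_price - line.discount).quantize(
            CENT, rounding=ROUND_HALF_UP  # rounding rule is an assumption
        )
        if expected != line.line_total:
            return False
    return sum((line.line_total for line in lines), Decimal("0")) == document_total


def grounded(value: str, raw_text: str) -> bool:
    """True if the value occurs verbatim in the raw text, ignoring whitespace runs."""
    needle = " ".join(value.split())
    return bool(needle) and needle in " ".join(raw_text.split())


def route(lines, document_total, text_fields, raw_text) -> str:
    """Return "auto" only if every check passes, otherwise "review"."""
    checks_pass = arithmetic_ok(lines, document_total) and all(
        grounded(value, raw_text) for value in text_fields.values()
    )
    return "auto" if checks_pass else "review"


raw = "Order PO-4711\n2 x 12.80 = 25.60\nTotal 25.60"
correct = [Line(Decimal("2"), Decimal("12.80"), Decimal("0"), Decimal("25.60"))]
misread = [Line(Decimal("2"), Decimal("12.08"), Decimal("0"), Decimal("25.60"))]

print(route(correct, Decimal("25.60"), {"order_no": "PO-4711"}, raw))  # auto
print(route(misread, Decimal("25.60"), {"order_no": "PO-4711"}, raw))  # review
print(route(correct, Decimal("25.60"), {"order_no": "PO-4771"}, raw))  # review
```

The misread price 12.08 fails the line-total check, and the order number that is off by one digit fails grounding. A plain substring match is deliberately crude: it would also accept a value that happens to occur inside a longer token, so a production check would likely match against token or field boundaries.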

What this architecture looks like as a whole is covered in the guide “Getting PDF documents into your ERP automatically”; the deep technical dive follows in “Anatomy of a document pipeline.”

Six questions for every vendor (or for your own project)

If you’re evaluating extraction solutions, replace the accuracy question with these six:

What does the accuracy figure refer to — fields or complete documents? Measured on which dataset?
How will it be measured on our documents? Is there a pilot with a gold set of real documents — including senders the system doesn’t know?
Which deterministic checks run after extraction? Arithmetic, master data matching, source traceability?
What happens to uncertain documents? Is there a review queue, and how do corrections flow back into the system?
How is performance measured in production? Straight-through rate, errors per sender, undetected errors among automatically transferred documents?
What does an undetected error cost us? (This question goes to your own company — it determines how conservatively the thresholds should be calibrated.)
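For the fifth question, the production metrics can be computed from a simple log of processed documents. The following Python sketch assumes a hypothetical `Outcome` record per document. In particular, it assumes you learn after the fact whether a document contained an error, either because a reviewer corrected it in the queue or because it surfaced later in an audit sample or a complaint.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Processing result for one document (hypothetical log record)."""
    sender: str
    routed_auto: bool  # passed all checks and was transferred automatically
    had_error: bool    # corrected in review, or found later in audit or complaint


def production_metrics(outcomes: list[Outcome]) -> dict:
    """Straight-through rate, errors per sender, undetected errors."""
    if not outcomes:
        raise ValueError("no outcomes to evaluate")
    auto = [o for o in outcomes if o.routed_auto]
    return {
        "straight_through_rate": len(auto) / len(outcomes),
        "errors_per_sender": dict(Counter(o.sender for o in outcomes if o.had_error)),
        # Errors that were transferred automatically: the target for this number is zero.
        "undetected_errors": sum(1 for o in auto if o.had_error),
    }


log = [
    Outcome("Sender A", routed_auto=True, had_error=False),
    Outcome("Sender A", routed_auto=True, had_error=True),
    Outcome("Sender B", routed_auto=False, had_error=True),
    Outcome("Sender B", routed_auto=True, had_error=False),
]
print(production_metrics(log))
# {'straight_through_rate': 0.75, 'errors_per_sender': {'Sender A': 1, 'Sender B': 1}, 'undetected_errors': 1}
```

The count of undetected errors is only as good as the audit behind it: errors nobody looks for never show up in `had_error`.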

A vendor — or an internal project team — with precise answers to these questions deserves trust. One that points to the 99% does not, yet.

Conclusion

“99% accuracy” is a marketing metric: measured on someone else’s documents, referring to fields rather than documents, and blind to the error class that really hurts — plausibly wrong values. Document automation becomes dependable through a different target: zero undetected errors among the documents transferred automatically, achieved through deterministic validation, source grounding, and clean routing. Measure it that way, and you can stand behind the automation — in front of management, auditors, and your own sales support team.

kitun builds document pipelines on exactly this principle: the model transcribes, deterministic code verifies, uncertain documents go to humans, all on-premise and without per-document pricing. What that looks like on your own documents can be worked out in a 20-minute intro call.
