There's an intuitive assumption in the AI industry right now: if reasoning models are getting smarter at math, code, and logic, they should also be getting better at reading documents. Feed a complex financial statement into a model with maximum reasoning tokens, and it should extract every table cell, every line item, every figure with near-perfect accuracy.
The assumption is wrong.
The overthinking problem
Recent benchmarks evaluating vision-language models at varying reasoning levels reveal a counterintuitive finding: more reasoning tokens do not improve document parsing accuracy. In evaluations across complex documents — dense tables, multi-column layouts, mathematical formulas, mixed text orientations — quality stayed flat at approximately 79% regardless of whether the model used zero, low, high, or maximum reasoning.
But cost and latency did not stay flat. Maximum reasoning increased processing time by 5× and cost by 8.5× compared to zero reasoning — for the same accuracy.
| Approach | Quality | Time | Cost / page |
|---|---|---|---|
| Single model, no reasoning | ~79% | ~48s | $0.029 |
| Single model, max reasoning | ~79% | ~242s | $0.246 |
| Pipeline + agentic validation | 97%+ | <10s | $0.013 |
The third row isn't hypothetical. It's the architecture we built at DocuLexis — and it outperforms monolithic reasoning models on quality, speed, and cost simultaneously.
Four ways reasoning models break documents
When we studied the failure modes in detail, we found that reasoning doesn't just fail to help — it actively introduces errors that don't exist in the base model. Here's what goes wrong:
1. Table hallucination
When a table contains blank cells, the base model (zero reasoning) transcribes them as empty — which is correct. The reasoning model decides those blanks are "mistakes" and fills them in with inferred values. An author's intentional shorthand becomes fabricated data. In one case, the model even mutated the characters it was reading — turning "4-CNY" into "4-CYN" — because it overrode what it saw with what it expected to see.
The core insight
Document parsing is a perception task, not a reasoning task. The goal is faithful reproduction — transcribing what exists on the page. When you add reasoning to a perception task, the model starts overriding what it sees with what it believes should be there.
2. Structural splitting
A single continuous table with section header rows gets split into three separate tables by the high-reasoning model. The model encounters a row where some columns are empty (because it's a section divider) and reasons: "this must be a boundary between tables." The base model, which simply transcribes what's visually present, keeps the table intact.
More thinking leads the model to override what it sees with what it thinks should be there.
3. Vision encoder bottleneck
When text is small, dense, or vertically oriented, the vision encoder loses the information during the image-encoding phase — before reasoning even starts. The model outputs "[Illegible]" regardless of how many reasoning tokens it uses. This is an encoding bottleneck, not a reasoning one: you can't reason your way past information that was never captured at the pixel level.
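The downsampling arithmetic makes this concrete. The sketch below assumes a hypothetical encoder input of 1024×1024 pixels and typical scan and font sizes; real encoders differ, but the effect is the same: once a character is rendered in only a handful of pixels, the glyph is ambiguous before any reasoning begins.

```python
# Back-of-the-envelope: how much pixel information survives encoding.
# Assumes a hypothetical vision encoder that resizes pages to 1024x1024.

def pixels_per_char(page_px: int, encoder_px: int, char_px: int) -> float:
    """Effective pixel height of a character after the encoder resize."""
    scale = encoder_px / page_px
    return char_px * scale

# A 300-dpi A4 scan is ~2480 px wide; 8 pt body text is ~33 px tall on it.
body = pixels_per_char(page_px=2480, encoder_px=1024, char_px=33)
# 6 pt footnote text in a dense table is ~25 px tall.
footnote = pixels_per_char(page_px=2480, encoder_px=1024, char_px=25)

print(f"8 pt text: ~{body:.1f} px per character")      # ~13.6 px: legible
print(f"6 pt text: ~{footnote:.1f} px per character")  # ~10.3 px: marginal
```

A dedicated OCR pass sidesteps this entirely by reading each zone at the scan's native resolution instead of a globally downsampled image.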
4. The parsing-vs-understanding trap
There's a crucial distinction between parsing ("what is written here?") and understanding ("what is this?"). Reasoning helps with understanding — it can describe logos, interpret visual elements, classify document types. But when applied to parsing, it backfires. Tables, formulas, and structured data need faithful transcription, not interpretation. Understanding fills in blanks. Parsing preserves them.
The pipeline alternative
The failure modes above share a common root cause: a single model is asked to handle vision, OCR, layout analysis, structural reasoning, and content extraction simultaneously. Every capability competes for the same token budget, and reasoning about structure contaminates faithful text extraction.
At DocuLexis, we decomposed this into specialized passes — each optimized for one job:
Pass 1 — Layout Detection. Vision models map the document into zones: tables, charts, text blocks, signatures, handwriting, stamps. Structural boundaries are established before a single character is read. This eliminates the "one table becomes three" problem entirely.
Pass 2 — Native OCR. Dedicated OCR engines read text at full pixel resolution within each zone. Small text, vertical orientation, dense tables — all captured without the compression loss that vision encoders introduce. This solves the "[Illegible]" problem.
Pass 3 — LLM Structuring. Language models organize the pre-extracted text into structured fields, tables, and hierarchies. The critical difference: the LLM works with text that has already been accurately read. It structures — it doesn't transcribe. This eliminates hallucination because the model can't override characters that were captured deterministically.
Pass 4 — Agentic Validation. Self-correcting agents cross-reference the structured output against raw OCR data. If the OCR read "4-CNY" and the structure says "4-CYN", the validator catches the mismatch. Anomalies are flagged, confidence scores are assigned, and low-confidence extractions are routed for human review.
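A minimal sketch of what such a cross-reference check might look like, using Python's standard `difflib` for character-level similarity. The threshold and function names are illustrative assumptions, not the DocuLexis implementation:

```python
import difflib

def validate_field(ocr_text: str, structured_text: str,
                   threshold: float = 1.0) -> tuple[float, bool]:
    """Cross-reference a structured field against the raw OCR read.

    Returns (confidence, needs_review): confidence is a 0-1 similarity
    score; needs_review is True when the field should be flagged.
    """
    confidence = difflib.SequenceMatcher(None, ocr_text, structured_text).ratio()
    return confidence, confidence < threshold

# OCR read "4-CNY"; suppose the structuring pass emitted "4-CYN".
conf, needs_review = validate_field("4-CNY", "4-CYN")
print(conf, needs_review)  # 0.8 True — mismatch caught, routed for review
```

With `threshold=1.0`, any character-level disagreement between the deterministic OCR read and the LLM's structured output gets flagged rather than silently trusted.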
The key insight: each component plays to its strengths. Vision models detect layout. OCR captures pixels. LLMs organize text. Agents validate integrity. No single model is asked to do everything.
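The decomposition can be sketched as a toy four-pass pipeline. Every function below is a hypothetical stand-in (the real passes are a vision model, an OCR engine, an LLM, and a validation agent), but the control flow mirrors the architecture: layout first, then OCR, then structuring, then validation:

```python
from dataclasses import dataclass

@dataclass
class Zone:
    kind: str       # "table", "text", "signature", ...
    text: str = ""

def detect_layout(page: dict) -> list[Zone]:
    # Pass 1 stand-in: a vision model would return typed zones here.
    return [Zone(kind=k) for k in page["zones"]]

def run_ocr(page: dict, zones: list[Zone]) -> list[Zone]:
    # Pass 2 stand-in: an OCR engine reads each zone at full resolution.
    for zone, text in zip(zones, page["text"]):
        zone.text = text
    return zones

def structure(zones: list[Zone]) -> dict:
    # Pass 3 stand-in: the LLM organizes pre-extracted text, never pixels.
    return {z.kind: z.text for z in zones}

def validate(doc: dict, zones: list[Zone]) -> dict:
    # Pass 4 stand-in: flag any field that drifted from the raw OCR read.
    doc["flags"] = [z.kind for z in zones if doc.get(z.kind) != z.text]
    return doc

def parse_page(page: dict) -> dict:
    zones = run_ocr(page, detect_layout(page))
    return validate(structure(zones), zones)

page = {"zones": ["header", "table"], "text": ["Q3 Report", "Revenue | 4-CNY"]}
print(parse_page(page))
# {'header': 'Q3 Report', 'table': 'Revenue | 4-CNY', 'flags': []}
```

The structural point is that `structure` only ever sees text that `run_ocr` has already committed to, so there are no pixels left for a language model to second-guess.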
Why this matters for enterprise
The "just use a bigger model" approach has three consequences that enterprise teams can't tolerate:
Cost. Maximum reasoning costs 8.5× more per page than zero reasoning — with no accuracy benefit. At enterprise volumes (millions of pages per month), that's the difference between a viable product and an unsustainable one.
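Using the per-page figures from the table above and a hypothetical volume of two million pages per month, the gap compounds quickly:

```python
# Monthly cost at enterprise volume, per-page prices from the table above.
PAGES_PER_MONTH = 2_000_000  # hypothetical enterprise workload

costs = {
    "single model, no reasoning": 0.029,
    "single model, max reasoning": 0.246,
    "pipeline + agentic validation": 0.013,
}

for approach, per_page in costs.items():
    print(f"{approach}: ${per_page * PAGES_PER_MONTH:,.0f}/month")
# max reasoning lands at $492,000/month; the pipeline at $26,000/month
```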
Reliability. Hallucinated table cells and split structures create downstream errors that compound. A fabricated value in a financial statement propagates through every model, report, and decision that touches it. Faithful extraction isn't a nice-to-have — it's a compliance requirement.
Latency. 242 seconds per page isn't acceptable for real-time workflows. Claims processing, KYC verification, and clinical data extraction need results in seconds, not minutes.
A pipeline architecture delivers all three: higher accuracy at lower cost with faster throughput. Not because it uses better prompts, but because it uses better architecture.
What we're building
DocuLexis is the production implementation of this architecture. Our eight-stage pipeline — from ingestion through delivery — runs specialized components in sequence, with agentic validation loops at every critical junction. The result: 97%+ field-level accuracy, sub-10-second processing, and a cost structure that scales linearly with volume.
We didn't arrive at this architecture by theory. We arrived at it by watching reasoning models fail on the documents our customers care about most — financial statements with nested tables, insurance claims with handwritten adjuster notes, clinical reports with multi-language annotations. Every failure mode taught us where specialized components outperform general-purpose reasoning.
The lesson is clear: document parsing isn't a reasoning problem. It's an engineering problem. And engineering problems deserve engineered solutions.
See the pipeline in action
Upload a test document and watch DocuLexis extract, validate, and structure the data in seconds.
Explore the platform →