
Document Data Extraction

AI document processing as a step in a governed workflow: OCR, LLM extraction, classification, rule-based reconciliation, and human review.

Document data extraction means producing structured fields (dates, amounts, identifiers, line items) from unstructured files such as invoices, acts, contracts, forms, and identity documents. Unlike plain recognition, it is a multi-layer workflow: text recognition (OCR), field extraction by description (rules or an LLM), document-type classification, reconciliation against rules and master data, and routing of contested cases to a human.
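As a rough illustration of the rule-based end of that workflow, the sketch below takes text that OCR has already produced, classifies the document by keyword, and pulls three fields with regular expressions. The keyword table, the patterns, the `ExtractedDocument` type, and the sample text are all hypothetical and assume one known invoice layout; a real pipeline would replace them with its own templates or an LLM step.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal
from typing import Optional


@dataclass
class ExtractedDocument:
    doc_type: str
    number: Optional[str]
    issued_on: Optional[date]
    total: Optional[Decimal]


# Hypothetical keyword rules for document-type classification.
TYPE_KEYWORDS = {"invoice": "invoice", "contract": "agreement"}

# Hypothetical field patterns for one known invoice layout.
NUMBER_RE = re.compile(r"Invoice\s+No\.?\s*([A-Z0-9-]+)", re.IGNORECASE)
DATE_RE = re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})")
TOTAL_RE = re.compile(r"Total:\s*([0-9]+(?:\.[0-9]{2})?)")


def classify(text: str) -> str:
    """Return the first document type whose keyword occurs in the text."""
    lowered = text.lower()
    for doc_type, keyword in TYPE_KEYWORDS.items():
        if keyword in lowered:
            return doc_type
    return "unknown"


def extract(text: str) -> ExtractedDocument:
    """Turn OCR text into structured fields; missing fields stay None."""
    number = NUMBER_RE.search(text)
    issued = DATE_RE.search(text)
    total = TOTAL_RE.search(text)
    return ExtractedDocument(
        doc_type=classify(text),
        number=number.group(1) if number else None,
        issued_on=(
            datetime.strptime(issued.group(1), "%Y-%m-%d").date() if issued else None
        ),
        total=Decimal(total.group(1)) if total else None,
    )


# Stand-in for the output of an OCR step.
ocr_text = "INVOICE No. INV-2024-017\nDate: 2024-03-15\nTotal: 1250.00"
doc = extract(ocr_text)
print(doc)
```

Fields that a pattern cannot find are left as `None` rather than guessed, so a later step can decide whether the document goes to a human.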

Where the OCR vs LLM line sits: OCR turns an image into text and is sufficient for clean, uniform forms: add rules and the task is closed without an LLM. LLM extraction is needed for heterogeneous documents: varying invoice templates, free-form text, atypical wording. Production systems run a stack: OCR for the rough work on text, the LLM for the fine work on meaning, and deterministic reconciliation to catch what shouldn’t be trusted. Relying on one layer alone is a common reason behind “our recognition works but accounting still gets errors”.
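The deterministic reconciliation layer can be sketched as a set of plain checks that run on the extracted fields regardless of whether OCR rules or an LLM produced them. The example below assumes a hypothetical field layout (`supplier_id`, `total`, `line_items`) and a hypothetical in-memory set standing in for master data; it is a minimal sketch, not a complete rule set.

```python
from decimal import Decimal

# Hypothetical master data: supplier identifiers known to the accounting system.
KNOWN_SUPPLIER_IDS = {"SUP-001", "SUP-002"}


def reconcile(fields: dict) -> list:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []

    # Arithmetic check: line items must add up to the stated total.
    line_sum = sum((item["amount"] for item in fields["line_items"]), Decimal("0"))
    if line_sum != fields["total"]:
        problems.append(
            f"line items sum to {line_sum}, but total is {fields['total']}"
        )

    # Master-data check: the supplier must already exist.
    if fields["supplier_id"] not in KNOWN_SUPPLIER_IDS:
        problems.append(f"supplier {fields['supplier_id']} not found in master data")

    return problems


extracted = {
    "supplier_id": "SUP-001",
    "total": Decimal("120.00"),
    "line_items": [{"amount": Decimal("100.00")}, {"amount": Decimal("25.00")}],
}
problems = reconcile(extracted)
print(problems)
```

Amounts are kept as `Decimal` rather than `float` so that the sum comparison is exact, which matters when a one-cent mismatch is itself the signal that a digit was misread.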

Human review isn’t an “optional feature”: every automated pipeline produces errors, and the only question is whether you see them before they reach accounting. Hence the explicit confidence thresholds, a manual review queue, and per-document tracing of “which field was extracted how and from which line”. See AI document processing and process automation.
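One way to picture the thresholds, the review queue, and the per-field trace together is the sketch below. Each extracted field carries its confidence, the method that produced it, and the source line, and a document is routed to manual review as soon as any field falls below its threshold. The `FieldResult` type, the threshold values, and the queue names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the extractor
    method: str        # how the value was obtained, e.g. "ocr+rule" or "llm"
    source_line: int   # line of the OCR text the value came from


# Hypothetical thresholds: amounts are held to a stricter bar than other fields.
THRESHOLDS = {"total": 0.98}
DEFAULT_THRESHOLD = 0.90


def needs_review(fields):
    """Return the fields whose confidence is below their threshold."""
    return [
        f for f in fields
        if f.confidence < THRESHOLDS.get(f.name, DEFAULT_THRESHOLD)
    ]


def route(fields):
    """Send the document to manual review if any field is below threshold."""
    flagged = needs_review(fields)
    if flagged:
        return "manual_review", flagged
    return "auto_approve", []


fields = [
    FieldResult("number", "INV-2024-017", 0.99, "ocr+rule", 1),
    FieldResult("total", "1250.00", 0.95, "llm", 3),
]
queue, flagged = route(fields)
for f in flagged:
    # The trace tells the reviewer which field to check, how it was obtained, and where.
    print(f"{queue}: {f.name}={f.value} via {f.method}, line {f.source_line}")
```

Because the method and source line travel with each value, a reviewer can jump straight to the contested line instead of rereading the whole document.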
