MLSD Case Study: Document Understanding & Enterprise NLP Classification
Production document AI for invoices and contracts: OCR, layout-aware encoders, calibrated per-field extraction, human review routing, and template drift. Covers hybrid rule/ML cascades, per-field F1 under imbalance, and audit-grade logging — the system design interviewers expect beyond flatten-all-text BERT.
Why Document Understanding Is a Systems Problem, Not a Fine-Tuned BERT
Candidates often jump to: "We run OCR, chunk the text, embed it with a transformer, and classify." That misses the dominant production constraints: layout, tables, handwriting and scan noise, long documents that exceed context windows, hierarchical schemas (line items nested under invoices), and legal-grade audit trails for every field extraction.
Layout matters more than most NLP prep covers. The same words in different spatial positions mean different things on a tax form. Production stacks therefore separate visual parsing (detecting blocks, tables, checkboxes) from semantic labeling (mapping spans to schema keys). Academic and industrial document understanding models (layout-aware transformers, graph-based parsers) exist because token order from left-to-right OCR destroys structural cues that humans use instantly.
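To make the visual-parsing side concrete, here is a minimal line-grouping pass over OCR tokens. This is an illustrative sketch, not any vendor's pipeline: the bounding-box convention follows the LayoutLM-style 0–1000 normalized coordinates, and the vertical tolerance is an arbitrary choice.

```python
from dataclasses import dataclass

@dataclass
class OcrToken:
    text: str
    # Normalized bounding box (x0, y0, x1, y1) on a 0-1000 grid, the
    # convention layout-aware transformers use for 2-D position embeddings.
    bbox: tuple

def group_into_lines(tokens, y_tol=10):
    """Visual parsing step: cluster tokens into reading-order lines by
    vertical position, then sort each line left to right. This keeps the
    spatial grouping that naive left-to-right OCR flattening destroys."""
    lines = []
    for tok in sorted(tokens, key=lambda t: (t.bbox[1], t.bbox[0])):
        if lines and abs(tok.bbox[1] - lines[-1][-1].bbox[1]) <= y_tol:
            lines[-1].append(tok)
        else:
            lines.append([tok])
    return [sorted(line, key=lambda t: t.bbox[0]) for line in lines]

# The same word in two spatial positions: "Total" in the title region vs.
# next to an amount at the bottom of the page carries different meaning.
tokens = [
    OcrToken("Total", (700, 850, 760, 870)),    # bottom-right: amount label
    OcrToken("1,240.00", (800, 850, 880, 870)),
    OcrToken("Invoice", (50, 40, 120, 60)),     # top-left: document title
    OcrToken("Total", (130, 40, 180, 60)),
]
lines = group_into_lines(tokens)
```

Downstream semantic labeling then maps line- and block-level spans to schema keys, instead of classifying a flat token stream.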
Multi-field extraction is multi-task with asymmetric cost. Missing a total amount on an invoice may be worse than misreading a memo line. Interviewers expect per-field precision/recall targets, not a single document-level accuracy number. Calibration matters because confidence gates route uncertain extractions to expensive human reviewers — uncalibrated softmax scores silently waste reviewer time or auto-approve errors.
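Per-field metrics and the confidence gate can be sketched as follows. The field names and thresholds are hypothetical placeholders, and the gate assumes the scores have already been calibrated upstream.

```python
def per_field_metrics(predictions, gold):
    """Per-field precision/recall over extracted values, rather than one
    document-level accuracy. Each element of predictions/gold is a dict
    mapping field name -> extracted value (absent key = no extraction)."""
    fields = {f for doc in gold for f in doc} | {f for doc in predictions for f in doc}
    metrics = {}
    for f in sorted(fields):
        tp = fp = fn = 0
        for pred, ref in zip(predictions, gold):
            p, g = pred.get(f), ref.get(f)
            if p is not None and p == g:
                tp += 1
            else:
                if p is not None:
                    fp += 1  # extracted a wrong or spurious value
                if g is not None:
                    fn += 1  # missed or corrupted a gold value
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[f] = (precision, recall)
    return metrics

# Hypothetical per-field gates encoding asymmetric cost: a wrong invoice
# total is far more expensive than a wrong memo line, so it needs a much
# higher calibrated confidence to bypass human review.
THRESHOLDS = {"total_amount": 0.99, "memo": 0.70}

def route(field, confidence, default_threshold=0.95):
    """Auto-approve only when the calibrated score clears the field's
    threshold; otherwise send the extraction to a human reviewer."""
    gate = THRESHOLDS.get(field, default_threshold)
    return "auto_approve" if confidence >= gate else "human_review"
```

Note that the gate is only as good as the calibration behind it: with overconfident softmax scores, the same thresholds silently auto-approve errors.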
Template and vendor drift breaks naive supervised models quietly. When a supplier changes its PDF layout, F1 on the aggregate dashboard can look stable while a specific field collapses. Strong designs version layout fingerprints, monitor slice metrics by vendor, and keep hybrid rule/ML fallbacks for v1 coverage.
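Both monitoring ideas fit in a few lines. The grid-hash fingerprint below is one simple scheme among many (the grid size is arbitrary), and the vendor names and counts are invented for illustration.

```python
import hashlib
from collections import defaultdict

def layout_fingerprint(blocks, grid=4):
    """Hash the coarse grid cells that detected layout blocks occupy
    (bboxes on a 0-1000 grid). When a vendor changes its PDF template,
    blocks move across cells and the fingerprint changes, flagging drift
    before field-level F1 quietly degrades."""
    cells = sorted({(int(x0 * grid / 1000), int(y0 * grid / 1000))
                    for (x0, y0, x1, y1) in blocks})
    return hashlib.sha1(repr(cells).encode()).hexdigest()[:12]

def f1_by_vendor(records):
    """Slice F1 per vendor from per-document (vendor, tp, fp, fn) counts.
    The aggregate number can look stable while one vendor's slice collapses,
    so the dashboard should surface each slice separately."""
    agg = defaultdict(lambda: [0, 0, 0])
    for vendor, tp, fp, fn in records:
        agg[vendor][0] += tp
        agg[vendor][1] += fp
        agg[vendor][2] += fn
    return {vendor: (2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
            for vendor, (tp, fp, fn) in agg.items()}
```

Storing the fingerprint alongside each prediction lets the fallback cascade route unseen-layout documents to rules or human review instead of a model trained on the old template.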
References: Google Cloud Document AI documents OCR plus structured extraction; Amazon Textract describes similar pipelines; Xu et al., LayoutLM: Pre-training of Text and Layout for Document Image Understanding (arXiv:1912.13318) motivates why plain BERT on flattened OCR is a weak baseline for forms.