Automatic Data Extraction from Documents with Amazon Textract - OCR, Table Analysis, and Form Recognition
Learn how to extract text from documents, analyze table structures, and extract key-value pairs from forms using Textract.
Overview of Textract
Amazon Textract is an OCR service that automatically extracts text, tables, and form data from documents. Its asynchronous operations accept PDFs of up to 3,000 pages, and image files (JPEG, PNG) are limited to 10 MB. Traditional OCR only recognizes characters and their positions, whereas Textract also understands the row-column structure of tables and the relationships between form labels and their values. The Queries feature extracts answers to specific questions from a document, and AnalyzeExpense structures invoice line items.
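As a minimal sketch of plain text extraction, the snippet below calls DetectDocumentText through boto3 and collects the LINE blocks from the response. The helper names detect_text and extract_lines are illustrative, not part of the service, and the call assumes that AWS credentials and a default region are already configured.

```python
def extract_lines(response):
    """Return the text of every LINE block, in the order Textract returned them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def detect_text(image_bytes):
    """Send raw image bytes to DetectDocumentText (synchronous, single page)."""
    import boto3  # imported lazily so extract_lines works without AWS access

    client = boto3.client("textract")
    return client.detect_document_text(Document={"Bytes": image_bytes})
```

A typical use would be extract_lines(detect_text(open("scan.png", "rb").read())), where scan.png is a placeholder file name.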
Table Analysis and Queries
The AnalyzeDocument API's Tables feature recognizes table rows and columns and returns cell contents as structured data, including merged cells and header rows. The Forms feature automatically pairs form labels (such as "Name," "Address," "Phone Number") with their corresponding values. Queries lets you pass natural language questions with the request, such as "What is the patient's name?" or "What is the total amount?", and returns the matching answers from the document.
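The sketch below shows one way to turn an AnalyzeDocument response into plain Python structures. It assumes the documented block layout, in which TABLE blocks point to CELL blocks through CHILD relationships and QUERY blocks point to QUERY_RESULT blocks through ANSWER relationships. The function names are illustrative, and merged cells are not given special treatment here.

```python
def _related_ids(block, rel_type="CHILD"):
    """Collect the IDs a block references through one relationship type."""
    ids = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == rel_type:
            ids.extend(rel["Ids"])
    return ids


def _cell_text(cell, by_id):
    """Join the WORD children of a cell into one string."""
    children = [by_id[i] for i in _related_ids(cell) if i in by_id]
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def tables_as_rows(response):
    """Return each table as a list of rows, each row a list of cell strings."""
    blocks = response["Blocks"]
    by_id = {b["Id"]: b for b in blocks}
    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = [by_id[i] for i in _related_ids(table) if i in by_id]
        cells = [c for c in cells if c["BlockType"] == "CELL"]
        n_rows = max((c["RowIndex"] for c in cells), default=0)
        n_cols = max((c["ColumnIndex"] for c in cells), default=0)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in cells:
            # RowIndex and ColumnIndex are 1-based in the response.
            grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = _cell_text(c, by_id)
        tables.append(grid)
    return tables


def query_answers(response):
    """Map each question text to its first answer, or None if unanswered."""
    blocks = response["Blocks"]
    by_id = {b["Id"]: b for b in blocks}
    answers = {}
    for query in (b for b in blocks if b["BlockType"] == "QUERY"):
        results = [by_id[i] for i in _related_ids(query, "ANSWER") if i in by_id]
        answers[query["Query"]["Text"]] = results[0]["Text"] if results else None
    return answers


def analyze(image_bytes, questions):
    """Request table analysis and query answers in a single call."""
    import boto3  # lazy import: the parsers above need no AWS access

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["TABLES", "QUERIES"],
        QueriesConfig={"Queries": [{"Text": q} for q in questions]},
    )
```

Keeping the parsing separate from the API call makes it easy to run the same helpers over saved responses, for example when re-processing results without paying for another analysis.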
AnalyzeExpense and Lending
The AnalyzeExpense API specializes in invoices and receipts, extracting vendor names, invoice dates, total amounts, tax amounts, and line items (item names, quantities, unit prices) as structured data. It handles handwritten receipts and multi-page invoices, which makes it well suited to automating expense reporting. AnalyzeLending, invoked through the asynchronous StartLendingAnalysis operation, specializes in loan documents such as mortgage applications: it first classifies each page by document type (application forms, income statements, property appraisals) and then extracts the fields relevant to that type. The asynchronous API (StartDocumentAnalysis) enables batch processing of large volumes of documents stored in S3; results are retrieved with GetDocumentAnalysis or can be written to an S3 bucket. Integration with A2I (Augmented AI) lets you route low-confidence extraction results to human review workflows.
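To make the expense workflow concrete, here is a hedged sketch that flattens an AnalyzeExpense response and flags low-confidence fields for review. The 80.0 threshold and the function names are assumptions chosen for illustration. The A2I hand-off itself is not shown, since it depends on a human review workflow that you configure separately.

```python
def summarize_expense(response, min_confidence=80.0):
    """Return (fields, needs_review) from the summary fields of a response.

    fields maps the normalized field type (for example VENDOR_NAME or TOTAL)
    to the detected value; needs_review lists field types whose value
    confidence is below min_confidence (an assumed, tunable threshold).
    """
    fields, needs_review = {}, []
    for doc in response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            name = field.get("Type", {}).get("Text", "OTHER")
            value = field.get("ValueDetection", {})
            fields[name] = value.get("Text", "")
            if value.get("Confidence", 0.0) < min_confidence:
                needs_review.append(name)
    return fields, needs_review


def line_items(response):
    """Return one dict per line item, keyed by field type (ITEM, PRICE, ...)."""
    items = []
    for doc in response.get("ExpenseDocuments", []):
        for group in doc.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                items.append({
                    f["Type"]["Text"]: f["ValueDetection"]["Text"]
                    for f in item.get("LineItemExpenseFields", [])
                })
    return items


def analyze_receipt(bucket, key):
    """Run AnalyzeExpense on a document stored in S3 (names are placeholders)."""
    import boto3  # lazy import: the parsers above need no AWS access

    client = boto3.client("textract")
    return client.analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
```

Anything listed in needs_review is a candidate for the human review loop; fields above the threshold can flow straight into the expense system.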
Optimizing Textract Costs
Textract pricing depends on the API and the number of pages processed, and it varies by region and volume tier. DetectDocumentText (text extraction only) costs approximately $1.50 per 1,000 pages. AnalyzeDocument costs approximately $15 per 1,000 pages for table analysis, with form analysis priced higher, and Queries add approximately $0.015 per page. AnalyzeExpense costs approximately $10 per 1,000 pages. To control costs, use DetectDocumentText when plain text is sufficient and reserve AnalyzeDocument for pages that need table or form structure analysis. Pre-processing documents to drop unnecessary pages (blank pages, cover pages) reduces the number of billed pages. The asynchronous API is priced the same as the synchronous API, but it helps avoid throttling during large-scale processing.
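The effect of choosing the cheaper API where possible can be checked with simple arithmetic. The sketch below uses the approximate per-1,000-page prices quoted in this section as constants. They are illustrative only, so substitute current prices for your region before relying on the numbers.

```python
# Approximate prices in USD per 1,000 pages, taken from the figures above.
PRICE_PER_1000_PAGES = {
    "DetectDocumentText": 1.50,
    "AnalyzeDocument": 15.00,
    "AnalyzeExpense": 10.00,
}


def estimate_cost(api, pages):
    """Estimate the cost in USD of processing `pages` pages with one API."""
    return PRICE_PER_1000_PAGES[api] * pages / 1000


def estimate_split_cost(total_pages, structured_fraction):
    """Cost when only a fraction of pages needs table analysis.

    Pages that need structure go to AnalyzeDocument; the rest go to
    DetectDocumentText.
    """
    structured = round(total_pages * structured_fraction)
    plain = total_pages - structured
    return (estimate_cost("AnalyzeDocument", structured)
            + estimate_cost("DetectDocumentText", plain))
```

For 100,000 pages, sending everything through AnalyzeDocument comes to about $1,500, while routing only the 20% of pages that contain tables to AnalyzeDocument comes to about $420 under these assumed prices.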
Summary
Textract is an advanced OCR service that goes beyond text extraction to understand table structures and form key-value pairs. Queries extract answers to specific questions from documents, and AnalyzeExpense structures invoice line items. AnalyzeLending automates classification and extraction of loan documents, and A2I integration routes low-confidence results to human review workflows.