Amazon Textract
A machine learning-based document analysis service that automatically extracts text, handwriting, tables, and form key-value pairs from documents
Overview
Amazon Textract is a machine learning service that automatically extracts text and structured data from scanned documents, PDFs, and images. Going beyond simple OCR (optical character recognition), it recognizes the row-column structure of tables, the label-value relationships in forms, signatures, and handwriting. It also provides specialized APIs optimized for business documents: AnalyzeExpense for invoices and receipts, AnalyzeID for identity documents, and Analyze Lending (started through the StartLendingAnalysis operation) for loan-related documents.
Structured Data Extraction Beyond OCR
Traditional OCR reads the text on a page from top to bottom, left to right, with no notion of which cells form a table row or which value belongs to which form label. Textract analyzes the page layout and the spatial relationships between text blocks before extracting data. The DetectDocumentText API performs plain text extraction and returns the result as a hierarchy of lines (LINE) and words (WORD). The AnalyzeDocument API additionally detects tables (TABLES), forms (FORMS), layout (LAYOUT), and signatures (SIGNATURES). For tables, it returns each cell's row index, column index, and row and column span, so merged cells can be reconstructed. Form extraction automatically maps key-value pairs such as 'Name: John Smith'. Every extracted block carries a confidence score from 0 to 100, which enables workflows that route low-confidence results to human review. Language coverage is a limitation: Textract's documented language support covers English, Spanish, German, Italian, French, and Portuguese, and Japanese is not on that list. Recognition of Japanese text is therefore unreliable, handwritten Japanese especially, so testing with representative sample documents beforehand is important.
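The key-value mapping and confidence-based routing described above can be sketched as a small parser over an AnalyzeDocument response. This is a minimal illustration, not a reference implementation: the function names, the 90.0 threshold, and the sample response are assumptions made for the example, while the block fields (BlockType, EntityTypes, Relationships, Confidence) follow the response shape Textract returns for the FORMS feature.

```python
from typing import Dict, Tuple

# Hand-written sample shaped like an AnalyzeDocument (FORMS) response.
# The values are invented for illustration only.
SAMPLE_RESPONSE = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"], "Confidence": 98.0,
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]}, {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"], "Confidence": 97.0,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "k2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"], "Confidence": 95.0,
     "Relationships": [{"Type": "VALUE", "Ids": ["v2"]}, {"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "v2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"], "Confidence": 61.0,
     "Relationships": [{"Type": "CHILD", "Ids": ["w5"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "John"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Smith"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Date:"},
    {"Id": "w5", "BlockType": "WORD", "Text": "2O24"},
]}


def _child_text(block: dict, blocks_by_id: Dict[str, dict]) -> str:
    """Join the WORD (or checkbox status) children of a KEY or VALUE block."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(
    response: dict, min_confidence: float = 90.0
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split form fields into (accepted, needs_review) by confidence.

    A pair's confidence is taken as the lower of its KEY and VALUE scores;
    that rule is a choice made for this sketch, not something Textract defines.
    """
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    accepted: Dict[str, str] = {}
    needs_review: Dict[str, str] = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        key_text = _child_text(block, blocks_by_id)
        value_text = ""
        confidence = block.get("Confidence", 0.0)
        for rel in block.get("Relationships", []):
            if rel["Type"] != "VALUE":
                continue
            for value_id in rel["Ids"]:
                value_block = blocks_by_id[value_id]
                value_text = _child_text(value_block, blocks_by_id)
                confidence = min(confidence, value_block.get("Confidence", 0.0))
        target = accepted if confidence >= min_confidence else needs_review
        target[key_text] = value_text
    return accepted, needs_review


accepted, needs_review = extract_key_values(SAMPLE_RESPONSE)
print("accepted:", accepted)
print("needs review:", needs_review)
```

In a real pipeline the `needs_review` dictionary would be what gets handed to a human reviewer, and the threshold would be tuned against sample documents rather than fixed at 90.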
Asynchronous Processing and High-Volume Document Pipeline Design
Textract offers two types of APIs: synchronous and asynchronous. Synchronous APIs return results in real time for single-page input such as JPEG or PNG images, but they do not support multi-page PDFs. Asynchronous APIs (StartDocumentTextDetection / StartDocumentAnalysis) process multi-page PDFs stored in S3, handling up to 3,000 pages per document. Textract publishes job completion to an SNS topic, and results are then retrieved with GetDocumentTextDetection / GetDocumentAnalysis, paging through the output with NextToken. For high-volume document processing pipelines, the standard architecture triggers a Lambda function on S3 upload to call Textract's asynchronous API, and a separate Lambda function receives the SNS notification to retrieve and post-process the results. Textract limits the number of concurrent asynchronous jobs per region (default 25; the exact quota depends on the region and is listed in Service Quotas), so when many documents arrive at once, an SQS queue in front of the job-submitting Lambda is needed for flow control. For post-processing, practical patterns include combining Textract with Amazon Comprehend to extract and classify named entities (person names, organization names, dates, amounts) and integrating Amazon Augmented AI (A2I) for human review workflows.
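The submit, notify, and retrieve flow can be sketched as three small functions, roughly one per Lambda responsibility. This is a hedged sketch under stated assumptions: the function names are hypothetical, `client` is expected to be a Textract client created by the caller with `boto3.client("textract")`, and error handling is reduced to a single exception. The operation names and parameters (`start_document_analysis`, `get_document_analysis`, `NotificationChannel`, `NextToken`) are the boto3 equivalents of the APIs named above.

```python
import json
from typing import List, Optional, Tuple


def start_analysis(client, bucket: str, key: str,
                   sns_topic_arn: Optional[str] = None,
                   role_arn: Optional[str] = None) -> str:
    """Submit an asynchronous analysis job for a document in S3; return the JobId."""
    params = {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["TABLES", "FORMS"],
    }
    # Completion is only published to SNS when a channel is supplied.
    if sns_topic_arn and role_arn:
        params["NotificationChannel"] = {
            "SNSTopicArn": sns_topic_arn,
            "RoleArn": role_arn,
        }
    return client.start_document_analysis(**params)["JobId"]


def job_from_sns_event(event: dict) -> Tuple[str, str]:
    """Pull (job_id, status) out of the SNS event delivered to a Lambda function."""
    # The SNS message body is a JSON string written by Textract.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]


def collect_blocks(client, job_id: str) -> List[dict]:
    """Fetch every result page of a finished job by following NextToken."""
    blocks: List[dict] = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.get_document_analysis(**kwargs)
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError(f"job {job_id} is {page['JobStatus']}")
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return blocks
```

The first Lambda would call `start_analysis` for each uploaded object, and the second would call `job_from_sns_event` followed by `collect_blocks` when the status is `SUCCEEDED`. Reading submissions from an SQS queue with a bounded batch size is one way to keep the number of in-flight jobs under the regional quota.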
Business Document APIs and Accuracy Improvement Techniques
The AnalyzeExpense API specializes in invoices and receipts, extracting vendor name, invoice date, total amount, tax amount, and line items (item name, quantity, unit price, subtotal) as standardized fields. Because the fields are normalized regardless of how each vendor labels them, it typically gives better field recognition and cleaner structure than running the same document through the general-purpose AnalyzeDocument API. The AnalyzeID API extracts name, date of birth, address, document number, and expiration date from identity documents such as US driver's licenses and passports, which directly supports automating KYC (Know Your Customer) processes. For accuracy, input image quality matters most: images scanned at 150 DPI or higher, deskewed, and with sufficient contrast are recognized more reliably. Textract's Queries feature lets you ask natural language questions such as 'What is the invoice number?' or 'When is the payment due?' to extract specific pieces of information from a document. Pricing is per page on a pay-as-you-go basis and varies significantly by feature: at list prices at the time of writing, DetectDocumentText costs 1.50 USD per 1,000 pages, while AnalyzeDocument with tables and forms costs 65 USD per 1,000 pages. Prices also differ by region and volume tier.
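As an illustration of the Queries feature, the sketch below builds the request for a synchronous AnalyzeDocument call and reads the answers back out of the response. The helper names, aliases, and sample response are assumptions made for this example. The request fields (`FeatureTypes=["QUERIES"]`, `QueriesConfig`) and the QUERY / QUERY_RESULT block types linked by an ANSWER relationship follow the AnalyzeDocument API.

```python
from typing import Dict, Optional, Tuple


def build_query_request(bucket: str, key: str, questions: Dict[str, str]) -> dict:
    """Build keyword arguments for analyze_document; questions maps alias -> question text."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {
            "Queries": [
                {"Text": text, "Alias": alias} for alias, text in questions.items()
            ]
        },
    }


def parse_query_answers(response: dict) -> Dict[str, Optional[Tuple[str, float]]]:
    """Map each query alias to (answer text, confidence), or None if unanswered."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    answers: Dict[str, Optional[Tuple[str, float]]] = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        query = block["Query"]
        alias = query.get("Alias") or query["Text"]
        answers[alias] = None
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER" and rel["Ids"]:
                # Keep only the first answer; a query may have several candidates.
                result = blocks_by_id[rel["Ids"][0]]
                answers[alias] = (result["Text"], result["Confidence"])
                break
    return answers


# Hand-written response shaped like AnalyzeDocument output; values are invented.
SAMPLE_QUERY_RESPONSE = {"Blocks": [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the invoice number?", "Alias": "INVOICE_NUMBER"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["r1"]}]},
    {"Id": "r1", "BlockType": "QUERY_RESULT", "Text": "INV-0042", "Confidence": 96.5},
    {"Id": "q2", "BlockType": "QUERY",
     "Query": {"Text": "When is the payment due?", "Alias": "DUE_DATE"}},
]}

request = build_query_request(
    "example-bucket", "invoice.png",
    {"INVOICE_NUMBER": "What is the invoice number?",
     "DUE_DATE": "When is the payment due?"},
)
print(parse_query_answers(SAMPLE_QUERY_RESPONSE))
```

With a real client, the request would be sent as `boto3.client("textract").analyze_document(**request)` and the returned dictionary passed to `parse_query_answers`. Using aliases keeps downstream code independent of the exact question wording.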