AWS Textract gets you from "PDF blob" to "structured text with layout information" faster than almost anything else. It's also opaque, inconsistently documented, and priced in a way that surprises you at the end of the first month.
Here's what I wish I'd known before integrating it into a production document intelligence pipeline.
## The two Textract APIs are not interchangeable
Textract has two main APIs:
- **DetectDocumentText** — synchronous, single page, text-only. Fast and cheap.
- **StartDocumentAnalysis** / **GetDocumentAnalysis** — asynchronous, multi-page, returns tables and forms in addition to text.
The async API is the one you actually want for production, but the sync API is what most tutorials use. The difference matters: the async API requires an SNS topic for completion notifications (or polling), returns a pagination token for large documents, and has different quotas.
We started with the sync API for prototyping, and the migration to async took longer than expected because the response schemas are meaningfully different.
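For illustration, a minimal sketch of the sync call using the AWS SDK v3 aggregated client (the bucket and object names are placeholders):

```typescript
import { Textract } from "@aws-sdk/client-textract";

const textract = new Textract({});

// Sync path: one call, one page, text only. No job ID, no polling.
const response = await textract.detectDocumentText({
  Document: { S3Object: { Bucket: "my-docs-bucket", Name: "page-1.png" } },
});
// response.Blocks is a flat list of PAGE / LINE / WORD blocks.
```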
## Async means really async
StartDocumentAnalysis returns a job ID immediately. The actual analysis takes anywhere from 3 seconds to 3+ minutes depending on document size and AWS load. The docs say to poll GetDocumentAnalysis, but in practice, SNS notifications + SQS are the right architecture.
Our setup:
- StartDocumentAnalysis with SNSTopicArn configured
- Textract publishes to SNS when complete
- SNS → SQS → Lambda processes the result
This decouples the upload path from the processing path and handles the variance in processing time gracefully.
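A sketch of the kickoff call under that setup; the topic and role ARNs are placeholders, and the role must allow Textract to publish to the topic:

```typescript
import { Textract } from "@aws-sdk/client-textract";

const textract = new Textract({});

// Returns immediately with a JobId; the result arrives via SNS later.
const { JobId } = await textract.startDocumentAnalysis({
  DocumentLocation: { S3Object: { Bucket: "my-docs-bucket", Name: "report.pdf" } },
  FeatureTypes: ["TABLES"], // request "FORMS" too only if you need key-value pairs
  NotificationChannel: {
    SNSTopicArn: "arn:aws:sns:us-east-1:123456789012:textract-complete",
    RoleArn: "arn:aws:iam::123456789012:role/textract-sns-publish",
  },
});
```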
## Confidence scores are useful but not calibrated
Every text block Textract returns has a Confidence value (0–100). The intuition is right: high confidence means the text was clearly readable, low confidence means it was guessing. But the calibration is uneven.
In our testing:
- Cleanly printed documents: confidence 98–99.9 for most blocks — reliable
- Handwritten text: confidence wildly variable, often 60–80 for clearly readable text — not reliable
- Scanned documents with noise: confidence can drop on perfectly legible text
We ended up using confidence as a routing signal rather than a quality gate. Below 85: flag for human review. 85–95: run through LLM enrichment for validation. Above 95: auto-accept.
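The thresholds are ours, tuned on our corpus, not anything Textract prescribes. A minimal sketch of the routing logic:

```typescript
type Route = "human_review" | "llm_validation" | "auto_accept";

// Thresholds tuned on our document mix; treat them as starting points.
function routeByConfidence(confidence: number): Route {
  if (confidence < 85) return "human_review";
  if (confidence < 95) return "llm_validation";
  return "auto_accept";
}
```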
## FORMS vs. TABLES: know which feature you need
Textract has two analysis features beyond text extraction:
- FORMS: Extracts key-value pairs (label: value) from documents. Good for structured forms.
- TABLES: Extracts tabular data with row/column structure.
You pay extra for both. If you're just doing text extraction and don't need structured tables or forms, request neither feature (or use the cheaper text-detection API) to cut costs.
## The cost math
Textract pricing is per page, not per API call. At the time of writing:
- Text detection: $1.50/1000 pages
- Forms analysis: $50/1000 pages
- Tables analysis: $15/1000 pages
Tables are cheap relative to forms. But if you're processing high volume, the forms pricing adds up fast: at 100,000 pages a month, text detection alone runs $150, while FORMS analysis on the same volume runs $5,000. We disabled FORMS for documents where we knew the structure (the LLM handles structured extraction from raw text more cost-effectively than Textract FORMS for our use case).
## Multi-page document handling
GetDocumentAnalysis returns results in pages of up to 1000 blocks. For a 50-page document with dense text, you may need to paginate through 5–10 result pages. The NextToken pagination is straightforward but easy to forget — if your first page looks right and you don't paginate, you'll silently drop content from long documents.
We wrapped Textract pagination in a utility that collects all blocks before returning:
```typescript
import { Textract, Block } from "@aws-sdk/client-textract";

const textract = new Textract({});

// Assumes the job has already completed (we call this from the
// SNS-triggered Lambda, so there's no need to check JobStatus here).
async function getAllBlocks(jobId: string): Promise<Block[]> {
  const blocks: Block[] = [];
  let nextToken: string | undefined;
  do {
    // Each result page holds up to 1,000 blocks; NextToken signals more remain.
    const response = await textract.getDocumentAnalysis({
      JobId: jobId,
      NextToken: nextToken,
    });
    blocks.push(...(response.Blocks ?? []));
    nextToken = response.NextToken;
  } while (nextToken);
  return blocks;
}
```

## Block reconstruction
The Textract response is a flat list of Block objects with relationships. A PAGE block contains LINE blocks. A LINE block contains WORD blocks. TABLE blocks have CELL children. Reconstructing readable text from this hierarchy requires traversing the relationship graph.
For our use case (feeding text to an LLM), we built a simple serializer that joins words into lines, lines into paragraphs, and outputs markdown-ish text with table sections delimited. The structure preservation helped the LLM extraction step significantly.
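Our serializer is more involved, but a minimal sketch of the traversal over getAllBlocks output shows the shape of it: index blocks by Id, then follow each PAGE block's CHILD relationships down to its LINE blocks.

```typescript
import { Block } from "@aws-sdk/client-textract";

// Join LINE text per page, in the order Textract emits it.
function blocksToText(blocks: Block[]): string {
  const byId = new Map(blocks.map((b) => [b.Id!, b]));
  const pages = blocks.filter((b) => b.BlockType === "PAGE");

  return pages
    .map((page) =>
      (page.Relationships ?? [])
        .filter((rel) => rel.Type === "CHILD")
        .flatMap((rel) => rel.Ids ?? [])
        .map((id) => byId.get(id))
        .filter((child) => child?.BlockType === "LINE")
        .map((line) => line!.Text ?? "")
        .join("\n")
    )
    .join("\n\n");
}
```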
## What we'd do differently
**Start with async.** The sync API is a local development convenience, not a production pattern.

**Set up cost alerts immediately.** AWS cost alerts on a per-service budget saved us from a surprise bill when a bug caused documents to be reprocessed in a loop.

**Cache results aggressively.** Textract results are deterministic for the same document. We store results in S3 keyed by document hash and skip re-processing if the hash exists. This eliminated redundant API calls during development and testing.
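A sketch of the read side of that cache, assuming results live in a bucket keyed by content hash (the bucket name and key scheme here are made up):

```typescript
import { S3 } from "@aws-sdk/client-s3";
import { Block } from "@aws-sdk/client-textract";
import { createHash } from "node:crypto";

const s3 = new S3({});
const CACHE_BUCKET = "textract-results-cache"; // hypothetical bucket

// Returns cached blocks if this exact document was processed before, else null.
async function getCachedResult(documentBytes: Buffer): Promise<Block[] | null> {
  const hash = createHash("sha256").update(documentBytes).digest("hex");
  const key = `textract/${hash}.json`;
  try {
    const obj = await s3.getObject({ Bucket: CACHE_BUCKET, Key: key });
    return JSON.parse(await obj.Body!.transformToString()) as Block[];
  } catch (err: any) {
    if (err.name === "NoSuchKey") return null; // cache miss: run Textract
    throw err;
  }
}
```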