Using Textract with .NET

What Is Textract?

Amazon Textract goes beyond basic OCR. It extracts text, tables, forms (key-value pairs), and signatures from scanned documents and PDFs. It understands document structure. It knows that "Name:" and "John Smith" on a form are a key-value pair, not just two lines of text.

For .NET shops in healthcare, finance, insurance, and legal (industries that still drown in paper), Textract automates work that used to require manual data entry or expensive enterprise OCR software.

When to Use Textract vs Rekognition

Rekognition DetectText: Simple text detection in photos (street signs, product labels). Returns text with location.
Textract DetectDocumentText: Full document OCR. Returns every line and word with geometry.
Textract AnalyzeDocument: Understands structure. Tables, forms, key-value pairs. The real power.

Use Textract when you need to extract structured data from documents. Use Rekognition when you need to find text in photos.

Basic Text Detection

using System.Linq;
using Amazon.Textract;
using Amazon.Textract.Model;

var textractClient = new AmazonTextractClient();

var response = await textractClient.DetectDocumentTextAsync(new DetectDocumentTextRequest
{
    Document = new Document
    {
        S3Object = new Amazon.Textract.Model.S3Object
        {
            Bucket = "documents-bucket",
            Name = "invoices/INV-2024-001.pdf",
        },
    },
});

// Extract all text lines
var lines = response.Blocks
    .Where(b => b.BlockType == BlockType.LINE)
    .Select(b => b.Text)
    .ToList();

Extracting Forms (Key-Value Pairs)

Form extraction is the killer feature. AnalyzeDocument with the FORMS feature type returns each field as a linked pair of key and value blocks:

var response = await textractClient.AnalyzeDocumentAsync(new AnalyzeDocumentRequest
{
    Document = new Document
    {
        S3Object = new Amazon.Textract.Model.S3Object
        {
            Bucket = "documents-bucket",
            Name = "applications/form-001.pdf",
        },
    },
    FeatureTypes = new List<string> { "FORMS" },
});

// Parse key-value pairs
var keyValuePairs = ExtractKeyValuePairs(response.Blocks);

foreach (var (key, value) in keyValuePairs)
{
    Console.WriteLine($"{key}: {value}");
    // Output: "Patient Name: John Smith"
    //         "Date of Birth: 03/15/1985"
    //         "Insurance ID: XYZ-123456"
}

Helper to extract key-value pairs

public static Dictionary<string, string> ExtractKeyValuePairs(List<Block> blocks)
{
    var blockMap = blocks.ToDictionary(b => b.Id);

    // KEY_VALUE_SET blocks are tagged KEY or VALUE. Only the KEY blocks are
    // collected here; each one points at its VALUE block through a relationship.
    var keyBlocks = blocks
        .Where(b => b.BlockType == BlockType.KEY_VALUE_SET
                 && b.EntityTypes?.Contains("KEY") == true)
        .ToList();

    var result = new Dictionary<string, string>();

    foreach (var keyBlock in keyBlocks)
    {
        var keyText = GetTextFromRelationships(keyBlock, blockMap, "CHILD");
        
        // Find the associated VALUE block
        var valueRelation = keyBlock.Relationships?
            .FirstOrDefault(r => r.Type == RelationshipType.VALUE);
        
        if (valueRelation != null)
        {
            var valueBlockId = valueRelation.Ids.First();
            if (blockMap.TryGetValue(valueBlockId, out var valueBlock))
            {
                var valueText = GetTextFromRelationships(valueBlock, blockMap, "CHILD");
                result[keyText.Trim()] = valueText.Trim();
            }
        }
    }

    return result;
}

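private static string GetTextFromRelationships(Block block, Dictionary<string, Block> blockMap, string relType)
{
    var children = block.Relationships?
        .FirstOrDefault(r => r.Type == relType)?.Ids ?? new List<string>();

    // WORD blocks carry text; SELECTION_ELEMENT blocks (checkboxes, radio
    // buttons) carry a SelectionStatus of SELECTED or NOT_SELECTED instead.
    return string.Join(" ", children
        .Where(blockMap.ContainsKey)
        .Select(id => blockMap[id])
        .Select(b => b.BlockType == BlockType.SELECTION_ELEMENT
            ? b.SelectionStatus?.Value
            : b.BlockType == BlockType.WORD ? b.Text : null)
        .Where(text => !string.IsNullOrEmpty(text)));
}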

Extracting Tables

var response = await textractClient.AnalyzeDocumentAsync(new AnalyzeDocumentRequest
{
    Document = new Document
    {
        S3Object = new Amazon.Textract.Model.S3Object
        {
            Bucket = "documents-bucket",
            Name = "reports/quarterly.pdf",
        },
    },
    FeatureTypes = new List<string> { "TABLES" },
});

// Parse tables into rows/columns
var tables = ExtractTables(response.Blocks);

foreach (var table in tables)
{
    foreach (var row in table.Rows)
    {
        Console.WriteLine(string.Join(" | ", row.Cells.Select(c => c.Text)));
    }
}
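
The snippet above calls ExtractTables without defining it. Here is a minimal sketch of what such a helper might look like, assuming the AWSSDK.Textract package. The Table, TableRow, and TableCell records and the TextractTableParser class are hypothetical names chosen to match the usage above, not SDK types. The sketch rebuilds each table by following CHILD relationships from TABLE to CELL to WORD blocks and ordering cells by RowIndex and ColumnIndex. It ignores merged cells and selection elements.

```csharp
using System.Collections.Generic;
using System.Linq;
using Amazon.Textract;
using Amazon.Textract.Model;

// Hypothetical result types, shaped to match table.Rows / row.Cells / c.Text.
public record TableCell(string Text);
public record TableRow(List<TableCell> Cells);
public record Table(List<TableRow> Rows);

public static class TextractTableParser
{
    public static List<Table> ExtractTables(List<Block> blocks)
    {
        var blockMap = blocks.ToDictionary(b => b.Id);

        return blocks
            .Where(b => b.BlockType == BlockType.TABLE)
            .Select(table => new Table(
                ChildBlocks(table, blockMap)
                    .Where(b => b.BlockType == BlockType.CELL)
                    // Group cells into rows, then order rows and columns.
                    .GroupBy(cell => cell.RowIndex)
                    .OrderBy(row => row.Key)
                    .Select(row => new TableRow(row
                        .OrderBy(cell => cell.ColumnIndex)
                        .Select(cell => new TableCell(CellText(cell, blockMap)))
                        .ToList()))
                    .ToList()))
            .ToList();
    }

    // Resolves the CHILD relationship IDs of a block to the blocks themselves.
    private static IEnumerable<Block> ChildBlocks(Block block, Dictionary<string, Block> blockMap)
    {
        var ids = block.Relationships?
            .Where(r => r.Type == RelationshipType.CHILD)
            .SelectMany(r => r.Ids ?? new List<string>())
            ?? Enumerable.Empty<string>();

        return ids.Where(blockMap.ContainsKey).Select(id => blockMap[id]);
    }

    // Joins the WORD children of a cell into a single string.
    private static string CellText(Block cell, Dictionary<string, Block> blockMap) =>
        string.Join(" ", ChildBlocks(cell, blockMap)
            .Where(b => b.BlockType == BlockType.WORD)
            .Select(b => b.Text));
}
```

With this in place, the call above would be written as TextractTableParser.ExtractTables(response.Blocks), or the method could be moved into the same class as the other helpers.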

Async Processing for Multi-Page Documents

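For multi-page documents, use the asynchronous API. Textract publishes a completion notice to SNS, and you then fetch the results with GetDocumentAnalysis (setting OutputConfig also writes them to your own S3 bucket):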

// Start async job
var startResponse = await textractClient.StartDocumentAnalysisAsync(new StartDocumentAnalysisRequest
{
    DocumentLocation = new DocumentLocation
    {
        S3Object = new Amazon.Textract.Model.S3Object
        {
            Bucket = "documents-bucket",
            Name = "contracts/long-contract.pdf",
        },
    },
    FeatureTypes = new List<string> { "FORMS", "TABLES" },
    NotificationChannel = new NotificationChannel
    {
        SNSTopicArn = snsTopicArn,
        RoleArn = textractRoleArn,
    },
    OutputConfig = new OutputConfig
    {
        S3Bucket = "textract-results",
        S3Prefix = "results/",
    },
});

var jobId = startResponse.JobId;

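// Later (triggered by SNS notification)...
// Collect blocks from every page of results, including the first one
var allBlocks = new List<Block>();
string? nextToken = null;

do
{
    var results = await textractClient.GetDocumentAnalysisAsync(new GetDocumentAnalysisRequest
    {
        JobId = jobId,
        NextToken = nextToken,
    });

    // Stop early if the job failed or has not finished yet
    if (results.JobStatus != JobStatus.SUCCEEDED)
        throw new InvalidOperationException($"Textract job {jobId} status: {results.JobStatus}");

    if (results.Blocks != null)
        allBlocks.AddRange(results.Blocks);

    nextToken = results.NextToken;
} while (!string.IsNullOrEmpty(nextToken));

// Process allBlocks...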

Document Processing Pipeline

A production pipeline: document uploaded → Textract extracts data → validate → store structured results:

// Lambda: triggered by S3 upload of a new document
public async Task Handler(S3Event s3Event, ILambdaContext context)
{
    foreach (var record in s3Event.Records)
    {
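        // S3 event keys are URL-encoded with spaces sent as '+', which
        // Uri.UnescapeDataString would leave untouched
        var key = System.Net.WebUtility.UrlDecode(record.S3.Object.Key);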
        
        // Start async analysis
        var response = await _textractClient.StartDocumentAnalysisAsync(new StartDocumentAnalysisRequest
        {
            DocumentLocation = new DocumentLocation
            {
                S3Object = new Amazon.Textract.Model.S3Object
                {
                    Bucket = record.S3.Bucket.Name,
                    Name = key,
                },
            },
            FeatureTypes = new List<string> { "FORMS", "TABLES" },
            NotificationChannel = new NotificationChannel
            {
                SNSTopicArn = Environment.GetEnvironmentVariable("SNS_TOPIC_ARN")!,
                RoleArn = Environment.GetEnvironmentVariable("TEXTRACT_ROLE_ARN")!,
            },
        });
        
        // Store job tracking info
        await _dynamoRepo.PutAsync(new DocumentJob
        {
            DocumentKey = key,
            JobId = response.JobId,
            Status = "PROCESSING",
            SubmittedAt = DateTime.UtcNow,
        });
    }
}
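
The handler above only starts the job; a second function has to react when Textract publishes the completion notice. Here is a minimal sketch of parsing that notice, assuming the SNS message body (record.Sns.Message in an SNS-triggered Lambda) is JSON with JobId, Status, and API fields, as I understand the Textract documentation to describe. TextractJobNotification and TextractNotificationParser are hypothetical names for illustration.

```csharp
using System.Text.Json;

// Hypothetical record holding the fields this pipeline needs from the notice.
public record TextractJobNotification(string JobId, string Status, string Api);

public static class TextractNotificationParser
{
    // Parses the JSON body of the SNS message published by Textract.
    // Throws KeyNotFoundException if an expected field is missing.
    public static TextractJobNotification Parse(string snsMessageJson)
    {
        using var doc = JsonDocument.Parse(snsMessageJson);
        var root = doc.RootElement;

        return new TextractJobNotification(
            root.GetProperty("JobId").GetString() ?? "",
            root.GetProperty("Status").GetString() ?? "",
            root.GetProperty("API").GetString() ?? "");
    }

    // Only SUCCEEDED jobs have results worth fetching.
    public static bool IsSuccess(TextractJobNotification notification) =>
        notification.Status == "SUCCEEDED";
}
```

A completion handler could call Parse on each SNS record, look up the DocumentJob by JobId, and fetch results with GetDocumentAnalysisAsync only when IsSuccess returns true, marking the job as failed otherwise.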

CDK Setup (C#)

using Amazon.CDK;
using Amazon.CDK.AWS.IAM;

// Textract role (allows Textract to publish to SNS)
var textractRole = new Role(this, "TextractRole", new RoleProps
{
    AssumedBy = new ServicePrincipal("textract.amazonaws.com"),
});
textractRole.AddToPolicy(new PolicyStatement(new PolicyStatementProps
{
    Actions = new[] { "sns:Publish" },
    Resources = new[] { notificationTopic.TopicArn },
}));

// Grant Lambda access to call Textract
processorFunction.AddToRolePolicy(new PolicyStatement(new PolicyStatementProps
{
    Actions = new[]
    {
        "textract:DetectDocumentText",
        "textract:AnalyzeDocument",
        "textract:StartDocumentAnalysis",
        "textract:GetDocumentAnalysis",
    },
    Resources = new[] { "*" },
}));

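// Textract reads the document with the caller's credentials, so the Lambda
// that starts the job needs read access to the bucket
documentBucket.GrantRead(processorFunction);

// The Lambda hands the Textract role to the service in NotificationChannel,
// which requires permission to pass that role
processorFunction.AddToRolePolicy(new PolicyStatement(new PolicyStatementProps
{
    Actions = new[] { "iam:PassRole" },
    Resources = new[] { textractRole.RoleArn },
}));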

Cost

DetectDocumentText: $1.50 per 1,000 pages
AnalyzeDocument (forms): $50 per 1,000 pages
AnalyzeDocument (tables): $15 per 1,000 pages
AnalyzeDocument (forms + tables): $65 per 1,000 pages

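Form extraction is the expensive one, at more than 30 times the per-page price of DetectDocumentText. If you only need raw text, DetectDocumentText is cheap. Rates vary by region and volume tier, so check current pricing and budget carefully for form-heavy workloads.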
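
To make the budgeting concrete, here is a small sketch that turns the rates listed above into a monthly estimate. TextractCostEstimator is a hypothetical helper, and the constants are simply the figures from this section, not values read from any AWS API.

```csharp
using System;

public static class TextractCostEstimator
{
    // USD per 1,000 pages, copied from the list above.
    private const decimal DetectTextPer1000 = 1.50m;
    private const decimal FormsPer1000 = 50m;
    private const decimal TablesPer1000 = 15m;

    // Estimates the cost of processing a number of pages with the given features.
    // With neither feature enabled, the DetectDocumentText rate applies.
    public static decimal Estimate(long pages, bool forms, bool tables)
    {
        if (pages < 0) throw new ArgumentOutOfRangeException(nameof(pages));

        decimal ratePer1000 = (forms, tables) switch
        {
            (false, false) => DetectTextPer1000,
            _ => (forms ? FormsPer1000 : 0m) + (tables ? TablesPer1000 : 0m),
        };

        return pages * ratePer1000 / 1000m;
    }
}
```

For 20,000 pages a month, Estimate(20_000, forms: true, tables: true) comes to 1,300 dollars, while the same volume as raw text comes to 30 dollars.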

Tips

Use async processing for anything over 1 page. Synchronous AnalyzeDocument only handles single-page documents. Multi-page PDFs require StartDocumentAnalysis.
Confidence scores matter. Textract returns a confidence score (0 to 100) on every block, including the key and value blocks of a form field. For high-stakes data (medical records, financial forms), route low-confidence extractions to human review.
Pre-process images. Textract works best with clean, high-contrast scans. Skewed or low-resolution documents produce worse results. Consider image preprocessing (deskew, contrast enhancement) before sending to Textract.
The block model is complex. Textract returns a flat list of blocks with relationships. You need helper code (like the examples above) to reconstruct the document structure. Consider building a reusable TextractParser class for your team.
Combine with Bedrock for understanding. Extract structured data with Textract, then use Bedrock to interpret or summarize it. For example: extract all fields from an insurance claim, then use Claude to flag anomalies.
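
The human-review tip above can be sketched as a simple threshold split. This assumes you have already copied each field's key, value, and Block.Confidence into a type of your own. ExtractedField and ReviewRouter are hypothetical names, and the 90 default is an arbitrary starting point rather than a recommended value.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape for one extracted form field.
// Confidence is the 0-100 score copied from the Textract block.
public record ExtractedField(string Key, string Value, float Confidence);

public static class ReviewRouter
{
    // Splits fields into those accepted automatically and those sent to a reviewer.
    public static (List<ExtractedField> Accepted, List<ExtractedField> NeedsReview) Split(
        IEnumerable<ExtractedField> fields, float threshold = 90f)
    {
        var byOutcome = fields.ToLookup(f => f.Confidence >= threshold);
        return (byOutcome[true].ToList(), byOutcome[false].ToList());
    }
}
```

Tune the threshold per field type: a misread insurance ID usually costs more than a misread middle initial, so it may deserve a stricter cutoff.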