Multi-Region Failover Infrastructure

Problem

Enterprise applications need to stay up even when an entire AWS region goes down. The goal was to design and deploy a containerized application that could survive a full regional outage with minimal downtime, and prove it with real failover testing. Also needed a data pipeline that could process uploaded financial documents automatically.

Approach

Deployed the same ECS service in two AWS regions (us-west-1 as primary, us-east-2 as failover), each behind its own ALB with Auto Scaling Groups. Deliberately chose EC2 launch type over Fargate to demonstrate deeper control over infrastructure, instance-level scaling, and integration with ASGs. Created launch templates and ASGs tightly coupled with ECS capacity providers so container workloads scaled in tandem with underlying compute.

Scaling policies were built around ALB request count metrics rather than CPU-based triggers. The application was lightweight enough that CPU would never realistically spike, so request count was the honest metric for traffic-driven scaling. Route 53 health checks pinged both regions continuously; failover happened within about 60 seconds of primary going down. A key engineering challenge was that the lightweight frontend didn't naturally generate enough load, so demonstrations were designed to artificially generate traffic and simulate failure conditions.

WAF configuration included AWS Managed Rules, rate-based rules to block excessive requests from single IPs, and XSS protections. CloudWatch dashboards visualize system health, traffic patterns, and scaling behavior across both regions. SNS alerts fire on scaling activities and health check failures. Separately, built a serverless ETL pipeline: financial documents land in S3, trigger a Lambda function that extracts transaction data, transforms it, and loads results into DynamoDB.

Architecture

Tech stack

AWS ECS Docker Route 53 ALB Lambda (Python) DynamoDB CloudWatch SNS WAF S3

Status

Deployed and validated. Failover tested with simulated regional outage. ~99.99% measured availability.

Multi-region failover infrastructure

Problem

Approach

Architecture

Tech stack

Status