How when AWS was down, we were not
Brief summary
Authress avoided downtime during the AWS us-east-1 outage through multi-region, redundant infrastructure: automated DNS failover driven by custom health checks, edge-optimized routing, and anomaly detection, backed by rigorous testing and incremental deployments that limit the blast radius of any change. The system design assumes failure is inevitable and focuses on quick detection, seamless failover, and minimizing single points of failure through automation and continuous validation.
Long summary
- A major AWS us-east-1 outage took down DynamoDB and other critical AWS services that Authress depends on.
- They run infrastructure in us-east-1 due to customer location demands, despite known risks.
- The control planes of global AWS services such as CloudFront, Certificate Manager, Lambda@Edge, and IAM are centralized in us-east-1, so an incident there affects their availability.
- Hitting a 5-nines SLA (99.999% uptime) takes more than relying on AWS's own SLAs, which are not sufficient on their own.
- Simple single-region architectures cannot reach that level of reliability because AWS incidents are too frequent.
- Authress recognizes "everything fails all the time" and designs systems assuming failure.
- Retry strategies are analyzed mathematically; a third-party dependency needs at least 99.7% per-request reliability before retries can carry the system to its availability target (a worked example follows this list).
- Multi-region, redundant infrastructure uses Route 53 health checks to drive automatic DNS failover (a configuration sketch follows this list).
- Custom health checks validate actual application health rather than relying on default DNS-level checks.
- Edge-optimized architecture using CloudFront and Lambda@Edge improves latency and provides better failover options.
- DynamoDB Global Tables replicate data across regions so a failover region already has current data (see the failover-read sketch after this list).
- Rigorous testing and validation, including application-level tests, mitigate risks of bugs in production.
- Incremental deployment by customer bucket limits impact by rolling changes out gradually (see the bucketing sketch after this list).
- Asynchronous validation tests check consistency across databases after deployments.
- Anomaly detection is used to identify meaningful incidents impacting business logic, beyond mere HTTP error codes.
- Customer support feedback is integrated into incident detection to catch undetected or gray failures.
- Security measures include rate limiting, AWS WAF with IP reputation lists, and blocking suspicious high-volume requests.
- Preventing resource exhaustion is critical; rate limiting is implemented at multiple infrastructure layers (a token-bucket sketch follows this list).
- Differences between Infrastructure as Code (IaC) deployments across regions and the edge make it hard to keep configurations consistent.
- Despite all these measures, achieving a true 5-nines SLA is extremely challenging but remains a core commitment.
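To put a number on the retry claim above: if attempts are treated as independent (an assumption that correlated outages break, as the HN discussion below points out), a dependency called with one retry needs roughly 99.68% per-attempt success to reach 99.999% overall, which lines up with the 99.7% figure cited above. A minimal sketch of that arithmetic:

```python
# Illustration of the retry math behind the ~99.7% figure, assuming attempts
# fail independently; real correlated outages violate this assumption.

def effective_availability(per_attempt_success: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - per_attempt_success) ** attempts

def required_per_attempt(target: float, attempts: int) -> float:
    """Minimum per-attempt success rate for `attempts` tries to hit `target`."""
    return 1.0 - (1.0 - target) ** (1.0 / attempts)

if __name__ == "__main__":
    five_nines = 0.99999
    # With one retry (two attempts), each attempt must succeed ~99.68% of the
    # time to reach five nines overall -- close to the 99.7% cited above.
    print(f"{required_per_attempt(five_nines, attempts=2):.5f}")  # ~0.99684
    print(f"{effective_availability(0.997, attempts=2):.7f}")     # ~0.9999910
```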
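The DNS failover described above could be wired roughly like this with Route 53; a minimal boto3 sketch in which the domain names, hosted zone ID, and endpoint addresses are placeholders rather than Authress's actual configuration. The health check targets an application path so it reflects real service health, matching the custom-health-check point.

```python
# Sketch of Route 53 DNS failover driven by a custom health check (boto3).
# Domain names, zone ID, and IPs are placeholders, not Authress's real setup.
import uuid
import boto3

route53 = boto3.client("route53")

# Health check against an application-level endpoint so that it reflects real
# service health (the handler can exercise DynamoDB and other dependencies),
# not just TCP reachability.
health_check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "us-east-1.api.example.com",
        "ResourcePath": "/health",
        "Port": 443,
        "RequestInterval": 10,
        "FailureThreshold": 2,
    },
)

def failover_record(identifier, role, ip, health_check_id=None):
    """Build an UPSERT change for one side of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": identifier,
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId="ZEXAMPLEZONEID",
    ChangeBatch={"Changes": [
        failover_record("us-east-1", "PRIMARY", "203.0.113.10",
                        health_check["HealthCheck"]["Id"]),
        failover_record("eu-west-1", "SECONDARY", "203.0.113.20"),
    ]},
)
```

When the primary's health check fails, Route 53 starts answering with the secondary record; the low TTL keeps resolver caching from delaying the switch for too long, though the HN thread below notes that caching is never fully under your control.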
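For the Global Tables point, a minimal sketch of a failover read: try the usual region's replica first and fall back to another replica if it errors. The table name, key, and regions are illustrative assumptions.

```python
# Sketch: read from a DynamoDB Global Table, falling back to a replica region
# when the preferred region's endpoint is failing. Table name, key schema, and
# regions are illustrative placeholders.
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "eu-west-1"]  # replica regions of the global table

# Short timeouts and minimal SDK retries so a regional failure surfaces quickly
# and the next replica gets tried instead.
_cfg = Config(connect_timeout=1, read_timeout=2, retries={"max_attempts": 1})
_tables = [
    boto3.resource("dynamodb", region_name=region, config=_cfg).Table("accounts")
    for region in REGIONS
]

def get_account(account_id: str):
    last_error = None
    for table in _tables:
        try:
            return table.get_item(Key={"accountId": account_id}).get("Item")
        except (ClientError, BotoCoreError) as err:
            last_error = err  # fall through to the next replica region
    raise last_error
```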
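A minimal sketch of rollout by customer bucket, assuming accounts are hashed deterministically into a fixed number of buckets and a release is enabled one bucket at a time; the bucket count and account identifier are illustrative.

```python
# Sketch of incremental rollout by customer bucket: each account hashes to a
# stable bucket, and a release is only enabled for buckets rolled out so far.
import hashlib

BUCKET_COUNT = 10

def bucket_for(account_id: str) -> int:
    """Deterministically map an account to one of BUCKET_COUNT buckets."""
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKET_COUNT

def release_enabled(account_id: str, buckets_rolled_out: int) -> bool:
    """True if this account's bucket falls inside the current rollout window."""
    return bucket_for(account_id) < buckets_rolled_out

# Example: at stage 2 of the rollout, only accounts in buckets 0 and 1 see the change.
print(release_enabled("acct_123", buckets_rolled_out=2))
```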
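On the rate-limiting point, one plausible application-layer piece is a per-caller token bucket, sitting behind the WAF and edge rules mentioned above; a minimal sketch with illustrative limits:

```python
# Sketch of application-layer rate limiting with a token bucket per caller.
# Refill rate and burst capacity are illustrative values.
import time
from collections import defaultdict

RATE = 50       # tokens refilled per second
CAPACITY = 100  # maximum burst size

_buckets = defaultdict(lambda: {"tokens": float(CAPACITY), "updated": time.monotonic()})

def allow_request(caller_id: str) -> bool:
    """Return True if the caller still has budget, False if it should get a 429."""
    bucket = _buckets[caller_id]
    now = time.monotonic()
    elapsed = now - bucket["updated"]
    bucket["tokens"] = min(CAPACITY, bucket["tokens"] + elapsed * RATE)
    bucket["updated"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False
```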
Summary of HN discussion
https://news.ycombinator.com/item?id=45955565
- The discussion highlights concerns about automation and Infrastructure as Code (IaC) being potential failure points, emphasizing the challenge of safely updating these systems.
- Rollbacks are rarely automatic; it is often better to know in advance which rollouts to avoid, since automated rollbacks can make a failure worse.
- Simpler infrastructure changes are preferred because they reduce human error, the leading cause of incidents.
- There is skepticism about the reliability of Route 53 failover in practice, with concerns about its failure modes and the complexity of multi-cloud DNS failover.
- Some contributors suggest modular IaC approaches (Pulumi, Terragrunt) for safer, repeatable deployments but warn about added complexity.
- Blind retries are criticized: when failures are correlated, retries do not improve reliability the way the simple math suggests, and they add load to an already overloaded system during an outage.
- Latency and client timeout budgets limit how many retries are practical (see the retry-budget sketch after this list).
- DNS is acknowledged as a single point of failure with caching and failover timing challenges.
- Multi-cloud failover at DNS level is complex, costly, and not widely implemented due to infrastructure and coordination requirements.
- Gray failures, where the system reports healthy while customers experience problems, make it hard to know an incident's real impact without customer feedback.
- Customer support is critical in incident detection since automated systems cannot catch every failure.
- Detailed monitoring via CloudFront and telemetry helps identify actual service issues during outages.
- Overall, the theme is the difficulty in achieving perfect reliability, the importance of simplicity, and the need for layered detection and response strategies to manage failures.
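To illustrate the retry-budget point: the client's overall timeout, not a fixed retry count, ends up deciding how many attempts are even possible. A minimal sketch with illustrative budget and backoff values:

```python
# Sketch of deadline-aware retries: stop retrying once the remaining time
# budget cannot absorb another backoff, instead of piling on during an outage.
import random
import time

def call_with_deadline(operation, total_budget_s=3.0, base_backoff_s=0.1):
    """Call `operation` until it succeeds or the time budget is exhausted."""
    deadline = time.monotonic() + total_budget_s
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            attempt += 1
            # Exponential backoff with full jitter, capped at one second per wait.
            backoff = min(1.0, base_backoff_s * (2 ** attempt)) * random.random()
            if time.monotonic() + backoff >= deadline:
                raise  # out of budget: surface the failure rather than retry again
            time.sleep(backoff)
```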