10+ Years of Experience
Key Responsibilities:
Own production reliability on AWS: availability, latency, throughput, capacity, and incident response
Architect and operate scalable infrastructure (multi-AZ as a baseline; DR strategy and regular testing)
Build and maintain Infrastructure as Code (Terraform / CloudFormation / CDK) and Git-based workflows
Improve CI/CD pipelines and deployment strategies (blue/green, canary, progressive delivery)
Implement strong observability: metrics, logs, traces, alerting, dashboards; define SLOs/SLIs and reduce alert noise
Own database operations on AWS (Aurora/RDS MySQL): backups/restores (including restore drills), read replicas, performance troubleshooting, and capacity planning
Improve caching and traffic handling (CDN, Redis/ElastiCache, queues) to sustain peak demand
Harden security posture: IAM least privilege, secrets management, patching, WAF, audit trails
Drive adoption of relevant AWS managed services (where they increase reliability and reduce ops burden)
Drive cloud cost efficiency (FinOps): cost visibility, tagging, budgets/alerts, rightsizing, and smart usage of AWS pricing models without compromising reliability
Lead post-incident reviews (RCA, corrective actions, prevention) and ensure improvements are implemented and verified
Requirements:
10+ years of experience in a similar role
Strong hands-on AWS experience in production (typical stack: VPC, IAM, EC2, ALB/NLB, Auto Scaling, S3, CloudFront, Route 53, CloudWatch/CloudTrail, WAF; plus Aurora/RDS)
Proven experience designing/operating high-load web systems with strict uptime requirements
IaC and automation mindset (Terraform/CloudFormation/CDK + scripting Bash/Python)
Production MySQL on AWS (Aurora/RDS): backups and restores (including restore drills), read replicas, monitoring, and performance troubleshooting
Ability to troubleshoot production web stacks (Nginx + PHP-FPM) and identify bottlenecks across app ↔ DB ↔ infrastructure
Containers and deployment automation (ECS/EKS, Docker; understanding of scaling and rollout patterns)
Solid Linux + networking fundamentals (DNS, TLS, routing, LB, troubleshooting)
Observability practices and incident management experience
Must be reachable for critical production incidents; occasional after-hours support may be required (critical incidents only)
Job Type: Full-time
Work Location: Remote