10+ Years of Experience

Key Responsibilities:

Own production reliability on AWS: availability, latency, throughput, capacity, and incident response

Architect and operate scalable infrastructure (multi-AZ as a baseline; DR strategy and regular testing)

Build and maintain Infrastructure as Code (Terraform / CloudFormation / CDK) and Git-based workflows

Improve CI/CD pipelines and deployment strategies (blue/green, canary, progressive delivery)

Implement strong observability: metrics, logs, traces, alerting, dashboards; define SLO/SLI and reduce noise

Own database operations on AWS (Aurora/RDS MySQL): backups/restores (including restore drills), read replicas, performance troubleshooting, and capacity planning

Improve caching and traffic handling (CDN, Redis/ElastiCache, queues) to sustain peak demand

Harden security posture: IAM least privilege, secrets management, patching, WAF, audit trails

Drive adoption of relevant AWS managed services (where it increases reliability and reduces ops burden)

Drive cloud cost efficiency (FinOps): cost visibility, tagging, budgets/alerts, rightsizing, and smart usage of AWS pricing models without compromising reliability

Lead post-incident reviews (RCA, corrective actions, prevention), and ensure improvements are implemented and verified

Requirements

10+ years of experience in a similar role

Strong hands-on AWS experience in production (typical stack: VPC, IAM, EC2, ALB/NLB, Auto Scaling, S3, CloudFront, Route 53, CloudWatch/CloudTrail, WAF; plus Aurora/RDS)

Proven experience designing/operating high-load web systems with strict uptime requirements

IaC and automation mindset (Terraform/CloudFormation/CDK + scripting Bash/Python)

Production MySQL on AWS (Aurora/RDS): backups & restores (including restore drills), read replicas, monitoring, and performance troubleshooting

Ability to troubleshoot production web stacks (Nginx + PHP-FPM) and identify bottlenecks across app ↔ DB ↔ infrastructure

Containers and deployment automation (ECS/EKS, Docker; understanding of scaling and rollout patterns)

Solid Linux + networking fundamentals (DNS, TLS, routing, LB, troubleshooting)

Observability practices and incident management experience. Must be reachable for critical production incidents; occasional after-hours support may be required (critical-only)

Job Type: Full-time

Work Location: Remote

Senior DevOps / SRE Engineer

