Job Summary
We are seeking a Senior Site Reliability Engineer to lead initiatives in maintaining and improving the reliability, scalability, and efficiency of our production systems. You will design automation, implement observability frameworks, and mentor junior SREs while collaborating closely with DevOps and engineering teams. This is an excellent opportunity for an engineer with 5+ years of experience who thrives on ownership, technical leadership, and driving operational excellence.
Key Responsibilities
Reliability Leadership:
Lead initiatives to improve system reliability, performance, and scalability.
Automation & Remediation:
Design and implement automated workflows, deployment safety checks, and auto-remediation processes.
Monitoring & SLO Management:
Define, enforce, and monitor SLOs/SLIs, ensuring alignment with business objectives.
Incident Response:
Lead incident response for complex issues, perform postmortems, and implement preventive measures.
Mentorship & Collaboration:
Mentor junior SREs, collaborate with DevOps and engineering teams, and promote reliability best practices.
Runbooks & Documentation:
Maintain and enhance operational runbooks, ensuring production knowledge is standardized.
Capacity & Disaster Planning:
Contribute to capacity planning, scaling strategies, and disaster recovery design.
Job Requirements
Experience:
Minimum 5 years in SRE, DevOps, or related roles, with demonstrated ownership of production systems.
Tech Stack:
Cloud platforms (AWS, GCP, Azure), containerization (Docker, Kubernetes), IaC (Terraform, Ansible, Helm), CI/CD pipelines, monitoring tools (Prometheus, Grafana, ELK/EFK, Datadog).
Systems Knowledge:
Expertise in distributed systems, networking, and fault-tolerant infrastructure design.
Scripting & Automation:
Proficiency in Python, Go, Bash, Java, or similar languages, with automation experience.
Problem Solving & System Design:
Skilled at diagnosing complex incidents, designing resilient systems, and driving continuous improvement.
Soft Skills
Leadership:
Lead initiatives, mentor team members, and drive adoption of best practices.
Ownership:
Take responsibility for reliability, availability, and operational excellence.
Adaptability:
Thrive in evolving technical landscapes and complex production environments.
Communication:
Effectively convey technical concepts to both technical and non-technical stakeholders.
What We Offer
Technical Leadership Opportunities:
Lead high-impact reliability and automation projects.
Continuous Growth:
Access to mentorship, certifications, and career progression.
High-Performance Collaboration:
Work with a talented team in an Agile, CI/CD-driven environment.
Flexibility and Trust:
An open culture that values innovation, autonomy, and results-driven decision-making.