Position Purpose:
The Senior Site Reliability Engineer (SRE) ensures the reliability, scalability, and performance of cloud-based software solutions. This role blends software engineering and systems administration to support and enhance critical infrastructure, working closely with development and operations teams to deliver secure and cost-effective cloud environments.
Essential Duties and Responsibilities:
Cloud Infrastructure Architecture and Implementation:
Designs, builds, and maintains robust cloud infrastructure solutions using AWS and other cloud technologies.
Mentorship and Team Development:
Provides technical guidance and mentorship to junior SREs, promoting a culture of continuous learning and improvement.
Operational Efficiency and Automation:
Identifies and implements process improvements through automation and optimization to enhance reliability and reduce manual effort.
Performance and Reliability Management:
Develops and executes strategies to meet or exceed Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
Incident Management:
Leads incident response efforts, performs root cause analysis, and implements preventive measures to minimize downtime.
Capacity Planning and System Optimization:
Proactively identifies performance bottlenecks, optimizes resource utilization, and ensures system scalability.
Security and Compliance:
Implements cloud security best practices, including least-privilege IAM policies, secrets management, and evidence generation for compliance frameworks (e.g., SOC 2, ISO 27001).
Other duties and projects as assigned.
Required Skills/Abilities:
Strong problem-solving, troubleshooting, and analytical skills.
Excellent communication and collaboration abilities.
Organizational skills with attention to detail.
Ability to manage time and prioritize tasks.
Proficiency in scripting languages (e.g., Python, PowerShell).
In-depth knowledge of Linux systems, networking, load balancing, and security principles.
Experience:
5+ years in Site Reliability Engineering or a similar role.
Extensive expertise in the AWS (Amazon Web Services) cloud platform and its services.
Experience with GitOps practices and CI/CD tooling (e.g., GitHub Actions, Jenkins, ArgoCD, or similar).
Experience with Infrastructure as Code (e.g., Terraform).
Experience designing and maintaining observability stacks (e.g., Prometheus, Grafana, ELK) with a focus on actionable metrics, alerting, and SLOs.