Our client in the professional services sector is seeking a Site Reliability Engineer (SRE) to enhance service resiliency, automation, and operational excellence across critical cloud and on-prem workloads.
This role focuses on stability engineering, cloud migration readiness, observability, continuous improvement, and L2 production support.
Location: Toronto, hybrid (3 days per week)
Duration: 6 months + possible extension
Responsibilities
Drive service stability, automation, and optimization initiatives, ensuring compliance with SLAs and operational best practices.
Support AWS cloud workload migrations by validating readiness, ensuring observability coverage, and developing Day-2 runbook automation.
Collaborate with DevSecOps and Architecture teams to operationalize cloud-native and migrated applications using automated deployment, monitoring, and recovery pipelines.
Implement scalable solutions to reduce manual intervention, improve deployment efficiency, and enable self-healing and auto-scaling.
Use tools such as Splunk, Dynatrace, and Grafana to optimize performance, implement anomaly detection, and proactively address production issues.
Analyze trends from testing and production environments, conduct root-cause investigations, and recommend corrective actions to Agile squads.
Maintain clear technical and operational documentation including runbooks, SOPs, post-mortems, and architecture overviews.
Provide leadership in vulnerability remediation, security alignment, and technology lifecycle management.
Participate in a 24/7 rotating on-call schedule, providing L2 support and rapid response to production incidents.
Required Skills & Qualifications
Strong AppOps experience supporting highly resilient and high-performance workloads on AWS.
Proficiency with Git, PowerShell, Python, Ansible, Terraform, Docker, and microservices patterns.
Hands-on experience with Splunk, Dynatrace, Grafana, and ServiceNow.
Solid understanding of AWS ECS architecture, autoscaling, load balancing, and VPC integration.
Strong knowledge of Agile methodologies, SDLC processes, release management, and incident/problem/change management.
Ability to analyze data, identify issues early, and mitigate risks in production environments.
Excellent communication skills for working across technical and business stakeholders.
Nice to Have
SRE certification
Experience with OpenShift/Kubernetes
Familiarity with DevOps concepts and CI/CD
Knowledge of Oracle or PostgreSQL
AWS certifications
Financial Services or Payments experience
ITIL certification