Our client in the professional services sector is seeking an SRE to enhance service resiliency, automation, and operational excellence across critical cloud and on-prem workloads.

This role focuses on stability engineering, cloud migration readiness, observability, continuous improvement, and L2 production support.

Location: Hybrid, 3 days per week, Toronto

Duration: 6 months + possible extension

Responsibilities

Drive service stability, automation, and optimization initiatives, ensuring compliance with SLAs and operational best practices.

Support AWS cloud workload migrations by validating readiness, ensuring observability coverage, and developing Day-2 runbook automation.

Collaborate with DevSecOps and Architecture teams to operationalize cloud-native and migrated applications using automated deployment, monitoring, and recovery pipelines.

Implement scalable solutions to reduce manual intervention, improve deployment efficiency, and enable self-healing and auto-scaling.

Use tools such as Splunk, Dynatrace, and Grafana to optimize performance, implement anomaly detection, and proactively address production issues.

Analyze trends from testing and production environments, conduct root-cause investigations, and recommend corrective actions to Agile squads.

Maintain clear technical and operational documentation including runbooks, SOPs, post-mortems, and architecture overviews.

Provide leadership in vulnerability remediation, security alignment, and technology lifecycle management.

Participate in a 24/7 rotating on-call schedule, providing L2 support and rapid response to production incidents.

Required Skills & Qualifications

Strong AppOps experience supporting highly resilient and high-performance workloads on AWS.

Proficiency with Git, PowerShell, Python, Ansible, Terraform, Docker, and microservices patterns.

Hands-on experience with Splunk, Dynatrace, Grafana, and ServiceNow.

Solid understanding of AWS ECS architecture, autoscaling, load balancing, and VPC integration.

Strong knowledge of Agile methodologies, SDLC processes, release management, and incident/problem/change management.

Ability to analyze data, identify issues early, and mitigate risks in production environments.

Excellent communication skills for working across technical and business stakeholders.

Nice to Have

SRE certification

Experience with OpenShift Kubernetes

Familiarity with DevOps concepts and CI/CD

Knowledge of Oracle or PostgreSQL

AWS certifications

Financial Services or Payments experience

ITIL certification

Site Reliability Engineer (DevOps/Release)

