Role: AppOps Senior Specialist

Location: Downtown Toronto, Bay Street - 3 days onsite

Responsibilities

Driving, implementing, and prioritizing service stability and automation/optimization initiatives while ensuring compliance with Service Level Agreements (SLAs).

Supporting cloud workload migrations from legacy infrastructure by validating readiness, ensuring observability coverage, and implementing runbook automation for Day 2 operations.

Partnering with DevSecOps and architecture teams to operationalize cloud-native and migrated applications by building automated deployment, monitoring, and recovery pipelines.

Developing and implementing scalable solutions to reduce manual intervention and streamline application deployments, validations, and availability, including self-healing and auto-scaling mechanisms.

Optimizing performance by leveraging monitoring and diagnostic tools (e.g., Dynatrace, Grafana, Splunk) and implementing automated anomaly detection and alerting strategies.

Analyzing testing and production trends, proactively identifying and mitigating emerging issues in collaboration with Agile squads, and driving detailed root cause analyses and performance tuning.

Creating and maintaining detailed technical and operational documentation, including runbooks, SOPs, post-mortems, and architecture overviews to support knowledge sharing and incident readiness.

Providing technical leadership in maintaining security, technology currency, and vulnerability remediation, with a continuous improvement mindset and a strong focus on resiliency.

Participating in a 24/7 rotating on-call production support schedule (L2 support), responding swiftly to escalated issues, and proactively mitigating risk and service disruptions across environments.

Skill Set Required (Must-Have Technologies)

Bachelor's degree in Computer Science, Information Technology, or a related field.

Experience in AppOps supporting highly resilient, high-performance production workloads on AWS.

Expertise in using version control systems, such as Git, for managing codebases and collaboration.

Expertise with scripting (e.g., PowerShell, Python, and Ansible).

Experience with Terraform and Infrastructure as Code

Experience with microservices and containerization (e.g., Docker).

Experience using tools such as Splunk, Dynatrace, Grafana, and ServiceNow.

Good understanding of core Amazon ECS architectural concepts: clusters, task definitions, tasks, services, auto scaling (service and cluster), and networking integration (such as with Elastic Load Balancing and Amazon VPC).

Strong knowledge of Agile development practices and supporting tools (e.g., Jira, GitHub).

Good understanding of the software lifecycle, including release planning, release processes, software testing, and incident, problem, and change management.

Well versed in analyzing data and in identifying and proactively mitigating risks to production systems.

Strong written and verbal communication skills, with the ability to communicate seamlessly with internal and external stakeholders.

Nice to Have

Site Reliability Engineering (SRE) Certification

Experience with OpenShift Kubernetes

Experience with DevOps concepts

Good understanding of database systems such as Oracle, PostgreSQL

AWS certifications

Experience working in the financial services and payments industry.

ITIL Certification
