Job Description: Site Reliability Engineer (SRE) – Observability

Toronto - Hybrid (1-2 days in office)

Role Summary

We are looking for an Observability Engineer to help implement, operate, and improve observability capabilities across our applications and platforms. This role focuses on hands-on onboarding, instrumentation, dashboarding, and alerting, working under established standards and with guidance from senior engineers.

You will collaborate with application, SRE, and operations teams to ensure systems are observable, supportable, and production-ready.

Key Responsibilities

Observability Implementation

• Implement and maintain metrics, logs, and traces for applications and infrastructure
• Assist with onboarding applications into observability platforms (e.g., Dynatrace, ELK, Datadog)
• Configure dashboards, alerts, and basic anomaly detection

Application Support & Instrumentation

• Work with development teams to enable structured logging, basic distributed tracing, and core metrics
• Validate observability requirements during Production Readiness Reviews (PRR)
• Troubleshoot missing or low-quality telemetry

Monitoring & Alerting

• Configure alerts based on golden signals (latency, errors, traffic, saturation)
• Help reduce alert noise by tuning thresholds and alert logic
• Support incident response by gathering logs, metrics, and traces

Operations & Reliability

• Support root cause analysis using observability tools
• Maintain dashboards and documentation used by on-call and support teams
• Participate in on-call rotations (as applicable)

Automation & Continuous Improvement

• Assist in automating observability onboarding and validation tasks
• Create and maintain reusable dashboards and alert templates
• Follow established observability standards and best practices

Required Qualifications

• 2–4 years of experience in observability or SRE
• Working knowledge of metrics, logs, and basic tracing concepts
• Hands-on experience with at least one observability platform (Dynatrace, Elastic/ELK, Datadog, New Relic, etc.)
• Basic understanding of SLIs/SLOs and service health indicators
• Experience with cloud platforms or hybrid environments
• Ability to write scripts (Python, Bash, PowerShell) for automation and troubleshooting

Preferred Qualifications

• Experience with OpenTelemetry or APM agents
• Familiarity with Kubernetes or containerized workloads
• Experience working with incident management tools (PagerDuty, ServiceNow)
• Exposure to Dynatrace, Kibana/ELK, or similar cloud-native monitoring tools
• Experience in regulated or enterprise environments
