Job Description: Site Reliability Engineer (SRE) – Observability
Toronto - Hybrid (1-2 days in office)
Role Summary
We are looking for an Observability Engineer to help implement, operate, and improve observability capabilities across our applications and platforms. This role focuses on hands-on onboarding, instrumentation, dashboarding, and alerting, working under established standards and guidance from senior engineers.
You will collaborate with application, SRE, and operations teams to ensure systems are observable, supportable, and production-ready.
Key Responsibilities
Observability Implementation
• Implement and maintain metrics, logs, and traces for applications and infrastructure
• Assist with onboarding applications into observability platforms (e.g., Dynatrace, ELK, Datadog)
• Configure dashboards, alerts, and basic anomaly detection
Application Support & Instrumentation
• Work with development teams to enable structured logging, basic distributed tracing, and core metrics
• Validate observability requirements during Production Readiness Reviews (PRR)
• Troubleshoot missing or low-quality telemetry
Monitoring & Alerting
• Configure alerts based on golden signals (latency, errors, traffic, saturation)
• Help reduce alert noise by tuning thresholds and alert logic
• Support incident response by gathering logs, metrics, and traces
Operations & Reliability
• Support root cause analysis using observability tools
• Maintain dashboards and documentation used by on-call and support teams
• Participate in on-call rotations (as applicable)
Automation & Continuous Improvement
• Assist in automating observability onboarding and validation tasks
• Create and maintain reusable dashboards and alert templates
• Follow established observability standards and best practices
Required Qualifications
• 2–4 years of experience in observability or SRE
• Working knowledge of metrics, logs, and basic tracing concepts
• Hands-on experience with at least one observability platform (Dynatrace, Elastic/ELK, Datadog, New Relic, etc.)
• Basic understanding of SLIs/SLOs and service health indicators
• Experience with cloud platforms or hybrid environments
• Ability to write scripts (Python, Bash, PowerShell) for automation and troubleshooting
Preferred Qualifications
• Experience with OpenTelemetry or APM agents
• Familiarity with Kubernetes or containerized workloads
• Experience working with incident management tools (PagerDuty, ServiceNow)
• Exposure to Dynatrace, Kibana/ELK, or similar cloud-native monitoring tools
• Experience in regulated or enterprise environments