SRE/Observability Engineer
Toronto - Hybrid (1-2 days)
Role Summary
We are looking for an Observability Engineer to help implement, operate, and improve observability capabilities across our applications and platforms. This role focuses on hands-on onboarding, instrumentation, dashboarding, and alerting, working under established standards and guidance from senior engineers.
You will collaborate with application, SRE, and operations teams to ensure systems are observable, supportable, and production-ready.
Key Responsibilities
Observability Implementation
Implement and maintain metrics, logs, and traces for applications and infrastructure
Assist with onboarding applications into observability platforms (e.g., Dynatrace, ELK, Datadog)
Configure dashboards, alerts, and basic anomaly detection
Application Support & Instrumentation
Work with development teams to enable structured logging, basic distributed tracing, and core metrics
Validate observability requirements during Production Readiness Reviews (PRR)
Troubleshoot missing or low-quality telemetry
Monitoring & Alerting
Configure alerts based on golden signals (latency, errors, traffic, saturation)
Help reduce alert noise by tuning thresholds and alert logic
Support incident response by gathering logs, metrics, and traces
Operations & Reliability
Support root cause analysis using observability tools
Maintain dashboards and documentation used by on-call and support teams
Participate in on-call rotations (as applicable)
Automation & Continuous Improvement
Assist in automating observability onboarding and validation tasks
Create and maintain reusable dashboards and alert templates
Follow established observability standards and best practices
Required Qualifications
2–4 years of experience in observability or SRE
Working knowledge of metrics, logs, and basic tracing concepts
Hands-on experience with at least one observability platform (Dynatrace, Elastic/ELK, Datadog, New Relic, etc.)
Basic understanding of SLIs/SLOs and service health indicators
Experience with cloud platforms or hybrid environments
Ability to write scripts (Python, Bash, PowerShell) for automation and troubleshooting
Preferred Qualifications
Experience with OpenTelemetry or APM agents
Familiarity with Kubernetes or containerized workloads
Experience working with incident management tools (PagerDuty, ServiceNow)
Exposure to Dynatrace, Kibana/ELK, or similar cloud-native monitoring tools
Experience in regulated or enterprise environments