SRE/Observability Engineer

Toronto - Hybrid (1-2 days)

Role Summary

We are looking for an Observability Engineer to help implement, operate, and improve observability capabilities across our applications and platforms. This role focuses on hands-on onboarding, instrumentation, dashboarding, and alerting, working under established standards and guidance from senior engineers.

You will collaborate with application, SRE, and operations teams to ensure systems are observable, supportable, and production-ready.

Key Responsibilities

Observability Implementation

Implement and maintain metrics, logs, and traces for applications and infrastructure

Assist with onboarding applications into observability platforms (e.g., Dynatrace, ELK, Datadog)

Configure dashboards, alerts, and basic anomaly detection

Application Support & Instrumentation

Work with development teams to enable structured logging, basic distributed tracing, and core metrics

Validate observability requirements during Production Readiness Reviews (PRR)

Troubleshoot missing or low-quality telemetry

Monitoring & Alerting

Configure alerts based on golden signals (latency, errors, traffic, saturation)

Help reduce alert noise by tuning thresholds and alert logic

Support incident response by gathering logs, metrics, and traces

Operations & Reliability

Support root cause analysis using observability tools

Maintain dashboards and documentation used by on-call and support teams

Participate in on-call rotations (as applicable)

Automation & Continuous Improvement

Assist in automating observability onboarding and validation tasks

Create and maintain reusable dashboards and alert templates

Follow established observability standards and best practices

Required Qualifications

2–4 years of experience in observability or SRE

Working knowledge of metrics, logs, and basic tracing concepts

Hands-on experience with at least one observability platform (Dynatrace, Elastic/ELK, Datadog, New Relic, etc.)

Basic understanding of SLIs/SLOs and service health indicators

Experience with cloud platforms or hybrid environments

Ability to write scripts (Python, Bash, PowerShell) for automation and troubleshooting

Preferred Qualifications

Experience with OpenTelemetry or APM agents

Familiarity with Kubernetes or containerized workloads

Experience working with incident management tools (PagerDuty, ServiceNow)

Exposure to Dynatrace, Kibana/ELK, or similar cloud-native monitoring tools

Experience in regulated or enterprise environments
