We are seeking a Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP).

This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments.

As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.

Responsibilities

Monitoring & Observability (Core Focus)

Own and operate the monitoring and observability stack across on-prem and GCP environments

Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications

Define, tune, and maintain alerts to ensure a high signal-to-noise ratio

Establish observability standards and best practices across teams

Improve visibility into system health, performance, and reliability

Site Reliability Engineering

Apply SRE principles to improve availability, performance, and resilience

Define and track SLIs, SLOs, and error budgets

Participate in on-call rotations and SEV incident response

Lead or contribute to incident investigations and root cause analysis (RCA)

Drive preventative actions to reduce repeat incidents

Kubernetes & Platform Reliability

Support and monitor Kubernetes environments (GKE and on-prem clusters)

Monitor cluster health, capacity, and resource utilization

Troubleshoot platform-level issues impacting application reliability

Collaborate with Platform and Engineering teams on reliability improvements

Secondary Responsibilities (Backup Application Support)

These responsibilities are activated only as needed and are not part of day-to-day operations

Provide L2/L3 application support coverage during:

Support team resource shortages

High-severity incidents (SEVs)

Peak support periods or escalations

Triage and troubleshoot application issues using existing runbooks and dashboards

Collaborate with Application Support and Engineering teams during incidents

Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)

Requirements

Strong experience as a Site Reliability Engineer or Reliability Engineer

Deep hands-on expertise with Grafana (dashboards, alerting, troubleshooting)

Solid experience with monitoring and observability systems

Production experience operating Kubernetes environments

Experience supporting systems in GCP and on-prem environments

Strong Linux systems and troubleshooting skills

Fluent English (written and spoken)

Ability to work Pacific Time (PST) hours

Ability to participate in an on-call rotation that includes coverage for one weekend day. Time worked during the weekend is compensated with one day off during the week, in accordance with the established work schedule

Technology Stack

Observability: Grafana, Prometheus, logging platforms

Containers: Kubernetes (GKE and on-prem)

Cloud: Google Cloud Platform (GCP)

Operations: Linux, networking, infrastructure monitoring

Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents)

Nice to Have

Experience supporting application teams during SEV incidents

Knowledge of capacity planning and performance tuning

Scripting skills (Python, Bash, etc.)

Experience with hybrid infrastructure environments

Benefits

At Devsu, we believe in creating an environment where you can thrive both personally and professionally. By joining our team, you'll enjoy:

A stable, long-term contract with opportunities for career growth

Private health insurance

A remote-friendly culture that promotes work-life balance

Continuous training, mentorship, and learning programs to keep you at the forefront of the industry

Free access to AI training resources and state-of-the-art AI tools to elevate your daily work

A flexible Paid Time Off (PTO) policy, plus paid holidays

Challenging, world-class software projects for clients in the US and LatAm

Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment

Join Devsu and discover a workplace that values your growth, supports your well-being, and empowers you to make a global impact.

Site Reliability Engineer (SRE) - GCP
