Do you aspire to take on a strategic, leadership-oriented role where you design and guide infrastructure at an architectural level? Are you passionate about identifying and solving complex operational challenges, improving system reliability, and driving modernization? Do you thrive on designing scalable, fault-tolerant systems and implementing automation that transforms on-prem applications into cloud-native solutions? If so, this role is the perfect next step in your journey!

As a Senior Site Reliability Engineer, you’ll take ownership of critical infrastructure and reliability initiatives that power our applications and services. You’ll design, automate, and optimize systems to improve performance, scalability, and operational efficiency — while driving adoption of reliability best practices across the engineering organization.

You’ll act as a technical leader and mentor, collaborating closely with developers, operations, and security teams to solve complex challenges and advance our cloud-first strategy. This role requires deep technical expertise, sound judgment, and the ability to translate reliability goals into measurable outcomes that benefit both our users and our business.

Location and Logistics

Hybrid role requiring 3+ days per week in our Bellevue, WA office

Local candidates will be interviewed in person at the Bellevue office

We are unable to offer visa sponsorship, visa transfer, or corp-to-corp arrangements

Key Responsibilities:

Cloud Strategy and Architecture

Own key cloud architecture initiatives, guiding design decisions for scalability, security, and cost efficiency

Partner with architecture and engineering leadership to define modernization standards and patterns (containerization, microservices, serverless)

Evaluate and introduce emerging cloud technologies to enhance performance, reliability, and developer autonomy

Drive adoption of a cloud-first mindset and infrastructure best practices across teams

Infrastructure Automation & Design

Lead design and implementation of infrastructure automation using IaC tools such as Terraform, Terragrunt, and Puppet

Apply GitOps principles for configuration management and application delivery

Build and maintain CI/CD pipelines that ensure reliable, repeatable deployments (GitLab preferred)

Develop reusable, modular automation components and mentor others on automation standards

Reliability and Performance Engineering

Define and own service-level indicators (SLIs), service-level objectives (SLOs), and error budgets for critical systems

Drive continuous improvement of uptime, latency, and scalability through instrumentation and testing

Implement and evolve observability stacks (monitoring, logging, tracing) using Datadog, Prometheus, Grafana, or similar tools

Conduct capacity planning, load testing, and chaos engineering to proactively identify weaknesses

Incident Management & Resilience

Lead incident response for critical production systems, ensuring rapid recovery and clear communication

Facilitate blameless post-incident reviews and drive remediation of root causes

Develop and maintain operational runbooks, escalation paths, and playbooks

Advocate for a culture of transparency, accountability, and learning within incident management

Security & Compliance

Partner with Security Engineering to implement secure-by-default infrastructure designs and monitor compliance with PCI, SOC2, and other standards

Proactively detect, investigate, and remediate security vulnerabilities and misconfigurations

Integrate security scanning and validation into CI/CD pipelines

Disaster Recovery & Business Continuity

Design and maintain disaster recovery (DR) and business continuity strategies

Test and validate RPO/RTO targets regularly, ensuring operational readiness and audit compliance

Cost Management & FinOps

Monitor and optimize cloud resource utilization through data-driven FinOps practices

Collaborate with finance and engineering stakeholders to improve cost visibility and accountability

Mentorship, Collaboration & Knowledge Sharing

Mentor peers and junior engineers through design reviews, code reviews, and paired work

Lead by example in documentation, automation quality, and technical decision-making

Partner with cross-functional teams to align reliability initiatives with product and business objectives

Contribute to a culture of continuous learning and operational excellence

Qualifications:

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience.

5+ years of experience as a Site Reliability Engineer or in a similar role, working with highly available production environments.

Proficiency in AWS and containerization technologies like Kubernetes and Docker.

Strong experience with Infrastructure as Code (IaC) using Terraform, with automation scripting skills in Python, Bash/Shell, or Go.

Deep knowledge of Linux/Unix systems and networking fundamentals (e.g., TCP/IP, DNS, HTTP, VPN).

Experience with monitoring and observability tools (e.g., Datadog, Prometheus, Grafana) and incident management.

Familiarity with CI/CD pipelines, preferably using tools like GitLab, and strong knowledge of DevOps practices.

Excellent troubleshooting skills, with experience in performance optimization and root cause analysis.

Strong communication and collaboration skills.

Bonus Skills:

Tools: Rundeck, Vector, Loki, VictoriaMetrics

Languages and frameworks: Java, Spring, Go

Multi-cloud experience (Azure, GCP)

Certifications: AWS Solutions Architect, Certified Kubernetes Administrator (CKA)

What Success Looks Like

Core services consistently meet or exceed SLOs and stay within error budgets

Infrastructure deployments are automated, reproducible, and observable

Cost efficiency and system performance improve through data-driven insights

Post-incident reviews lead to measurable reliability gains

The team benefits from your mentorship, leadership, and technical influence

Classmates

Classmates is the premier online, social, and mobile destination for reconnecting with the people from your high school years. Classmates offers the largest digitized collection of high school yearbooks online, with over 450,000 available to view, tag, sign, and share, and has the most comprehensive directory of high schools and class lists from the 1940s to today.

Salary Range:

Min: $152,700

Mid: $170,800

Max: $190,600

The pay range reflects the salary amount the Company reasonably expects to pay for the position. It is not a guarantee of actual compensation or a specific payment amount to any candidate. The actual compensation will depend on numerous factors including, without limitation, a particular candidate’s experience and qualifications.

The Company's Applicant and Worker Privacy Notice can be found here.

PeopleConnect is an equal opportunity employer.

Local area candidates are encouraged to apply. Please note that we are not able to offer visa sponsorship, visa transfer, or corp-to-corp arrangements.

Note for Principal Agencies - Principal agents should not forward resumes to PeopleConnect, as we will not be responsible for any fees arising from the use of resumes submitted from agencies without a prior written and signed agreement and authorized job order for this position in place.


