Do you aspire to take on a strategic, leadership-oriented role where you design and guide infrastructure at an architectural level? Are you passionate about identifying and solving complex operational challenges, improving system reliability, and driving modernization? Do you thrive on designing scalable, fault-tolerant systems and implementing automation that transforms on-prem applications into cloud-native solutions? If so, this role is the perfect next step in your journey!
As a Senior Site Reliability Engineer, you’ll take ownership of critical infrastructure and reliability initiatives that power our applications and services. You’ll design, automate, and optimize systems to improve performance, scalability, and operational efficiency — while driving adoption of reliability best practices across the engineering organization.
You’ll act as a technical leader and mentor, collaborating closely with developers, operations, and security teams to solve complex challenges and advance our cloud-first strategy. This role requires deep technical expertise, sound judgment, and the ability to translate reliability goals into measurable outcomes that benefit both our users and our business.
Location and Logistics
Hybrid role requiring 3+ days per week in our Bellevue, WA office
Local candidates will be interviewed in person in the Bellevue office
We are unable to offer visa sponsorship, visa transfer, or corp-to-corp arrangements
Key Responsibilities:
Cloud Strategy and Architecture
Own key cloud architecture initiatives, guiding design decisions for scalability, security, and cost efficiency
Partner with architecture and engineering leadership to define modernization standards and patterns (containerization, microservices, serverless)
Evaluate and introduce emerging cloud technologies to enhance performance, reliability, and developer autonomy
Drive adoption of a cloud-first mindset and infrastructure best practices across teams
Infrastructure Automation & Design
Lead design and implementation of infrastructure automation using IaC tools such as Terraform, Terragrunt, and Puppet
Apply GitOps principles for configuration management and application delivery
Build and maintain CI/CD pipelines that ensure reliable, repeatable deployments (GitLab preferred)
Develop reusable, modular automation components and mentor others on automation standards
Reliability and Performance Engineering
Define and own service-level indicators (SLIs), service-level objectives (SLOs), and error budgets for critical systems
Drive continuous improvement of uptime, latency, and scalability through instrumentation and testing
Implement and evolve observability stacks (monitoring, logging, tracing) using Datadog, Prometheus, Grafana, or similar tools
Conduct capacity planning, load testing, and chaos engineering to proactively identify weaknesses
Incident Management & Resilience
Lead incident response for critical production systems, ensuring rapid recovery and clear communication
Facilitate blameless post-incident reviews and drive remediation of root causes
Develop and maintain operational runbooks, escalation paths, and playbooks
Advocate for a culture of transparency, accountability, and learning within incident management
Security & Compliance
Partner with Security Engineering to implement secure-by-default infrastructure designs and monitor compliance with PCI, SOC 2, and other standards
Proactively detect, investigate, and remediate security vulnerabilities and misconfigurations
Integrate security scanning and validation into CI/CD pipelines
Disaster Recovery & Business Continuity
Design and maintain disaster recovery (DR) and business continuity strategies
Test and validate RPO/RTO targets regularly, ensuring operational readiness and audit compliance
Cost Management & FinOps
Monitor and optimize cloud resource utilization through data-driven FinOps practices
Collaborate with finance and engineering stakeholders to improve cost visibility and accountability
Mentorship, Collaboration & Knowledge Sharing
Mentor peers and junior engineers through design reviews, code reviews, and paired work
Lead by example in documentation, automation quality, and technical decision-making
Partner with cross-functional teams to align reliability initiatives with product and business objectives
Contribute to a culture of continuous learning and operational excellence
Qualifications:
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
5+ years of experience as a Site Reliability Engineer or in a similar role, working with highly available production environments.
Proficiency in AWS and containerization technologies like Kubernetes and Docker.
Strong experience with Infrastructure as Code (IaC) using Terraform, with automation scripting skills in Python, Bash/Shell, or Go.
Deep knowledge of Linux/Unix systems and networking fundamentals (e.g., TCP/IP, DNS, HTTP, VPN).
Experience with monitoring and observability tools (e.g., Datadog, Prometheus, Grafana) and incident management.
Familiarity with CI/CD pipelines, preferably using tools like GitLab, and strong knowledge of DevOps practices.
Excellent troubleshooting skills, with experience in performance optimization and root cause analysis.
Strong communication and collaboration skills.
Bonus Skills:
Tools: Rundeck, Vector, Loki, VictoriaMetrics
Languages and frameworks: Java, Spring, Go
Multi-cloud experience (Azure, GCP)
Certifications: AWS Solutions Architect, Certified Kubernetes Administrator (CKA)
What Success Looks Like
Core services consistently meet or exceed SLOs and error budgets
Infrastructure deployments are automated, reproducible, and observable
Cost efficiency and system performance improve through data-driven insights
Post-incident reviews lead to measurable reliability gains
The team benefits from your mentorship, leadership, and technical influence
Classmates
Classmates is the premier online, social, and mobile destination for reconnecting with the people from your high school years. Classmates offers the largest digitized collection of high school yearbooks online, with over 450,000 available to view, tag, sign, and share, and has the most comprehensive directory of high schools and class lists from the 1940s to today.
Salary Range:
Min: $152,700
Mid: $170,800
Max: $190,600
The pay range reflects the salary amount the Company reasonably expects to pay for the position. It is not a guarantee of actual compensation or a specific payment amount to any candidate. The actual compensation will depend on numerous factors including, without limitation, a particular candidate’s experience and qualifications.
The Company's Applicant and Worker Privacy Notice can be found here.
PeopleConnect is an equal opportunity employer.
Local-area candidates are encouraged to apply. Please note that we are not able to offer visa sponsorship, visa transfer, or corp-to-corp arrangements.
Note for Principal Agencies - Principal agents should not forward resumes to PeopleConnect, as we will not be responsible for any fees arising from the use of resumes submitted from agencies without a prior written and signed agreement and authorized job order for this position in place.