Are you looking for a career that makes a positive difference in your life and in the lives of learners and educators across the globe? Do you want to work with fun and social people in a positive and engaged virtual office environment?

We are hiring a Senior Site Reliability Engineer who will build and support reliable, high-capacity, and well-performing systems in support of our mission to protect and improve our customer platforms, with an ever-watchful eye on reliability, security, performance, cost, and operational excellence.

We call this work Site Reliability Engineering.

As a Site Reliability Engineer within a small team, you will collaborate in a DevOps model with product development teams, designing, deploying, and managing automation tools that increase predictability and shorten time to market while reducing cost.

Our stack

Code: Java, PHP, Node.js, and Go

RDBMS: Oracle, PostgreSQL, MySQL

Cache: Couchbase, Redis, ElastiCache, DynamoDB

Containers: ECS, EKS, K8S, Docker

Cloud: AWS

Telemetry: New Relic, CloudWatch

Build: Jenkins, CircleCI, GitHub Actions

Run: PagerDuty, Exigence

Infrastructure as Code: Terraform, CloudFormation

Your contributions

Cloud Engineering

Hands-on design, analysis, development, and troubleshooting of highly distributed, large-scale production systems and event-driven, cloud-based services

Ensure repeatability, traceability, and transparency of our infrastructure automation (infrastructure-as-code, monitoring-as-code)

Participate in continual learning of the AWS ecosystem, game day scenarios, and professional conferences

Collaboratively design solutions for enterprise applications with development teams using our software stack

Actively monitor AWS costs and use optimizer tools to maximize ROI while maintaining Service Level Objectives

Observability Engineering

Ownership of reliability, uptime, system security, cost, operations, capacity, resiliency, and performance, including analysis of each

Define, monitor, and report on service level indicators for application workloads

Support on-call rotations for operational duties that have not been addressed with automation, with an eye for correcting issues that result in on-call alarms

Maintain telemetry that improves visibility into our applications' performance and business metrics and keeps operational workload in check

Develop, communicate, collaborate on, and monitor standard processes to promote the long-term health and sustainability of operational development tasks

DevSecOps

Support healthy software development practices, including complying with agile software development methodology, building standards for code reviews, work packaging, and continuous delivery

Partner with CyberSecurity to develop plans and automation that respond to new risks and vulnerabilities

Systems Engineering

Collaborate with Systems Administrators to coordinate middleware, network, storage, database, Windows, Linux, and VMware maintenance

Automate maintenance of legacy on-premises systems and migrate them to the cloud via thoughtful redesign

Resiliency Engineering

Collaborate with dev teams to identify failure points and blast radius of systems

Validate effectiveness of monitoring and observability configurations

Coordinate failure injection testing

Observe and document steady-state production levels and growth patterns

Plan and forecast for seasonal growth, communicate trend lines to leadership, and enhance infrastructure scaling plans to accommodate 2x planned load

Coordinate improvements of existing software and infrastructure to meet resiliency goals

Must-haves for this role

We are looking for a senior reliability engineer who can work with cross-functional teams.

The candidate must have strong experience in Terraform.

This person should be able to work with stakeholders and should have experience leading P1 and P2 teams.

Candidates should have experience supporting on-call rotations for operational duties that have not been addressed with automation, with an eye for correcting the issues that result in on-call alarms.

Candidate must have EKS and K8S experience.

Candidate should be a good communicator.

Interviews

Two rounds of interviews will be conducted.

One round is with the hiring manager.

The other is a panel interview with the team.

8281 - Site Reliability Engineer Cloud, Infrastructure and ITOps
