Summary

The SRE team is responsible for the availability, reliability, performance, monitoring, change management, and emergency response of infrastructure and applications, and for reducing manual work by applying SRE principles and practices. The team works directly with development/DevOps, Operations, Product, and other teams to deploy new features and changes and to maintain infrastructure, operations, CI/CD, and IaC, so that availability and reliability targets are met and SLOs and SLAs are protected. We use a variety of DevOps automation tools such as Ansible, Docker, Kubernetes, Terraform, and Jenkins, along with AWS-specific services such as ECS, CloudFormation, EKS, OpsWorks, and Elastic Beanstalk. The Site Reliability Engineer is capable of implementing observability, SLOs, SLIs, SLAs, and disaster recovery and backup plans in cloud environments, mainly AWS.

Deliverables

Key Responsibilities:

Ensure the availability and reliability of distributed systems

Help the L1 team to resolve the client’s infrastructure/system issues, escalations, alerts, tickets, and queries

Act as a bridge between DevOps and other teams in order to build and maintain resilient systems

Conduct, coordinate, and oversee post-incident root cause analyses and reviews

Build and maintain documentation for all assigned clients / projects

Apply DevOps and Agile methodologies and standards in day-to-day work

Adopt and propose automation of repetitive tasks to reduce/eliminate toil

Implement observability tools such as Datadog, New Relic, Splunk, and CloudWatch, and use them to troubleshoot issues

Adopt SRE practices and ensure they are followed across the team

Maintain AWS managed resources, CI/CD pipelines, and IaC

Plan and implement disaster recovery and backup plans for AWS cloud platforms

Proactively work on efficiency and capacity planning

Keep a proactive approach to spotting problems, areas for improvement, and performance bottlenecks

Liaise and work closely with L1 on-call support, DevOps, and Operations teams

Drive availability and reliability by defining and implementing SLIs, SLOs, error budgets, observability, disaster recovery, and backups to detect and mitigate issues

Qualifications:

Bachelor’s degree in computer science (preferred) or an equivalent management, technical, or scientific discipline

Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript

Clear understanding of SRE principles and practices, and of Agile and DevOps methodologies

Experience applying the AWS Well-Architected Framework to implement scalable and reliable infrastructure

Great team player with flexibility to work

Excellent written/verbal communication and leadership skills


Site Reliability Engineer - Bilingual - Portuguese and English

