Overview

A Site Reliability Engineer ensures the reliability, scalability, and performance of systems and services. They bridge the gap between development and operations by applying software engineering principles to infrastructure and operations problems.

Key Responsibilities

System Reliability & Performance

Design, build, and maintain scalable and highly available systems.

Monitor system health and performance using observability tools.

Incident Management

Respond to production incidents, perform root cause analysis, and implement preventive measures.

Automation

Develop scripts and tools to automate repetitive tasks and improve efficiency.

Capacity Planning

Forecast system demands and plan for scaling infrastructure.

Collaboration

Work closely with development teams to ensure reliability is built into applications.

Security & Compliance

Implement best practices for system security and compliance.

Required Skills

Strong knowledge of Linux/Unix systems and networking fundamentals.

Proficiency in programming/scripting languages (Python, Go, Bash).

Experience with cloud platforms (AWS, Azure, GCP).

Familiarity with CI/CD pipelines and DevOps practices.

Expertise in monitoring tools (Prometheus, Grafana, ELK stack).

Understanding of containerization and orchestration (Docker, Kubernetes).

Qualifications

Bachelor’s degree in Computer Science, Engineering, or a related field.

3+ years of experience in system administration, DevOps, or SRE roles.

Strong problem-solving and troubleshooting skills.

Preferred

Experience with Infrastructure as Code (Terraform, Ansible).

Knowledge of distributed systems and microservices architecture.

Site Reliability Engineer

Job Description
