Site Reliability Engineer

Procom • 🌐 In Person

Expired Posted 2 months, 2 weeks ago

Job Description

Position purpose:

The Senior Site Reliability Engineer (SRE) ensures the reliability, scalability, and performance of cloud-based software solutions. This role blends software engineering and systems administration to support and enhance critical infrastructure, working closely with development and operations teams to deliver secure and cost-effective cloud environments.

Essential Duties and Responsibilities:

Cloud Infrastructure Architecture and Implementation:

Designs, builds, and maintains robust cloud infrastructure solutions using AWS and other cloud technologies.

Mentorship and Team Development:

Provides technical guidance and mentorship to junior SREs, promoting a culture of continuous learning and improvement.

Operational Efficiency and Automation:

Identifies and implements process improvements through automation and optimization to enhance reliability and reduce manual effort.

Performance and Reliability Management:

Develops and executes strategies to meet and exceed Service Level Objectives (SLOs) and Service Level Agreements (SLAs).

Incident Management:

Leads incident response efforts, performs root cause analysis, and implements preventive measures to minimize downtime.

Capacity Planning and System Optimization:

Proactively identifies performance bottlenecks, optimizes resource utilization, and ensures system scalability.

Security and Compliance:

Implements cloud security best practices, including least-privilege IAM policies, secrets management, and evidence generation for compliance frameworks (e.g., SOC 2, ISO 27001).

Other duties and projects as assigned.

Required Skills/Abilities:

Strong problem-solving, troubleshooting, and analytical skills.

Excellent communication and collaboration abilities.

Strong organizational skills and attention to detail.

Ability to manage time and prioritize tasks effectively.

Proficiency in scripting languages (e.g., Python, PowerShell).

In-depth knowledge of Linux systems, networking, load balancing, and security principles.

Experience:

5+ years in Site Reliability Engineering or a similar role.

Extensive expertise in the AWS (Amazon Web Services) cloud platform and its services.

Experience with GitOps practices and CI/CD tooling (e.g., GitHub Actions, Jenkins, ArgoCD, or similar).

Experience with Infrastructure as Code (e.g., Terraform).

Experience designing and maintaining observability stacks (e.g., Prometheus, Grafana, ELK) with a focus on actionable metrics, alerting, and SLOs.
