EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Join our team as a Site Reliability Engineer, where you will ensure system reliability, manage incident responses, and enable seamless collaboration between operations and development teams.

This role demands a background in Oil & Gas combined with expertise in automation and cloud technologies. Apply now to support critical infrastructure and drive operational excellence.

Responsibilities

Oversee and enhance the product monitoring system

Handle incidents, including troubleshooting, resolution, documentation, and analysis

Distribute knowledge and insights across teams

Facilitate collaboration between operations and development

Create automation for log analysis, testing production systems, and alerting

Track system health, performance, and SLIs/SLOs/SLAs

Maintain documentation for incident management procedures

Conduct incident analyses and implement corrective actions

Respond to on-call support requests during and after business hours

Collaborate with teams to enhance system efficiency and reliability

Leverage tools such as PagerDuty, ELK/Kibana, SEQ logging, Prometheus, and Grafana for system monitoring

Develop scripts and implement automation solutions using Python, C#, and Bash

Manage orchestration and infrastructure through SaltStack and Docker

Support project workflows using Azure DevOps and maintain a comprehensive Wiki

Maintain code repositories and manage version control using Git

Requirements

1+ years of experience building solutions, particularly in Site Reliability Engineering

Expertise in cloud services and automation scripting with Python and Bash

Background in Oil & Gas operations and incident handling

Skill in managing incident responses and providing on-call support

Familiarity with monitoring tools such as Prometheus and Grafana

Proficiency in logging tools like ELK/Kibana and SEQ logging

Knowledge of orchestration and infrastructure solutions including SaltStack and Docker

Understanding of fundamental networking concepts like inbound/outbound rules and firewalls

Proficiency in tools for project management and issue tracking like Azure DevOps

Capability to manage source code with Git

Strong skills in creating documentation and disseminating knowledge

Competency in conducting detailed post-incident reviews

Excellent troubleshooting abilities and problem-solving skills

Effective communication skills, with an English level of at least B2

Nice to have

Experience using PagerDuty for incident handling

Competency in C# programming

Understanding of SQL and MongoDB databases

Background in Zededa infrastructure

Experience in supporting Oil & Gas field operations

We offer

International projects with top brands

Work with global teams of highly skilled, diverse peers

Healthcare benefits

Employee financial programs

Paid time off and sick leave

Upskilling, reskilling and certification courses

Unlimited access to the LinkedIn Learning library and 22,000+ courses

Global career opportunities

Volunteer and community involvement opportunities

EPAM Employee Groups

Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Site Reliability Engineer – Azure DevOps
