EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
Join our team as a Site Reliability Engineer, where you will ensure system reliability, manage incident responses, and enable seamless collaboration between operations and development teams.
This role demands a background in Oil & Gas combined with expertise in automation and cloud technologies. Apply now to support critical infrastructure and drive operational excellence.
Responsibilities
Oversee and enhance the product monitoring system
Handle incidents, including troubleshooting, resolution, documentation, and analysis
Distribute knowledge and insights across teams
Facilitate collaboration between operations and development
Create automation for log analysis, testing production systems, and alerting
Track system health, performance, and SLIs/SLOs/SLAs
Maintain documentation for incident management procedures
Conduct incident analyses and implement corrective actions
Respond to on-call support requests during and after business hours
Collaborate with teams to enhance system efficiency and reliability
Leverage tools such as PagerDuty, ELK/Kibana, SEQ logging, Prometheus, and Grafana for system monitoring
Develop scripts and implement automation solutions using Python, C#, and Bash
Manage orchestration and infrastructure through SaltStack and Docker
Support project workflows using Azure DevOps and maintain a comprehensive Wiki
Maintain code repositories and implement version control systems using Git
Requirements
1+ years of experience in creating solutions, particularly in Site Reliability Engineering
Expertise in cloud services and automation scripting with Python and Bash
Background in Oil & Gas operations and incident handling
Skill in managing incident responses and providing on-call support
Familiarity with monitoring tools such as Prometheus and Grafana
Proficiency in logging tools like ELK/Kibana and SEQ logging
Knowledge of orchestration and infrastructure solutions including SaltStack and Docker
Understanding of fundamental networking concepts like inbound/outbound rules and firewalls
Proficiency in tools for project management and issue tracking like Azure DevOps
Capability to manage source code with Git
Strong skills in creating documentation and disseminating knowledge
Competency in conducting detailed post-incident reviews
Excellent troubleshooting abilities and problem-solving skills
Effective communication skills, with an English level of at least B2
Nice to have
Experience using PagerDuty for incident handling
Competency in C# programming
Understanding of SQL and MongoDB databases
Background in Zededa infrastructure
Experience in supporting Oil & Gas field operations
We offer
International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn