Site Reliability Engineer - AI/ML, Kubernetes

Hamilton Barnes 🌳 • 🌐 Remote

Remote Posted 2 days, 15 hours ago

Job Description

Site Reliability Engineer – AI, GPU & Kubernetes Infrastructure

Only Australian citizens or permanent residents will be considered, due to customer requirements.

Overview

Our client is a stealth-mode hyperscale data center company building a next-generation AI and cloud platform powered by thousands of NVIDIA GPUs. The platform is designed to support frontier AI workloads, including large-scale model training, experimentation, and high-throughput inference.

This role carries responsibility for reliability, performance, and operational excellence across a large-scale GPU environment. The successful candidate will play a critical role in ensuring the stability and scalability of one of the most advanced private AI infrastructure platforms in production.

Key Responsibilities

Deploy and operate hyperscale GPU clusters optimized for AI training and inference workloads across Kubernetes and virtualized environments.

Own Kubernetes orchestration for GPU workloads, including scheduling efficiency, capacity planning, and fault tolerance.

Build automation-driven systems for provisioning, scaling, and managing GPU infrastructure across hundreds of nodes.

Develop and maintain observability, alerting, and auto-remediation frameworks to support high availability and performance.

Collaborate closely with ML, platform, and networking teams to optimize GPU utilization, throughput, and data movement.

Implement and enforce Infrastructure as Code, CI/CD pipelines, and operational reliability standards.

Diagnose complex performance and reliability issues across compute, networking, and storage layers.

Act as a regional point of ownership, providing clear communication and operational leadership during incidents and reviews.

Expectations

Demonstrated ability to operate independently in high-impact environments.

Clear, concise communicator, particularly during incidents or critical operational events.

Strong sense of accountability and pride in system reliability and operational quality.

Proactive in identifying risks and driving continuous improvement.

Required Experience

2+ years of experience in SRE, infrastructure, or platform engineering roles supporting large-scale compute environments (GPU- or AI-focused).

Deep hands-on expertise with Kubernetes in production, particularly for GPU-backed or high-performance workloads.

Proven experience designing or operating GPU infrastructure at scale.

Strong proficiency with Infrastructure as Code tools such as Terraform or Pulumi.

Programming experience in Python, Go, or Bash for automation and tooling.

Experience with observability platforms and incident response (Prometheus, Grafana, Loki, etc.).

Demonstrated interest in or passion for AI, ML systems, or GPU-centric infrastructure.

Eligible for NV1 and NV2 clearance (desirable but not required).

Benefits

Competitive compensation with equity participation.

Remote or hybrid working offered.

Opportunity to operate and scale cutting-edge AI infrastructure in a high-impact role.

If interested, please apply or reach out to mitchell.cole@hamilton-barnes.com
