Site Reliability Engineer – AI, GPU & Kubernetes Infrastructure
Only Australian Citizens or Permanent Residents will be considered due to customer requirements.
Overview
Our client is a stealth-mode hyperscale data center company building a next-generation AI and cloud platform powered by thousands of NVIDIA GPUs. The platform is designed to support frontier AI workloads, including large-scale model training, experimentation, and high-throughput inference.
This role carries responsibility for reliability, performance, and operational excellence across a large-scale GPU environment. The successful candidate will play a critical role in ensuring the stability and scalability of one of the most advanced private AI infrastructure platforms in production.
Key Responsibilities
Deploy and operate hyperscale GPU clusters optimized for AI training and inference workloads across a Kubernetes and virtualized environment.
Own Kubernetes orchestration for GPU workloads, including scheduling efficiency, capacity planning, and fault tolerance.
Build automation-driven systems for provisioning, scaling, and managing GPU infrastructure across hundreds of nodes.
Develop and maintain observability, alerting, and auto-remediation frameworks to support high availability and performance.
Collaborate closely with ML, platform, and networking teams to optimize GPU utilization, throughput, and data movement.
Implement and enforce Infrastructure as Code, CI/CD pipelines, and operational reliability standards.
Diagnose complex performance and reliability issues across compute, networking, and storage layers.
Act as a regional point of ownership, providing clear communication and operational leadership during incidents and reviews.
Expectations
Demonstrated ability to operate independently in high-impact environments.
Clear, concise communicator, particularly during incidents or critical operational events.
Strong sense of accountability and pride in system reliability and operational quality.
Proactive in identifying risks and driving continuous improvement.
Required Experience
2+ years of experience in SRE, infrastructure, or platform engineering roles supporting large-scale compute environments (GPU- and AI-focused).
Deep hands-on expertise with Kubernetes in production, particularly for GPU-backed or high-performance workloads.
Proven experience designing or operating GPU infrastructure at scale.
Strong proficiency with Infrastructure as Code tools such as Terraform or Pulumi.
Programming experience in Python, Go, or Bash for automation and tooling.
Experience with observability platforms and incident response (Prometheus, Grafana, Loki, etc.).
Demonstrated interest or passion for AI, ML systems, or GPU-centric infrastructure.
Eligibility for NV1 or NV2 clearance (desirable but not required).
Benefits
Competitive compensation with equity participation.
Remote or hybrid working offered.
Opportunity to operate and scale cutting-edge AI infrastructure in a high-impact role.
If interested, please apply or reach out to mitchell.cole@hamilton-barnes.com.