Hiring: (Senior) Site Reliability Engineer

(Remote worldwide, CET working hours)

We’re looking for a (Senior) SRE to own and scale the cloud infrastructure behind an AI platform powering high-throughput ML workloads.

As a Site Reliability Engineer (SRE), you’ll take ownership of the cloud infrastructure powering high-throughput ML training and large-scale inference.

What you’ll do:

Build and operate high-availability, high-throughput systems for ML training & inference

Own cloud infrastructure with Terraform, AWS/GCP, and Infrastructure as Code

Implement observability, CI/CD pipelines, and automation to ensure reliability

Define SLOs/SLAs, lead incident response, and optimize for performance & cost

Partner with ML and product teams to productionize MLOps workflows

What we’re looking for:

7+ years in SRE, DevOps, or Platform Engineering (4+ for mid-level)

Strong Terraform, AWS/GCP, Python, Docker & Kubernetes experience

Proven track record delivering highly reliable, scalable systems

Bonus: ML infra/MLOps experience, startup experience, observability stacks

Why join:

High ownership and technical influence

Work with cutting-edge AI at scale

Fast-paced, impact-driven environment with growth opportunities
