Hiring: (Senior) Site Reliability Engineer
AWS/GCP | Terraform | Python | Docker | Kubernetes
(Remote worldwide, CET working hours)
We’re looking for a (Senior) SRE to own and scale the cloud infrastructure behind an AI platform powering high-throughput ML workloads.
As a Site Reliability Engineer (SRE), you’ll take ownership of the cloud infrastructure powering high-throughput ML training and large-scale inference.
What you’ll do:
Build and operate high-availability, high-throughput systems for ML training & inference
Own cloud infrastructure with Terraform, AWS/GCP, and Infrastructure as Code
Implement observability, CI/CD pipelines, and automation to ensure reliability
Define SLOs/SLAs, lead incident response, and optimize for performance & cost
Partner with ML and product teams to productionize MLOps workflows
What we’re looking for:
7+ years in SRE, DevOps, or Platform Engineering (4+ for mid-level)
Strong Terraform, AWS/GCP, Python, Docker & Kubernetes experience
Proven track record delivering highly reliable, scalable systems
Bonus: ML infra/MLOps experience, startup experience, observability stacks
Why join:
High ownership and technical influence
Work with cutting-edge AI at scale
Fast-paced, impact-driven environment with growth opportunities