
DevOps Engineer – AI Infrastructure & Kubernetes – Graduate Industry Traineeships (GRIT) Programme

• 🌐 In Person

Posted 1 day, 8 hours ago

Job Description

About The Role

Raydian Cloud is seeking a forward-thinking DevOps Engineer to help build and scale infrastructure that powers cutting-edge AI workloads. You’ll work at the intersection of cloud-native technologies and Artificial Intelligence operations (AIOps), enabling high-performance, secure, and automated environments for AI development and deployment. Your expertise in Infrastructure as Code and Kubernetes will be critical in supporting scalable AI pipelines and platform services.

Key Responsibilities

Design and manage cloud infrastructure optimized for AI/ML workloads using Infrastructure as Code (Terraform, Pulumi, etc.)

Deploy and maintain Kubernetes clusters tailored for GPU scheduling, distributed training, and inference workloads (see the GPU scheduling sketch after this list)

Build CI/CD pipelines for AI model training, validation, and deployment across environments

Collaborate with data scientists and ML engineers to streamline model lifecycle management

Implement observability and monitoring for AI services (e.g., Prometheus, Grafana, OpenTelemetry), as in the metrics sketch after this list

Ensure infrastructure security, compliance, and cost-efficiency in multi-tenant AI environments

Automate provisioning of AI-specific resources (e.g., GPU nodes, storage volumes, feature stores)

Document infrastructure patterns, DevOps workflows, and platform architecture
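
Purely as an illustration of the kind of work this involves (not part of the role description): a minimal sketch of scheduling a GPU-backed inference pod with the official Kubernetes Python client. It assumes a cluster running the NVIDIA device plugin, a hypothetical accelerator=nvidia-gpu node label, the default namespace, a local kubeconfig, and an illustrative Triton Server image tag.

```python
# Sketch: request one NVIDIA GPU for an inference pod via the Kubernetes Python client.
from kubernetes import client, config


def build_gpu_inference_pod(
    name: str = "triton-inference",
    image: str = "nvcr.io/nvidia/tritonserver:24.05-py3",  # illustrative image tag
) -> client.V1Pod:
    """Build a Pod spec that requests one GPU through the NVIDIA device plugin resource."""
    container = client.V1Container(
        name=name,
        image=image,
        resources=client.V1ResourceRequirements(
            # GPUs are requested via the extended resource exposed by the device
            # plugin; for extended resources, requests default to the limit.
            limits={"nvidia.com/gpu": "1"},
        ),
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, labels={"app": "inference"}),
        spec=client.V1PodSpec(
            containers=[container],
            # Hypothetical node label used to pin the pod to GPU nodes.
            node_selector={"accelerator": "nvidia-gpu"},
            restart_policy="Never",
        ),
    )


if __name__ == "__main__":
    config.load_kube_config()  # assumes a local kubeconfig is available
    pod = build_gpu_inference_pod()
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```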
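
In the same spirit, a minimal observability sketch using the prometheus_client library: it exposes request-count and latency metrics for a hypothetical inference service on port 8000 for Prometheus to scrape. The metric names and the simulated workload are illustrative assumptions, not details from this posting.

```python
# Sketch: expose basic Prometheus metrics for a hypothetical inference service.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names for an AI inference service.
INFERENCE_REQUESTS = Counter(
    "inference_requests_total", "Total inference requests served", ["model"]
)
INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds", "Inference latency in seconds", ["model"]
)


def handle_request(model: str) -> None:
    """Simulate one inference call and record its count and latency."""
    with INFERENCE_LATENCY.labels(model=model).time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real model work
    INFERENCE_REQUESTS.labels(model=model).inc()


if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus to scrape
    while True:
        handle_request("resnet50")
```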

Required Skills & Qualifications

Strong experience with Kubernetes, including GPU scheduling and Helm

Proficiency in Infrastructure as Code tools (Terraform, Pulumi, etc.); a short Pulumi sketch follows this list

Familiarity with cloud platforms (AWS, Azure, GCP) and AI services (e.g., SageMaker, Vertex AI)

Experience with CI/CD tools (GitHub Actions, GitLab CI, Argo Workflows)

Scripting skills in Python, Bash, or Go

Understanding of ML model lifecycle and data pipeline orchestration

Excellent communication and collaboration skills across technical and business teams
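
As a rough illustration of the Infrastructure as Code requirement above, the sketch below provisions a single GPU instance with Pulumi's Python SDK. The AMI ID, instance type, and tags are placeholders, and it assumes an existing Pulumi project with AWS credentials configured (applied with pulumi up).

```python
# Sketch: provision a placeholder GPU node with Pulumi's Python SDK.
import pulumi
import pulumi_aws as aws

gpu_node = aws.ec2.Instance(
    "gpu-training-node",
    ami="ami-0123456789abcdef0",  # placeholder AMI ID (e.g., a deep-learning image)
    instance_type="p3.2xlarge",   # illustrative single-GPU instance class
    tags={"team": "ml-platform", "workload": "training"},
)

# Export the instance ID so other stacks or scripts can reference it.
pulumi.export("gpu_node_id", gpu_node.id)
```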

Nice to Have

Experience with Kubeflow, MLflow, or similar MLOps frameworks (a short MLflow sketch follows this list)

Knowledge of containerized AI workloads (e.g., TensorFlow Serving, Triton Inference Server)

Familiarity with service mesh technologies (Istio, Linkerd) in AI microservices

Certifications in Kubernetes or cloud platforms (CKA, AWS DevOps Engineer)
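
For the MLOps frameworks item above, a minimal MLflow experiment-tracking sketch: it logs hypothetical parameters and a metric for a single run. The tracking URI and experiment name are assumptions, not details from this posting.

```python
# Sketch: log parameters and a metric for one run with MLflow tracking.
import mlflow

# Hypothetical tracking server and experiment name.
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("demo-experiment")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_param("batch_size", 32)
    # In a real pipeline this would come from the training/validation loop.
    mlflow.log_metric("val_accuracy", 0.91)
```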

Why Join Raydian Cloud?

Shape the future of AI infrastructure and platform services

Work with a visionary team blending deep tech and strategic execution

Influence architecture decisions in a fast-moving AI startup environment

Competitive compensation, flexible work culture, and growth opportunities
