About the Role
Raydian Cloud is seeking a forward-thinking DevOps Engineer to help build and scale infrastructure that powers cutting-edge AI workloads. You’ll work at the intersection of cloud-native technologies and Artificial Intelligence operations (AIOps), enabling high-performance, secure, and automated environments for AI development and deployment. Your expertise in Infrastructure as Code and Kubernetes will be critical in supporting scalable AI pipelines and platform services.
Key Responsibilities
Design and manage cloud infrastructure optimized for AI/ML workloads using Infrastructure as Code (Terraform, Pulumi, etc.)
Deploy and maintain Kubernetes clusters tailored for GPU scheduling, distributed training, and inference workloads
Build CI/CD pipelines for AI model training, validation, and deployment across environments
Collaborate with data scientists and ML engineers to streamline model lifecycle management
Implement observability and monitoring for AI services (e.g., Prometheus, Grafana, OpenTelemetry)
Ensure infrastructure security, compliance, and cost-efficiency in multi-tenant AI environments
Automate provisioning of AI-specific resources (e.g., GPU nodes, storage volumes, feature stores)
Document infrastructure patterns, DevOps workflows, and platform architecture
Required Skills & Qualifications
Strong experience with Kubernetes, including GPU scheduling and Helm
Proficiency in Infrastructure as Code tools (Terraform, Pulumi, etc.)
Familiarity with cloud platforms (AWS, Azure, GCP) and AI services (e.g., SageMaker, Vertex AI)
Experience with CI/CD tools (GitHub Actions, GitLab CI, Argo Workflows)
Scripting skills in Python, Bash, or Go
Understanding of ML model lifecycle and data pipeline orchestration
Excellent communication and collaboration skills across technical and business teams
Nice to Have
Experience with Kubeflow, MLflow, or similar MLOps frameworks
Knowledge of containerized AI workloads (e.g., TensorFlow Serving, Triton Inference Server)
Familiarity with service mesh technologies (Istio, Linkerd) for AI microservices
Certifications in Kubernetes or cloud platforms (e.g., CKA, AWS Certified DevOps Engineer)
Why Join Raydian Cloud?
Shape the future of AI infrastructure and platform services
Work with a visionary team blending deep tech and strategic execution
Influence architecture decisions in a fast-moving AI startup environment
Competitive compensation, flexible work culture, and growth opportunities