Senior DevOps Software Engineer / SRE
AI & LLMs
We’re looking for a skilled Senior DevOps Engineer to build and manage the infrastructure powering our AI and Large Language Model (LLM) systems. You’ll design scalable cloud environments, automate deployments, and streamline pipelines that accelerate model training and inference across distributed platforms. If you enjoy solving complex infrastructure challenges for next-gen AI, this role is for you.
Key Responsibilities
Architect and maintain AI/LLM infrastructure on AWS, GCP, or OpenStack, including GPU clusters and scalable compute environments.
Develop CI/CD pipelines and automate deployments using Terraform, Ansible, Jenkins, and GitLab CI.
Manage containerized systems with Docker and Kubernetes for seamless model training, fine-tuning, and serving.
Implement monitoring, observability, and security for cloud and model environments using tools like Prometheus, Grafana, and Splunk.
Collaborate with software and ML teams to support efficient model versioning, inference APIs, and continuous delivery cycles.
Qualifications
Strong background in DevOps/MLOps with experience supporting AI or LLM workloads.
Expertise with cloud platforms (AWS/GCP/OpenStack) and containerization (Docker/Kubernetes).
Proficiency in scripting (Python, Bash, or PowerShell) and automation tools (Terraform, Ansible, Chef).
Familiarity with distributed training frameworks (Ray, DeepSpeed, or Horovod) and model-serving tools (Triton, FastAPI).
Excellent communication and problem-solving skills in fast-paced, cross-functional environments.
Job Type: Full-time
Pay: $200,093.56 - $250,628.80 per year
Benefits:
401(k) matching
Dental insurance
Health insurance
Life insurance
Parental leave
Retirement plan
Vision insurance
Work Location: Remote