Site Reliability Engineer

xAI • 🌐 In Person

Posted 2 months ago

Job Description

As a Data Center Site Reliability Engineer (SRE) at xAI, you will play a pivotal role in ensuring the reliability, scalability, and performance of our state-of-the-art data center infrastructure, including the Colossus supercluster in Memphis, the world's largest AI training cluster with over 100,000 liquid-cooled Nvidia GPUs and plans for expansion to 1 million. This infrastructure powers advanced AI workloads, massive-scale model training, and products like Grok, enabling breakthroughs in understanding the universe. You will collaborate with cross-functional teams to automate operations, enhance observability, and maintain high availability for large-scale distributed systems. This is a hands-on technical position in a dynamic environment, offering the opportunity to tackle complex challenges at the intersection of AI, data center operations, and software reliability.

Key responsibilities include maintaining and improving the reliability and uptime of on-premises and cloud-based data center environments, designing and managing monitoring, logging, and alerting systems, developing infrastructure-as-code and continuous deployment pipelines, participating in on-call rotations and incident response, analyzing system performance and optimizing resource utilization, collaborating with hardware, networking, and software teams to design resilient solutions, creating documentation and standard operating procedures, and contributing to the efficiency of AI training pipelines.

Required qualifications include a Bachelor’s degree or equivalent experience, 5+ years in site reliability engineering or related fields, expert knowledge of Kubernetes, infrastructure-as-code tools, and CI/CD systems, proficiency in systems programming languages, strong troubleshooting skills, incident response experience, and excellent communication skills.

Preferred qualifications include experience with AI/ML workloads, familiarity with data center systems, certifications, and experience with both on-premises and cloud infrastructure at scale.
