We are looking for a Lead AI Platform Engineer to architect, deploy, and manage scalable Databricks platforms on AWS that support advanced ML and analytics pipelines.
In this role, you will work closely with data scientists and ML engineers to enhance the Lakehouse developer environment and drive innovation in AI infrastructure. Join us to lead the development of state-of-the-art AI platform solutions.
Responsibilities
Architect and deploy scalable Databricks platform solutions for analytics, machine learning, and GenAI workflows across multiple environments
Manage and enhance Databricks workspaces, including cluster policies, autoscaling, GPU compute, and job clusters
Oversee Unity Catalog governance by managing metastores, catalogs, schemas, data sharing, masking, lineage, and access control
Develop and maintain Infrastructure as Code with Terraform to enable automated, consistent platform provisioning
Establish CI/CD pipelines for notebooks, libraries, DLT processes, and ML assets using GitHub Actions and Databricks APIs
Standardize experiment tracking and model registry workflows with MLflow, and manage model serving endpoints with monitoring and rollback
Optimize Delta Lake batch and streaming pipelines using Auto Loader, Structured Streaming, and DLT while ensuring data quality and SLA compliance
Collaborate with cross-functional teams to integrate platform features and deliver an exceptional developer experience
Monitor system performance, troubleshoot issues, and implement enhancements to guarantee platform reliability and scalability
Document platform operations and maintain automation runbooks for governance and support
Coordinate with security teams to enforce data governance, encryption, and compliance standards
Champion best practices in coding, testing, and deployment across the platform engineering team
Drive ongoing improvements in automation and operational efficiency for the platform
Engage stakeholders to capture requirements and provide expert technical guidance
Lead and mentor junior engineers, sharing expertise in platform technologies
Requirements
At least 5 years in platform engineering, with proven expertise administering Databricks on AWS, including Unity Catalog governance and enterprise integrations
Comprehensive knowledge of AWS services such as VPC, IAM, KMS, S3, CloudWatch, and network architecture
Advanced Terraform skills, including the Databricks provider, and experience with Infrastructure as Code for cloud environments
Strong proficiency in Python and SQL, including packaging libraries and managing notebooks and repositories
Experience using MLflow for experiment tracking, model registry, and model serving endpoints
Familiarity with Delta Lake, Auto Loader, Structured Streaming, and DLT technologies
Solid experience implementing DevOps automation and CI/CD pipelines with GitHub Actions or similar tools
Expertise in Git and GitHub, including code review processes and branching strategies
Working knowledge of REST APIs, Databricks CLI, and automation scripting
Excellent communication and stakeholder management abilities
Capacity to work autonomously and within distributed teams
Detail-focused with strong problem-solving and organizational skills
English language proficiency at B2 (Upper-Intermediate) level or above
Nice to have
Hands-on experience with AWS EKS and Kubernetes
Understanding of MLOps methodologies and pipeline automation
Knowledge of attribute-based access control and enhanced data governance frameworks
Experience with secrets management and SSO/SCIM provisioning
Relevant certifications in AWS or Databricks platform engineering