Lead AI Platform Engineer - Databricks & AWS

EPAM Systems • 🌐 Remote

Job Description

We are looking for a Lead AI Platform Engineer to architect, deploy, and manage scalable Databricks platforms on AWS that support advanced ML and analytics pipelines.

In this role, you will work closely with data scientists and ML engineers to enhance the Lakehouse developer environment and drive innovation in AI infrastructure. Join us to lead the development of state-of-the-art AI platform solutions.

Responsibilities

Architect and deploy scalable Databricks platform solutions for analytics, machine learning, and GenAI workflows across multiple environments

Manage and enhance Databricks workspaces, including cluster policies, autoscaling, GPU compute, and job clusters

Oversee Unity Catalog governance by managing metastores, catalogs, schemas, data sharing, masking, lineage, and access control

Develop and maintain Infrastructure as Code with Terraform to enable automated, consistent platform provisioning

Establish CI/CD pipelines for notebooks, libraries, DLT pipelines, and ML assets using GitHub Actions and Databricks APIs

Standardize experiment tracking and model registry workflows with MLflow and manage model serving endpoints with monitoring and rollback

Optimize Delta Lake batch and streaming pipelines using Auto Loader, Structured Streaming, and DLT while ensuring data quality and SLA compliance

Collaborate with cross-functional teams to integrate platform features and deliver an exceptional developer experience

Monitor system performance, troubleshoot issues, and implement enhancements to ensure platform reliability and scalability

Document platform operations and maintain automation runbooks for governance and support

Coordinate with security teams to enforce data governance, encryption, and compliance standards

Champion best practices in coding, testing, and deployment across the platform engineering team

Drive ongoing improvements in automation and operational efficiency for the platform

Engage stakeholders to capture requirements and provide expert technical guidance

Lead and mentor junior engineers, sharing expertise in platform technologies

Requirements

At least 5 years of platform engineering experience, including proven expertise administering Databricks on AWS, Unity Catalog governance, and enterprise integrations

Comprehensive knowledge of AWS services such as VPC, IAM, KMS, S3, CloudWatch, and network architecture

Advanced Terraform skills, including the Databricks provider, and experience with Infrastructure as Code for cloud environments

Strong proficiency in Python and SQL, including packaging libraries and managing notebooks and repositories

Experience using MLflow for experiment tracking, model registry, and model serving endpoints

Familiarity with Delta Lake, Auto Loader, Structured Streaming, and DLT technologies

Solid experience implementing DevOps automation and CI/CD pipelines with GitHub Actions or similar tools

Expertise in Git and GitHub, including code review processes and branching strategies

Working knowledge of REST APIs, Databricks CLI, and automation scripting

Excellent communication and stakeholder management abilities

Ability to work both autonomously and within distributed teams

Detail-oriented, with strong problem-solving and organizational skills

English language proficiency at B2 (Upper-Intermediate) level or above

Nice to have

Hands-on experience with AWS EKS and Kubernetes

Understanding of MLOps methodologies and pipeline automation

Knowledge of attribute-based access control and enhanced data governance frameworks

Experience with secrets management and SSO/SCIM provisioning

Relevant certifications in AWS or Databricks platform engineering