Job Description
There are over 7,000 identified rare diseases, affecting more than 300 million patients worldwide and 1 in 12 people in Canada. Many of these patients remain undiagnosed and unaware of their condition, resulting in a poor quality of life and potentially serious consequences. Healwell AI (HWAI) (TSX: AIDX) is a leader in AI-enabled clinical intelligence for rare diseases and specialty conditions. Through our proprietary clinical intelligence platform and deep analytical tools, HWAI enables physicians to quickly understand complex, high-risk patients and place them on the right care pathways, leading to better outcomes for patients, their families, and the healthcare system.
HWAI is seeking an experienced MLOps Tech Lead to architect our next-generation AI infrastructure and lead a talented team of engineers. In this pivotal role, you will bridge the gap between Data Science, Cloud Engineering, and DevOps. You will not only be hands-on with our Azure/Databricks stack but will also set the technical vision, establish engineering standards, and ensure our AI platforms are secure, scalable, and cost-efficient. You will own the roadmap for our MLOps maturity, moving us from manual execution to fully automated, observable, and resilient AI systems, and you will have the opportunity to strengthen your technical leadership skills while contributing to impactful projects in healthcare.
Responsibilities
The successful candidate will work in a multifaceted role encompassing Cloud Architecture, Cloud Security, and DevOps/MLOps:
Lead, mentor, and grow a team of MLOps and Cloud Engineers; conduct code reviews, facilitate technical design sessions, and foster a culture of engineering excellence.
Define the high-level architecture for our end-to-end ML platform on Azure, making critical decisions on "build vs. buy" for tooling and infrastructure.
Oversee the Terraform codebase; implement modular, reusable infrastructure patterns and enforce state management policies to prevent drift.
Own site reliability (SRE) for our machine learning systems: define SLAs/SLOs for model inference and data pipelines, and lead root cause analysis (RCA) for critical incidents.
Manage cloud budgets (FinOps) for compute and Databricks usage, and enforce rigorous security postures (IAM, network isolation, private endpoints) to ensure compliance with industry standards.
Evolve our CI/CD pipelines from simple automation to advanced deployment strategies (Blue/Green, Canary releases, Shadow deployment) for ML models.
Deploy and maintain cloud-based ML models in production, ensuring performance and scalability.
Design, deploy, and manage scalable, secure, and highly available cloud infrastructure on Azure, utilizing infrastructure as code (IaC) principles.
Build monitoring systems for data quality, model performance, and pipeline health.
Collaborate with cross-functional teams to define problems and develop solutions.
Develop and maintain documentation for cloud architecture, processes, and systems.
Diagnose and resolve issues related to application and model performance, pipeline failures, and infrastructure problems.
Required Qualifications
Bachelor’s degree in Computer Science, Engineering, or a related field
7+ years of total experience in DevOps, Cloud Engineering, or Software Engineering
3+ years specifically focused on MLOps or Data Engineering at production scale
2+ years in a technical leadership or mentoring role (Team Lead, Principal Engineer, etc.)
Deep proficiency with Azure cloud and cloud-native services
Proficiency in Python and shell scripting
Hands-on experience with containerization technologies (Docker) and orchestration platforms (Kubernetes)
Advanced proficiency with Terraform
Deep hands-on experience with Databricks (MLflow, Spark, Unity Catalog)
Proven experience with orchestration tools (Dagster preferred)
Knowledge of PostgreSQL or an equivalent relational database
Experience with containerization, infrastructure as code, and DevOps/MLOps practices
Strong problem-solving skills and ability to work independently and collaboratively
Preferred Qualifications
Certifications such as Azure Solutions Architect Expert or Azure DevOps Engineer Expert
Relevant certifications in security domains
What You'll Work With
Data Platform: Databricks (Spark, Delta Lake) + Weaviate vector store
Orchestration: Dagster for pipeline management and scheduling
Cloud: Azure services for compute, storage, and ML services
Languages: Python, shell
Tools: Docker, Kubernetes, Terraform, Git, CI/CD pipelines
Monitoring: Custom dashboards, alerting systems, and model performance tracking
Culture & Work Environment
Communication: We value open and honest communication. Regular check-ins and team meetings ensure everyone is aligned and informed.
Transparency: Our decision-making processes are transparent, encouraging input from all team members. Your ideas and feedback will be valued.
Promptness: We maintain a fast-paced work environment and expect team members to be prompt in delivering work and meeting deadlines.
Guidance: You will be supported and guided by our VP of Technology, who will provide mentorship and direction throughout your time at HWAI.
What We Offer
Hands-on experience with real-world data challenges in the medical field.
Opportunities to expand your technical skill set and work with advanced AI tools.
A collaborative team environment that fosters learning and innovation.
We look forward to receiving your application and hope to welcome you to the HWAI team!
HWAI is an equal opportunity employer that welcomes all applicants, including persons with disabilities, visible minorities, women, and Aboriginal peoples. HWAI will provide reasonable accommodation to qualified job applicants with a disability, on request, and will notify successful applicants of policies relating to the accommodation of employees with disabilities. We would like to thank all applicants for their interest in HWAI, but please note that only successful candidates will be contacted.
You can learn more about HWAI at https://healwell.ai