Roles and Responsibilities:
Implementation:
Build and optimize ETL/ELT processes leveraging Databricks' native capabilities to handle large volumes of structured and unstructured data from various sources
Implement data quality frameworks and monitoring solutions using Databricks data quality features to ensure data accuracy and reliability across all data products
Establish best practices for data governance, security, and compliance within the Databricks ecosystem and integrate with enterprise systems
Operational Responsibilities:
Monitor and maintain production data pipelines to ensure 99.9% uptime and optimal performance across all Databricks workloads and clusters
Implement comprehensive logging, alerting, and monitoring systems using Databricks monitoring capabilities and integration with enterprise monitoring tools
Perform regular health checks on Databricks cluster performance, job execution times, and resource utilization to identify and resolve bottlenecks proactively
Manage incident response procedures for Databricks pipeline failures, including root cause analysis, resolution, and post-incident reviews
Establish and maintain disaster recovery procedures and backup strategies for critical data assets within the Databricks environment
Conduct regular performance tuning of Spark jobs and Databricks cluster configurations to optimize cost and execution efficiency
Implement automated testing frameworks for Databricks-based data pipelines, including unit tests, integration tests, and data validation checks
Maintain comprehensive documentation for all Databricks operational procedures, runbooks, and troubleshooting guides
Coordinate scheduled maintenance windows and Databricks system upgrades with minimal business impact
Manage user access controls, workspace configurations, and security policies within Databricks environments
Monitor data lineage using Databricks Unity Catalog and maintain metadata management systems to support operational transparency and compliance requirements
Establish capacity planning processes to forecast Databricks infrastructure needs and manage cloud costs effectively
Collaboration & Leadership:
Provide technical guidance and mentorship to junior team members on Databricks best practices and data engineering principles
Participate in the on-call rotation for critical production systems, with a focus on Databricks platform stability
Lead operational reviews and contribute to continuous improvement initiatives for Databricks platform reliability and efficiency
Coordinate with infrastructure teams on Databricks cluster provisioning, network configurations, and security implementations
Requirements / Qualifications:
Education & Experience:
Degree in Computer Science or Computer Engineering
Minimum of 8-10 years of working experience in system operations, compliance, and management areas
Hands-on project experience with the Databricks platform (primary requirement)
Project experience in cloud operations or cloud architecture
Must be cloud certified (AWS)
Core Technical Skills:
Expert-level proficiency in Databricks platform, including workspace management, cluster configuration, and job orchestration
Strong expertise in Apache Spark within the Databricks environment, including Spark SQL, DataFrames, and RDDs
Extensive experience with Delta Lake, including data versioning, time travel, and ACID transactions
Proficiency in Databricks Unity Catalog for data governance and metadata management
In-depth understanding of data warehouse concepts, data profiling, data verification, and advanced analytics techniques
Strong knowledge of monitoring, incident management, and cloud cost control
Technology Stack Experience:
Databricks (primary and most critical skill)
AWS cloud services and architecture
IDMC (Informatica Intelligent Data Management Cloud)
Tableau for data visualization
Oracle Database management
MLOps practices within the Databricks environment (Good to have)
Stata for statistical analysis (Good to have)
Amazon SageMaker integration with Databricks (Good to have)
DataRobot platform integration (Good to have)
Soft Skills & Stakeholder Management:
Good interpersonal skills with the ability to work with different groups of stakeholders
Strong problem-solving skills and ability to work independently in a fast-paced environment with minimal supervision
Excellent communication skills for technical documentation and cross-team collaboration
Desirable Requirements:
Databricks certification (Associate or Professional level), highly preferred
Exposure to hospital information/clinical systems is an added advantage
Understanding of DevOps practices and CI/CD pipelines for Databricks-based data engineering projects
Knowledge of ITIL frameworks and operational best practices
Renee Feng
EA License No: 11C5502
EAP No. R23111942