Responsibilities:
Pipeline Engineering (PySpark on AWS):
Design, implement, and optimize batch/near-real-time ETL/ELT pipelines using PySpark on services such as AWS Glue; ensure code quality, reusability, and performance at scale.
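For illustration, a minimal PySpark sketch of the kind of batch transform this covers; the bucket names, columns, and schema are placeholders, not part of the role:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("orders_etl").getOrCreate()

    # Read raw JSON landed in S3 (placeholder path).
    raw = spark.read.json("s3://example-raw-bucket/orders/")

    # Basic cleansing and typing before writing to the curated zone.
    curated = (
        raw.dropDuplicates(["order_id"])
           .filter(F.col("order_id").isNotNull())
           .withColumn("order_ts", F.to_timestamp("order_ts"))
           .withColumn("order_date", F.to_date("order_ts"))
           .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    )

    # Partitioned Parquet output for downstream Glue/Athena consumers.
    (curated.write
            .mode("overwrite")
            .partitionBy("order_date")
            .parquet("s3://example-curated-bucket/orders/"))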
Workflow Orchestration:
Build and maintain DAGs with Apache Airflow (e.g., AWS MWAA) to schedule, monitor, and recover workflows; implement alerting, retries, and SLA handling.
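As a sketch only, a bare-bones Airflow DAG wiring up the retries, SLA, and failure alerting mentioned above (the DAG id, task, and notification hook are hypothetical):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def notify_on_failure(context):
        # Hypothetical hook: forward failed task/run details to an alerting channel.
        print(f"Task failed: {context['task_instance'].task_id}")

    default_args = {
        "owner": "data-eng",
        "retries": 2,                           # automatic retries on transient failures
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=1),              # SLA misses are recorded and can trigger alerts
        "on_failure_callback": notify_on_failure,
    }

    with DAG(
        dag_id="orders_daily_load",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:
        run_etl = PythonOperator(
            task_id="run_orders_etl",
            python_callable=lambda: print("trigger the Glue job here"),
        )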
Data Storage & Modeling:
Design efficient schemas (dimensional and/or data-vault/lakehouse), manage RDS PostgreSQL performance (indexes, partitioning, VACUUM/ANALYZE), and integrate with S3/Athena where appropriate.
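For illustration, the kind of RDS PostgreSQL housekeeping this involves, issued here through psycopg2 (the DSN, table, and columns are placeholders):

    import psycopg2

    conn = psycopg2.connect("host=example-rds dbname=analytics user=etl")  # placeholder DSN
    conn.autocommit = True  # VACUUM cannot run inside a transaction block
    cur = conn.cursor()

    # Range-partition a large fact table by month and index the common filter column.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fact_orders (
            order_id   bigint,
            order_date date NOT NULL,
            amount     numeric(18,2)
        ) PARTITION BY RANGE (order_date);
    """)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fact_orders_2024_01
            PARTITION OF fact_orders
            FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    """)
    cur.execute("CREATE INDEX IF NOT EXISTS idx_fact_orders_date ON fact_orders (order_date);")

    # Routine maintenance keeps the planner's statistics fresh.
    cur.execute("VACUUM (ANALYZE) fact_orders;")

    cur.close()
    conn.close()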
Reliability & Observability:
Instrument pipelines with metrics/logs; tune PySpark jobs (partitions, shuffle strategies, broadcast joins, caching) and optimize Glue job DPUs/cost.
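A short, illustrative example of those tuning levers (the shuffle-partition count and table names are placeholders, not recommendations):

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("orders_enrichment")
             .config("spark.sql.shuffle.partitions", "64")  # size shuffles to the data, not the default
             .getOrCreate())

    orders = spark.read.parquet("s3://example-curated-bucket/orders/")
    dim_customer = spark.read.parquet("s3://example-curated-bucket/dim_customer/")

    # Broadcast the small dimension table to avoid shuffling the large fact table.
    enriched = orders.join(F.broadcast(dim_customer), "customer_id", "left")

    # Cache only when the result is reused by several downstream actions.
    enriched.cache()

    # Coalesce before writing to avoid a small-files problem in S3.
    enriched.coalesce(32).write.mode("overwrite").parquet("s3://example-curated-bucket/orders_enriched/")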
Security & Governance:
Apply IAM least-privilege, encryption (KMS), tagging, and data masking/pseudonymization; collaborate with data governance on lineage, metadata, and quality controls.
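As an example of column-level pseudonymization in PySpark (the column names are placeholders, and in practice the salt would come from a secrets manager rather than the code):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("pii_pseudonymization").getOrCreate()
    customers = spark.read.parquet("s3://example-raw-bucket/customers/")

    SALT = "replace-with-value-from-secrets-manager"  # illustrative only; never hard-code secrets

    masked = (
        customers
          # Salted SHA-256 pseudonymizes the identifier while keeping joins possible.
          .withColumn("customer_id_hash",
                      F.sha2(F.concat(F.lit(SALT), F.col("customer_id").cast("string")), 256))
          # Mask direct PII outright.
          .withColumn("email", F.lit("***MASKED***"))
          .drop("customer_id")
    )

    masked.write.mode("overwrite").parquet("s3://example-curated-bucket/customers_masked/")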
CI/CD & Automation:
Use Git-based workflows; automate build/test/deploy of data jobs and infrastructure changes (IaC where applicable).
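For example, a pytest unit test for a PySpark transform that a Git-based CI pipeline could run on every change (the transform under test is hypothetical):

    import pytest
    from pyspark.sql import SparkSession, functions as F

    def dedupe_orders(df):
        # Example transform under test: drop duplicate orders and null keys.
        return df.dropDuplicates(["order_id"]).filter(F.col("order_id").isNotNull())

    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[2]").appName("ci-tests").getOrCreate()

    def test_dedupe_orders(spark):
        df = spark.createDataFrame(
            [(1, "a"), (1, "a"), (None, "b")],
            ["order_id", "payload"],
        )
        assert dedupe_orders(df).count() == 1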
Stakeholder Collaboration:
Engage product owners, analysts, and downstream consumers; translate requirements into robust data solutions; document runbooks and provide clear status updates.
Mentorship & Standards:
Review code, coach engineers, and contribute to engineering standards and best practices.
Required Qualifications:
Hands-on PySpark expertise (DataFrame API, performance tuning, job debugging).
Strong knowledge of AWS data services, particularly Glue, Airflow on MWAA (operator/DAG design), RDS PostgreSQL, S3, and CloudWatch.
Proficiency in SQL and Python for data engineering (testing, packaging, dependency management).
Experience operating production pipelines: monitoring, incident response, and root-cause analysis (RCA).
Communication: Excellent written and verbal English; able to explain complex topics clearly to technical and non-technical audiences.
Preferred / Nice to Have:
Financial Services domain experience (e.g., regulatory data controls, privacy, compliance).
Cantonese speaking skills (a plus).
Lakehouse patterns (e.g., Apache Iceberg), query engines (Athena/Presto), and data cataloging.
Performance tuning in PostgreSQL (EXPLAIN/ANALYZE, indexing strategies).
Experience with AWS services such as DMS, Lake Formation, and Glue Data Catalog.
DevOps tooling (GitLab/Jenkins).
Key Success Metrics:
Pipeline delivery on schedule and within budget; SLA adherence and incident responsiveness (MTTR/MTTA).
Data quality and reliability (validation coverage, defect escape rate).
Efficiency and cost optimization (DPU hours, storage/query costs).
Stakeholder satisfaction and adoption of delivered datasets.
Contribution to standards, documentation quality, and team mentorship.