Responsibilities:
Pipeline Engineering (PySpark on AWS):
Design, implement, and optimize batch/near-real-time ETL/ELT pipelines using PySpark on services such as AWS Glue; ensure code quality, reusability, and performance at scale.
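For illustration, a minimal PySpark sketch of the kind of batch transform this covers; the bucket names, columns, and schema are placeholders, not part of the role:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("orders_etl").getOrCreate()

    # Read raw JSON landed in S3 (placeholder path).
    raw = spark.read.json("s3://example-raw-bucket/orders/")

    # Basic cleansing and typing before writing to the curated zone.
    curated = (
        raw.dropDuplicates(["order_id"])
           .filter(F.col("order_id").isNotNull())
           .withColumn("order_ts", F.to_timestamp("order_ts"))
           .withColumn("order_date", F.to_date("order_ts"))
           .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    )

    # Partitioned Parquet output for downstream Glue/Athena consumers.
    (curated.write
            .mode("overwrite")
            .partitionBy("order_date")
            .parquet("s3://example-curated-bucket/orders/"))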
Workflow Orchestration:
Build and maintain DAGs with Apache Airflow (e.g., AWS MWAA) to schedule, monitor, and recover workflows; implement alerting, retries, and SLA handling.
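As a sketch only, a bare-bones Airflow DAG wiring up the retries, SLA, and failure alerting mentioned above (the DAG id, task, and notification hook are hypothetical):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def notify_on_failure(context):
        # Hypothetical hook: forward failed task/run details to an alerting channel.
        print(f"Task failed: {context['task_instance'].task_id}")

    default_args = {
        "owner": "data-eng",
        "retries": 2,                           # automatic retries on transient failures
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=1),              # SLA misses are recorded and can trigger alerts
        "on_failure_callback": notify_on_failure,
    }

    with DAG(
        dag_id="orders_daily_load",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args=default_args,
    ) as dag:
        run_etl = PythonOperator(
            task_id="run_orders_etl",
            python_callable=lambda: print("trigger the Glue job here"),
        )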
Data Storage & Modeling:
Design efficient schemas (dimensional and/or data-vault/lakehouse), manage RDS PostgreSQL performance (indexes, partitioning, VACUUM/ANALYZE), and integrate with S3/Athena where appropriate.
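For illustration, the kind of RDS PostgreSQL housekeeping this involves, issued here through psycopg2 (the DSN, table, and columns are placeholders):

    import psycopg2

    conn = psycopg2.connect("host=example-rds dbname=analytics user=etl")  # placeholder DSN
    conn.autocommit = True  # VACUUM cannot run inside a transaction block
    cur = conn.cursor()

    # Range-partition a large fact table by month and index the common filter column.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fact_orders (
            order_id   bigint,
            order_date date NOT NULL,
            amount     numeric(18,2)
        ) PARTITION BY RANGE (order_date);
    """)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fact_orders_2024_01
            PARTITION OF fact_orders
            FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    """)
    cur.execute("CREATE INDEX IF NOT EXISTS idx_fact_orders_date ON fact_orders (order_date);")

    # Routine maintenance keeps the planner's statistics fresh.
    cur.execute("VACUUM (ANALYZE) fact_orders;")

    cur.close()
    conn.close()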
Reliability & Observability:
Instrument pipelines with metrics/logs; tune PySpark jobs (partitions, shuffle strategies, broadcast joins, caching) and optimize Glue job DPUs/cost.
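A short, illustrative example of those tuning levers (the shuffle-partition count and table names are placeholders, not recommendations):

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("orders_enrichment")
             .config("spark.sql.shuffle.partitions", "64")  # size shuffles to the data, not the default
             .getOrCreate())

    orders = spark.read.parquet("s3://example-curated-bucket/orders/")
    dim_customer = spark.read.parquet("s3://example-curated-bucket/dim_customer/")

    # Broadcast the small dimension table to avoid shuffling the large fact table.
    enriched = orders.join(F.broadcast(dim_customer), "customer_id", "left")

    # Cache only when the result is reused by several downstream actions.
    enriched.cache()

    # Coalesce before writing to avoid a small-files problem in S3.
    enriched.coalesce(32).write.mode("overwrite").parquet("s3://example-curated-bucket/orders_enriched/")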
Security & Governance:
Apply IAM least-privilege, encryption (KMS), tagging, and data masking/pseudonymization; collaborate with data governance on lineage, metadata, and quality controls.
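As an example of column-level pseudonymization in PySpark (the column names are placeholders, and in practice the salt would come from a secrets manager rather than the code):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("pii_pseudonymization").getOrCreate()
    customers = spark.read.parquet("s3://example-raw-bucket/customers/")

    SALT = "replace-with-value-from-secrets-manager"  # illustrative only; never hard-code secrets

    masked = (
        customers
          # Salted SHA-256 pseudonymizes the identifier while keeping joins possible.
          .withColumn("customer_id_hash",
                      F.sha2(F.concat(F.lit(SALT), F.col("customer_id").cast("string")), 256))
          # Mask direct PII outright.
          .withColumn("email", F.lit("***MASKED***"))
          .drop("customer_id")
    )

    masked.write.mode("overwrite").parquet("s3://example-curated-bucket/customers_masked/")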
CI/CD & Automation:
Use Git-based workflows; automate build/test/deploy of data jobs and infrastructure changes (IaC where applicable).
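For example, a pytest unit test for a PySpark transform that a Git-based CI pipeline could run on every change (the transform under test is hypothetical):

    import pytest
    from pyspark.sql import SparkSession, functions as F

    def dedupe_orders(df):
        # Example transform under test: drop duplicate orders and null keys.
        return df.dropDuplicates(["order_id"]).filter(F.col("order_id").isNotNull())

    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[2]").appName("ci-tests").getOrCreate()

    def test_dedupe_orders(spark):
        df = spark.createDataFrame(
            [(1, "a"), (1, "a"), (None, "b")],
            ["order_id", "payload"],
        )
        assert dedupe_orders(df).count() == 1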
Stakeholder Collaboration:
Engage product owners, analysts, and downstream consumers; translate requirements into robust data solutions; document runbooks and provide clear status updates.
Mentorship & Standards:
Review code, coach engineers, and contribute to engineering standards and best practices.
Required Qualifications:
Hands-on PySpark expertise (DataFrame API, performance tuning, job debugging).
Strong knowledge of AWS data services, particularly Glue, Airflow on MWAA (operator/DAG design), RDS PostgreSQL, S3, and CloudWatch.
Proficiency in SQL and Python for data engineering (testing, packaging, dependency management).
Experience operating production pipelines: monitoring, incident response, and root-cause analysis (RCA).
Communication: Excellent written and verbal English; able to explain complex topics clearly to technical and non-technical audiences.
Preferred / Nice to Have:
Financial Services domain experience (e.g., regulatory data controls, privacy, compliance).
Cantonese speaking skills (a plus).
Lakehouse patterns (e.g., Apache Iceberg), query engines (Athena/Presto), and data cataloging.
Performance tuning in PostgreSQL (EXPLAIN/ANALYZE, indexing strategies).
Experience with AWS services such as DMS, Lake Formation, and Glue Data Catalog.
DevOps tooling (GitLab/Jenkins).
Key Success Metrics:
Pipeline delivery on schedule and within budget; SLA adherence and incident responsiveness (MTTR/MTTA).
Data quality and reliability (validation coverage, defect escape rate).
Efficiency and cost optimization (DPU hours, storage/query costs).
Stakeholder satisfaction and adoption of delivered datasets.
Contribution to standards, documentation quality, and team mentorship.