Key Responsibilities
Design, build, and maintain robust, scalable, and secure data pipelines for batch and real-time data processing.
Develop and optimize ETL/ELT workflows to extract, transform, and load data from multiple sources.
Architect and implement data warehouses, data lakes, and lakehouse solutions on cloud or on-prem platforms.
Ensure data quality, lineage, governance, and versioning using metadata management tools.
Collaborate with Data Scientists, Analysts, and Software Engineers to deliver reliable and accessible data solutions.
Optimize SQL queries, data models, and storage layers for performance and cost efficiency.
Develop and maintain automation scripts for data ingestion, transformation, and orchestration.
Integrate and process large-scale data from APIs, flat files, streaming services, and legacy systems.
Implement data security, access control, and compliance standards (GDPR, ISO 27001).
Monitor and troubleshoot data pipeline failures, latency, and performance bottlenecks.
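To make the pipeline and ETL/ELT responsibilities above concrete, here is a minimal, purely illustrative batch ETL sketch in Python using pandas; the file names, column names, and transformation rules are hypothetical examples, not a prescribed implementation.

```python
# Minimal batch ETL sketch: extract a CSV, apply a simple transform,
# and load the result as Parquet. All file and column names are hypothetical.
import pandas as pd

def run_batch_etl(source_csv: str, target_parquet: str) -> None:
    # Extract: read raw records from a flat file.
    df = pd.read_csv(source_csv)

    # Transform: normalize column names and drop rows missing the key.
    df.columns = [c.strip().lower() for c in df.columns]
    df = df.dropna(subset=["order_id"])                   # assumes an 'order_id' column
    df["order_date"] = pd.to_datetime(df["order_date"])   # assumes an 'order_date' column

    # Load: write a columnar file suitable for downstream warehousing.
    df.to_parquet(target_parquet, index=False)

if __name__ == "__main__":
    run_batch_etl("orders.csv", "orders.parquet")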
Data Engineering & Architecture
Strong expertise in data modeling (dimensional/star/snowflake schemas) and data normalization techniques.
Proficient in ETL/ELT tools such as Apache NiFi, Talend, Informatica, SSIS, or Airbyte.
Advanced knowledge of SQL and distributed computing concepts.
Experience with data lake and warehouse technologies such as Snowflake, Redshift, BigQuery, Azure Synapse, or Databricks.
Deep understanding of data partitioning, indexing, and query optimization.
Big Data & Distributed Systems
Hands-on experience with the Hadoop ecosystem (HDFS, Hive, HBase, Oozie, Sqoop).
Proficiency in Apache Spark / PySpark for distributed data processing.
Exposure to streaming frameworks like Kafka, Flink, or Kinesis.
Familiarity with NoSQL databases such as MongoDB, Cassandra, or Elasticsearch.
Knowledge of data versioning and catalog systems (e.g., Delta Lake, Apache Hudi, Iceberg, or AWS Glue Data Catalog).
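As a brief illustration of the Spark and partitioning skills listed above, the sketch below shows a minimal PySpark aggregation job; the S3 paths and column names (order_ts, country, amount) are assumed for the example only.

```python
# Minimal PySpark sketch: read Parquet, aggregate, and write a partitioned output.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_revenue").getOrCreate()

orders = spark.read.parquet("s3://example-bucket/raw/orders/")   # hypothetical path

daily_revenue = (
    orders
    .withColumn("order_date", F.to_date("order_ts"))   # assumes an 'order_ts' timestamp column
    .groupBy("order_date", "country")
    .agg(F.sum("amount").alias("revenue"))
)

# Partitioning by date keeps downstream scans cheap.
daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-bucket/curated/daily_revenue/"
)

spark.stop()
```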
Programming & Automation
Strong programming skills in Python, Scala, or Java for data manipulation and ETL automation.
Experience with API integration, REST/GraphQL, and data serialization formats (JSON, Parquet, Avro, ORC).
Proficient in shell scripting, automation, and orchestration tools (Apache Airflow, Prefect, or Luigi).
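For the orchestration tools named above, a minimal sketch of a daily ingestion DAG is shown below, assuming a recent Apache Airflow 2.x release; the dag_id, schedule, and placeholder tasks are illustrative assumptions.

```python
# Minimal Apache Airflow 2.x DAG sketch. The dag_id, schedule, and task
# callables are hypothetical examples, not a prescribed implementation.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract() -> None:
    print("pull data from a source system")   # placeholder for real extraction logic

def load() -> None:
    print("write data to the warehouse")      # placeholder for real load logic

with DAG(
    dag_id="example_daily_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",        # 'schedule' assumes Airflow 2.4+
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task   # run extract before load
```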
Cloud Platforms
Expertise in at least one cloud ecosystem:
AWS: S3, Redshift, Glue, EMR, Lambda, Athena, Kinesis
Azure: Data Factory, Synapse, Blob Storage, Databricks
GCP: BigQuery, Dataflow, Pub/Sub, Cloud Composer
Strong understanding of IAM, VPC, encryption, and data access policies within cloud environments.
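As an example of the AWS services listed above, the sketch below lands a file in S3 with boto3 and lists the resulting objects; the bucket name and key prefix are hypothetical, and credentials are assumed to come from the standard AWS credential chain.

```python
# Minimal boto3 sketch: upload a local file to S3 and list objects under a prefix.
# Bucket and key names are hypothetical; credentials are resolved by the usual
# AWS mechanisms (environment variables, instance profile, etc.).
import boto3

s3 = boto3.client("s3")

# Land a raw extract in the data lake's "raw" zone.
s3.upload_file("orders.parquet", "example-data-lake", "raw/orders/orders.parquet")

# Confirm the object arrived.
response = s3.list_objects_v2(Bucket="example-data-lake", Prefix="raw/orders/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```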
Data Governance & Security
Implement and enforce data quality frameworks (DQ checks, profiling, validation rules).
Knowledge of metadata management, lineage tracking, and master data management (MDM).
Familiarity with role-based access control (RBAC) and data encryption mechanisms.
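To illustrate the data quality checks referred to above, here is a minimal hand-rolled validation sketch in pandas; in practice a dedicated framework (e.g., Great Expectations or dbt tests) would typically carry these rules, and the column names and rules here are assumptions.

```python
# Minimal data-quality check sketch with pandas: completeness, uniqueness,
# and validity checks. Column names and rules are hypothetical.
import pandas as pd

def run_dq_checks(df: pd.DataFrame) -> list[str]:
    failures = []

    # Completeness: the key column must not contain nulls.
    if df["order_id"].isna().any():
        failures.append("order_id contains null values")

    # Uniqueness: the primary key must be unique.
    if df["order_id"].duplicated().any():
        failures.append("order_id contains duplicates")

    # Validity: amounts must be non-negative.
    if (df["amount"] < 0).any():
        failures.append("amount contains negative values")

    return failures

if __name__ == "__main__":
    sample = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, -5.0, 3.5]})
    for failure in run_dq_checks(sample):
        print("DQ failure:", failure)
```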
Preferred Skills
Experience with machine learning data pipelines (MLOps) or feature store management.
Knowledge of containerization and orchestration tools (Docker, Kubernetes).
Familiarity with CI/CD pipelines for data deployment.
Exposure to business intelligence (BI) tools like Power BI, Tableau, or Looker for data delivery.
Understanding of data mesh or domain-driven data architecture principles.
Leadership & Collaboration
Work closely with cross-functional teams to define data requirements and best practices.
Mentor junior engineers and enforce coding and documentation standards.
Provide technical input on data strategy, architecture reviews, and technology evaluations.
Collaborate with security and compliance teams to ensure data integrity and protection.
Qualifications
Bachelor’s or Master’s degree in Computer Science, Data Engineering, Information Systems, or a related field.
5–10 years of professional experience as a Data Engineer or in a similar role.
Preferred professional certifications:
AWS Certified Data Analytics - Specialty (formerly AWS Certified Big Data - Specialty)
Microsoft Certified: Azure Data Engineer Associate
Google Professional Data Engineer
Databricks Certified Data Engineer