About Fusemachines
Fusemachines is a leading AI strategy, talent, and education services provider. Founded by Sameer Maskey, Ph.D., Adjunct Associate Professor at Columbia University, Fusemachines has a core mission of democratizing AI. With a presence in four countries (Nepal, the United States, Canada, and the Dominican Republic) and more than 450 full-time employees, Fusemachines seeks to bring its global expertise in AI to transform companies around the world.
Type: Full-time, Remote
About The Role
This is a remote, full-time position responsible for designing, building, testing, optimizing, and maintaining the infrastructure and code required for data integration, storage, processing, pipelines, and analytics (BI, visualization, and advanced analytics), from ingestion to consumption. The role involves implementing data flow controls and ensuring high data quality and accessibility for analytics and business intelligence purposes. It requires a strong foundation in programming and a keen understanding of how to integrate and manage data effectively across various storage systems and technologies.
We are looking for a skilled Data Engineer with a strong background in Python, SQL, PySpark, and AWS cloud-based large-scale data solutions, and a passion for data quality, performance, and cost optimization. The ideal candidate will develop in an Agile environment.
This role is perfect for an individual passionate about leveraging data to drive insights, improve decision-making, and support the strategic goals of the organization through innovative data engineering solutions.
Qualifications & Experience
Must have a full-time Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field
At least 2 years of experience as a data engineer with strong expertise in Python, SQL, PySpark, and AWS in an Agile environment; a proven track record of building and optimizing data pipelines, architectures, and datasets; and proven experience in data storage, modeling, management, lakes, warehousing, processing/transformation, integration, cleansing, validation, and analytics
2+ years of experience with DevOps tools and technologies: GitHub or AWS DevOps
Proven experience delivering large-scale data and analytics projects and products as a data engineer on AWS
Previous experience working with retail or similar data models is preferred
Following certifications:
- AWS Certified Cloud Practitioner
- AWS Certified Data Engineer - Associate
Nice to have:
- Databricks Certified Associate Developer for Apache Spark
- Databricks Certified Data Engineer Associate
Required Skills/Competencies
Strong programming skills in one or more object-oriented languages such as Python (must have), Scala, or Java, and proficiency in writing high-quality, scalable, maintainable, efficient, and optimized code for data integration, storage, processing, manipulation, and analytics solutions
Strong SQL skills and experience working with complex data sets, enterprise data warehouses, and writing advanced SQL queries. Proficient with relational databases (RDS, MySQL, Postgres, or similar) and NoSQL databases (Cassandra, MongoDB, Neo4j, etc.)
Strong analytic skills related to working with structured and unstructured datasets
Thorough understanding of big data principles, techniques, and best practices
Experience with scalable and distributed data processing technologies such as Spark/PySpark (must have, including Spark SQL) and Kafka, in order to handle large volumes of data
Experience with stream-processing systems (Storm, Spark Streaming, etc.) is a plus
Experience implementing data pipelines and efficient ELT/ETL processes, both batch and real-time, in AWS and with open-source solutions, with the ability to develop custom integration solutions as needed, including data integration from different sources such as APIs (PoS integrations are a plus), ERP systems (Oracle and Allegra are a plus), databases, flat files, Apache Parquet, and event streaming (see the PySpark sketch after this list)
Experience in data cleansing, transformation, and validation
Understanding of data modeling and database design principles, with the ability to implement efficient database schemas that meet the requirements of data solutions, and a good understanding of dimensional data modeling
Knowledge of cloud computing, specifically AWS services related to data and analytics such as S3, EMR, Glue, SageMaker, RDS, Redshift, Lambda, Kinesis, Lake Formation, EC2, ECS/ECR, EKS, IAM, and CloudWatch, including implementing data warehouse, data lake, and data lakehouse solutions in AWS
Experience in Orchestration using technologies like Azkaban, Luigi, Airflow, etc.
Good understanding of BI solutions including Looker and LookML (Looker Modeling Language)
Familiarity with advanced analytics and AI/ML services and tools, and the ability to integrate advanced analytics, machine learning, and AI capabilities into data solutions (nice to have)
Strong understanding of the software development lifecycle (SDLC), especially Agile methodologies
Knowledge of SDLC tools and technologies, including project management software (Jira or similar), source code management (GitHub, AWS CodeCommit, or similar), CI/CD systems (GitHub Actions, Jenkins, AWS CodePipeline, or similar), and binary repository managers (Sonatype Nexus, AWS CodeArtifact, or similar)
Knowledge and hands-on experience of DevOps principles, tools, and technologies (GitHub and AWS DevOps), including continuous integration and continuous delivery (CI/CD), infrastructure as code (IaC, e.g. Terraform), configuration management, automated testing, performance tuning, and cost management and optimization
Knowledge of data structures and algorithms and good software engineering practices
Strong analytical skills to identify and address technical issues, performance bottlenecks, and system failures
Proficiency in debugging and troubleshooting issues in complex data and analytics environments and pipelines
Understanding of Data Quality and Governance, including implementation of data quality and integrity checks and monitoring processes to ensure that data is accurate, complete, and consistent.
Good problem-solving skills: able to troubleshoot data processing pipelines and identify performance bottlenecks and other issues
Strong interpersonal skills and ability to work with a wide range of stakeholders
Excellent communication skills to collaborate with cross-functional teams, including business users, data architects, DevOps/DataOps/MLOps engineers, data analysts, data scientists, developers, and operations teams; the ability to convey complex technical concepts and insights to non-technical stakeholders effectively is essential
Ability to document processes, procedures, and deployment configurations
Understanding of security practices, including network security groups, encryption, and compliance standards, and the ability to implement security controls and best practices within data and analytics solutions, including working knowledge of common cloud security vulnerabilities and ways to mitigate them
Self-motivated with the ability to work well in a team
Strong project management and organizational skills
A willingness to stay updated with the latest services, Data Engineering trends, and best practices in the field
Comfortable with picking up new technologies independently and working in a rapidly changing environment with ambiguous requirements
Care about architecture, observability, testing, and building reliable infrastructure and data pipelines
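For context, the sketch below illustrates the kind of PySpark batch ELT step this list describes: reading raw data, applying basic cleansing and validation, and writing partitioned Parquet for analytics. It is a minimal illustration rather than a prescribed implementation; the S3 paths, column names, and validation rules are hypothetical.

    # Minimal PySpark ELT sketch, for illustration only.
    # The S3 paths, column names, and validation rules are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("example-elt").getOrCreate()

    # Extract: read raw CSV files (hypothetical source path).
    raw = spark.read.csv("s3://example-bucket/raw/orders/",
                         header=True, inferSchema=True)

    # Transform: de-duplicate, enforce a simple quality rule, normalize types.
    clean = (
        raw.dropDuplicates(["order_id"])
           .filter(F.col("order_total").isNotNull())
           .withColumn("order_date", F.to_date("order_date"))
    )

    # Load: write partitioned Parquet for downstream analytics
    # (hypothetical target path).
    clean.write.mode("overwrite").partitionBy("order_date").parquet(
        "s3://example-bucket/curated/orders/"
    )

In practice a step like this would typically run on EMR or Glue and be scheduled by an orchestrator such as Airflow, per the items above.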
Responsibilities:
Design, implement, deploy, test and maintain highly scalable and efficient data architectures, defining and maintaining standards and best practices for data management independently with minimal guidance
Ensure systems meet business requirements and industry practices for data integrity, performance, and reliability
Integrate new data management technologies and software engineering tools into existing structures
Create custom software components and analytics applications
Employ a variety of languages and tools to integrate systems and hunt down opportunities to improve current processes
Evaluate and advise on technical aspects of open work requests in the data pipeline with the project team
Handle ELT/ETL processes, including data extraction, loading, and transformation from different sources, ensuring consistency and quality
Transform and clean data for further analysis and storage
Design and optimize data models and schemas to support business requirements and analysis
Implement monitoring tools and systems to ensure the availability and performance of data systems.
Manage data security and access, ensuring confidentiality and integrity
Automate repetitive tasks and processes to improve operational efficiency
Collaborate with data science teams to establish pipelines and workflows for training, validation, deployment, and monitoring of machine learning models. Automate deployment and management of machine learning models in production environments
Contribute to data quality assurance efforts, such as implementing data validation checks and tests to ensure the reliability, efficiency, accuracy, completeness, and consistency of data (a minimal example appears after this list)
Test software solutions and meet product quality standards prior to release to QA
Ensure the reliability, scalability, and efficiency of data systems are maintained at all times, identifying and resolving performance bottlenecks in pipelines caused by data, queries, and processing workflows to ensure efficient and timely data delivery
Work with DevOps teams to optimize resources
Assist in the configuration and management of data warehousing and data lake solutions
Collaborate closely with cross-functional teams, including Product, Engineering, Data Scientists, and Analysts, to thoroughly understand data requirements, provide data engineering support, and extend the company's data with third-party sources of information when needed
Take ownership of the storage layer and database management tasks, including schema design, indexing, and performance tuning
Evaluate and implement cutting-edge technologies and methodologies and continue learning and expanding skills in data engineering and cloud platforms, to improve and modernize existing data systems
Develop, design, and execute data governance strategies encompassing cataloging, lineage tracking, quality control, and data governance frameworks that align with current analytics demands and industry best practices, working closely with the Data Architect
Ensure technology solutions support the needs of the customer and/or organization
Define and document data engineering architectures, processes and data flows
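As one concrete illustration of the data quality responsibilities above, the following is a minimal sketch of validation checks that could gate a pipeline step. The dataset path, key column, and rules are hypothetical, and real pipelines would use richer checks and reporting.

    # Minimal data quality check sketch, for illustration only.
    # The dataset path, key column, and rules are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dq-checks").getOrCreate()
    df = spark.read.parquet("s3://example-bucket/curated/orders/")

    total = df.count()
    null_ids = df.filter(F.col("order_id").isNull()).count()
    dupe_ids = total - df.dropDuplicates(["order_id"]).count()

    # Fail the pipeline step loudly if completeness or uniqueness is violated.
    assert null_ids == 0, f"{null_ids} rows are missing order_id"
    assert dupe_ids == 0, f"{dupe_ids} duplicate order_id values found"

In a production pipeline, checks like these would usually be wired into the orchestrator so that downstream tasks do not run on bad data.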
Equal Opportunity Employer: Race, Color, Religion, Sex, Sexual Orientation, Gender Identity, National Origin, Age, Genetic Information, Disability, Protected Veteran Status, or any other legally protected group status.