Job Title:

ML Data Engineer (RAG & Vector Database Focused)

Working Type:

Full Time | On-site

Location:

D3, HCMC, Vietnam

Overview

We are seeking an experienced Machine Learning Data Engineer with deep expertise in Retrieval-Augmented Generation (RAG) systems and vector database–centric data architectures. This role is data-first, not model-training–first.

The primary responsibility is to design, build, and maintain high-quality, scalable data pipelines that power RAG systems, including dataset engineering, embedding lifecycle management, and vector database operations.

The engineer will work closely with AI/ML systems but is not expected to perform model training or act as an AI trainer. Model training and fine-tuning will be handled separately.

Key Responsibilities

RAG Data Engineering & Architecture

Design and maintain RAG-ready data pipelines from raw data ingestion to retrieval-ready corpora.

Prepare and structure datasets specifically for retrieval-based systems, ensuring data quality, relevance, and consistency.

Implement document preprocessing workflows including:

cleaning

normalization

deduplication

segmentation and chunking strategies optimized for retrieval.

Design scalable RAG data architectures that support incremental updates and long-term maintainability.

Vector Database & Embedding Lifecycle

Build and operate vector database pipelines (e.g., Qdrant or similar systems).

Design vector collections, indexing strategies, and payload/metadata schemas for efficient filtering and retrieval.

Manage the full embedding lifecycle, including:

embedding generation

versioning

re-embedding strategies

re-indexing workflows.

Optimize retrieval performance through proper data modeling, metadata usage, and vector index tuning.

Dataset Engineering for AI & ML Systems

Prepare high-quality datasets for machine learning and AI systems, with a strong focus on:

dataset separation (training, validation, evaluation)

labeling and annotation workflows

preventing data leakage.

Support AI training processes only at the dataset level, without responsibility for model training, optimization, or architecture design.

Ensure datasets are structured and documented to be reusable, auditable, and scalable.

Retrieval Quality & Scalability

Evaluate and improve retrieval quality through systematic analysis of data organization, chunking logic, and metadata structure.

Collaborate on defining retrieval evaluation strategies and data-driven improvements.

Ensure data pipelines and vector databases scale efficiently with growing data volumes and usage demands.

Required Skills and Experience

Core Technical Skills

Strong experience with Python for data processing, pipeline development, and backend services.

Solid background in data engineering, including ETL/ELT concepts and production-grade data workflows.

Hands-on experience with vector databases (Qdrant, Pinecone, Weaviate, or similar).

Deep understanding of embedding-based retrieval systems and vector search concepts.

Proven experience designing RAG data pipelines and retrieval-focused datasets.

Data & System Design

Expertise in dataset engineering, including structuring, validation, and lifecycle management.

Strong knowledge of metadata design, JSON-based schemas, and structured data handling.

Experience working in Linux environments (Ubuntu/Debian), including scripting and troubleshooting.

Ability to design scalable and maintainable data architectures for long-term AI systems.

AI & Machine Learning (Baseline Knowledge)

Practical understanding of how LLMs and AI systems consume data in RAG workflows.

Familiarity with AI training concepts only at a high level, sufficient to prepare high-quality datasets.

No requirement to implement, tune, or manage model training pipelines.

Cultural Values & Working Style

Data-centric mindset with strong attention to structure, quality, and long-term maintainability.

Strong ownership mentality from data ingestion to retrieval readiness.

Ability to work closely with ML engineers and system architects while maintaining a clear data engineering focus.

Preference for clean, scalable solutions over experimental or ad-hoc implementations.

Clear communication and documentation habits, especially around data design decisions.
