Job Title:
ML Data Engineer (RAG & Vector Database Focused)
Working Type:
Full-time | On-site
Location:
District 3, Ho Chi Minh City (HCMC), Vietnam
Overview
We are seeking an experienced Machine Learning Data Engineer with deep expertise in
Retrieval-Augmented Generation (RAG) systems and vector database–centric data
architectures. This role is data-first, not model-training–first.
The primary responsibility is to design, build, and maintain high-quality, scalable data
pipelines that power RAG systems, including dataset engineering, embedding lifecycle
management, and vector database operations.
The engineer will work closely with AI/ML systems but is not expected to perform model
training or act as an AI trainer. Model training and fine-tuning will be handled separately.
Key Responsibilities
RAG Data Engineering & Architecture
Design and maintain RAG-ready data pipelines from raw data ingestion to retrieval-ready
corpora.
Prepare and structure datasets specifically for retrieval-based systems, ensuring data
quality, relevance, and consistency.
Implement document preprocessing workflows including:
cleaning
normalization
deduplication
segmentation and chunking strategies optimized for retrieval.
Architect scalable RAG data systems that support incremental updates and long-term
maintainability.
Vector Database & Embedding Lifecycle
Build and operate vector database pipelines (e.g., Qdrant or similar systems).
Design vector collections, indexing strategies, and payload/metadata schemas for
efficient filtering and retrieval.
Manage the full embedding lifecycle, including:
embedding generation
versioning
re-embedding strategies
re-indexing workflows.
Optimize retrieval performance through proper data modeling, metadata usage, and
vector index tuning.
Dataset Engineering for AI & ML Systems
Prepare high-quality datasets for machine learning and AI systems, with a strong focus
on:
dataset separation (training, validation, evaluation)
labeling and annotation workflows
preventing data leakage.
Support AI training processes only at the dataset level, without responsibility for model
training, optimization, or architecture design.
Ensure datasets are structured and documented to be reusable, auditable, and scalable.
Retrieval Quality & Scalability
Evaluate and improve retrieval quality through systematic analysis of data organization,
chunking logic, and metadata structure.
Collaborate on defining retrieval evaluation strategies and data-driven improvements.
Ensure data pipelines and vector databases scale efficiently with growing data volumes
and usage demands.
Required Skills and Experience
Core Technical Skills
Strong experience with Python for data processing, pipeline development, and backend
services.
Solid background in data engineering, including ETL/ELT concepts and production-grade
data workflows.
Hands-on experience with vector databases (Qdrant, Pinecone, Weaviate, or similar).
Deep understanding of embedding-based retrieval systems and vector search concepts.
Proven experience designing RAG data pipelines and retrieval-focused datasets.
Data & System Design
Expertise in dataset engineering, including structuring, validation, and lifecycle
management.
Strong knowledge of metadata design, JSON-based schemas, and structured data
handling.
Experience working in Linux environments (Ubuntu/Debian), including scripting and
troubleshooting.
Ability to design scalable and maintainable data architectures for long-term AI systems.
AI & Machine Learning (Baseline Knowledge)
Practical understanding of how LLMs and AI systems consume data in RAG workflows.
High-level familiarity with AI training concepts, sufficient to prepare high-quality
datasets.
No requirement to implement, tune, or manage model training pipelines.
Cultural Values & Working Style
Data-centric mindset with strong attention to structure, quality, and long-term
maintainability.
Strong ownership mentality from data ingestion to retrieval readiness.
Ability to work closely with ML engineers and system architects while maintaining a clear
data engineering focus.
Preference for clean, scalable solutions over experimental or ad-hoc implementations.
Clear communication and documentation habits, especially around data design
decisions.