Company Description
Growtrics is an innovative EdTech company leveraging data and artificial intelligence to optimize individual learning experiences. The company is committed to developing cutting-edge solutions that empower people to achieve their learning goals effectively. With a strong focus on innovation and technology, Growtrics strives to make education more personalized and accessible to learners worldwide. Joining Growtrics means being part of a team that values creativity, growth, and impactful outcomes.
Role Description
We are seeking an experienced DevOps/SRE professional to architect, build, and maintain a resilient, scalable, and highly available infrastructure for our Growtrics platform, an AI-powered EdTech application. You will play a critical role in ensuring rapid, high-quality delivery across our technology stack:
Flutter mobile applications (iOS & Android)
FastAPI backend services
GPU-accelerated ML/LLM workloads
NextJS web portal with Firebase backend
This role requires deep expertise in full-stack CI/CD, MLOps, and high-traffic systems to guarantee operational excellence and fault tolerance.
Key Responsibilities
CI/CD & Delivery Gating Systems
Architect and maintain CI/CD pipelines for mobile (Flutter) and backend (FastAPI) services, ensuring fast, reliable, and repeatable deployments.
Implement robust Quality Gating Systems at every stage: commit, build, test, and deploy. Include security scans, linting, performance testing, contract testing, and E2E tests.
Support immutable infrastructure patterns and zero-downtime deployments with rapid rollback strategies.
Orchestrate multi-environment releases (Dev, Stage, Prod) with proper change management and versioning controls.
Optimize pipelines to handle large-scale parallel deployments and artifact management.
Testing & Quality Automation
Collaborate with developers to integrate automated testing at all levels: unit, integration, contract (FastAPI/OpenAPI), and end-to-end (Flutter).
Implement performance, load, and stress testing frameworks for backend services and LLM endpoints to anticipate production-scale workloads.
Provision ephemeral testing environments with automated setup and teardown for QA and feature testing.
MLOps & GPU Resource Management
Manage GPU workloads on serverless or cloud-hosted infrastructure (e.g., Modal, AWS/GCP GPU instances), including scheduling, scaling, and monitoring.
Implement autoscaling policies and resource quotas for ML/LLM workloads in Kubernetes, considering cost and latency optimization.
Integrate ML workflow orchestration for training, fine-tuning, and serving models at scale.
Core Infrastructure & Scalability
Design, deploy, and operate highly available Kubernetes clusters with multi-region redundancy. Utilize HPA/VPA and custom metrics for autoscaling.
Ensure resilience and fault tolerance via chaos engineering, DR testing, and multi-zone deployments.
Manage observability and monitoring: use Prometheus, Grafana, the ELK stack, logging, and tracing to monitor API latency, Flutter performance, GPU utilization, WebSocket connections, and Celery tasks.
Maintain cloud infrastructure using IaC (Terraform, CloudFormation) for repeatable, auditable provisioning.
High-Traffic Systems Management
Optimize the FastAPI backend for high-concurrency scenarios, including WebSocket connections and Celery task queues.
Ensure efficient load balancing, caching, and request routing to reduce latency and handle spikes in traffic.
Implement rate-limiting, throttling, and backpressure strategies for LLM-heavy endpoints.
Security & Compliance
Enforce security best practices: data encryption at rest and in transit, secrets management, and DLP.
Ensure compliance with GDPR, HIPAA, and other relevant regulations.
Conduct penetration testing, vulnerability scans, and incident response preparation.
Web & Mobile Support
Maintain and contribute to the NextJS/Firebase web portal.
Collaborate with mobile teams to optimize app performance, crash monitoring, and release automation.
Required Skills & Qualifications
5+ years in a DevOps/SRE role managing full-stack, high-traffic production environments.
Expert in Kubernetes cluster design, deployment, and operations for mobile and ML workloads.
Proven experience building and automating CI/CD pipelines for Flutter (iOS/Android) and web/REST services.
Hands-on experience with high-quality gating systems and integrating automated testing at scale.
Experience with GPU resource management, orchestration, and optimization for ML workloads.
Proficient in Python/FastAPI backend services, WebSockets, and Celery task queues.
Strong skills in Infrastructure-as-Code (Terraform) and multi-cloud management (AWS/GCP/Azure).
Expertise in Observability: Prometheus, Grafana, ELK, OpenTelemetry, and application performance monitoring.
Deep understanding of security, compliance, and disaster recovery in cloud-native systems.
Preferred Qualifications:
Experience with mobile CI/CD tools (Codemagic, Fastlane, Shorebird).
Familiarity with microservices communication patterns (gRPC, message queues).
Certified Kubernetes Administrator (CKA) or equivalent cloud certifications.
Experience scaling LLM endpoints and GPU-heavy ML services.
Familiarity with serverless deployments and event-driven architectures.
Why you'll love working here:
Highly competitive compensation package, including 100% salary during probation, designed to reward your talent and dedication.
13th-month salary and performance-based bonuses to celebrate your achievements and contributions.
Comprehensive Social Insurance calculated on gross salary, giving you peace of mind and full legal protection.
Generous and flexible time-off policy: 14 annual leave days + 6 sick leave days to recharge and maintain work-life balance.
Flexible working hours, allowing you to manage your schedule and work in a way that suits your lifestyle.
MacBook provided, ensuring you have the tools to perform at your best from day one.
Full compliance with the Vietnam Labor Code, so you can focus on work with confidence in a fair and lawful environment.
Opportunity to collaborate directly with global stakeholders, gaining exposure to international best practices and expanding your professional network.
A workplace that truly values your well-being: regular social events, sports clubs, gym activities, and team-building programs to foster connection and fun.
Vibrant, youthful, and international culture that encourages creativity, innovation, and continuous growth.
Fully stocked pantry with a variety of snacks, milk, and beverages to keep you energized throughout the day.
Recognition and rewards based on your experience, skills, and qualifications, ensuring your contributions are appreciated and fairly compensated.