DevOps / Site Reliability Engineer (SRE)

Growtrics • 🌐 In Person

Posted 5 days, 5 hours ago

Job Description

Company Description

Growtrics is an innovative EdTech company leveraging data and artificial intelligence to optimize individual learning experiences. The company is committed to developing cutting-edge solutions that empower people to achieve their learning goals effectively. With a strong focus on innovation and technology, Growtrics strives to make education more personalized and accessible to learners worldwide. Joining Growtrics means being part of a team that values creativity, growth, and impactful outcomes.

Role Description

We are seeking an experienced DevOps/SRE professional to architect, build, and maintain a resilient, scalable, and highly available infrastructure for our Growtrics platform, an AI-powered EdTech application. You will play a critical role in ensuring rapid, high-quality delivery across our technology stack:

Flutter mobile applications (iOS & Android)

FastAPI backend services

GPU-accelerated ML/LLM workloads

NextJS web portal with Firebase backend

This role requires deep expertise in full-stack CI/CD, MLOps, and high-traffic systems to guarantee operational excellence and fault tolerance.

Key Responsibilities

CI/CD & Delivery Gating Systems

Architect and maintain CI/CD pipelines for mobile (Flutter) and backend (FastAPI) services, ensuring fast, reliable, and repeatable deployments.

Implement robust Quality Gating Systems at every stage: commit, build, test, deploy. Include security scans, linting, performance testing, contract testing, and E2E tests.

Support immutable infrastructure patterns and zero-downtime deployments with rapid rollback strategies.

Orchestrate multi-environment releases (Dev, Stage, Prod) with proper change management and versioning controls.

Optimize pipelines to handle large-scale parallel deployments and artifact management.

Testing & Quality Automation

Collaborate with developers to integrate automated testing at all levels: unit, integration, contract (FastAPI/OpenAPI), and end-to-end (Flutter).

Implement performance, load, and stress testing frameworks for backend services and LLM endpoints to anticipate production-scale workloads.

Provision ephemeral testing environments with automated setup and teardown for QA and feature testing.

MLOps & GPU Resource Management

Manage GPU workloads on serverless or cloud-hosted infrastructure (e.g., Modal, AWS/GCP GPU instances), including scheduling, scaling, and monitoring.

Implement autoscaling policies and resource quotas for ML/LLM workloads in Kubernetes, considering cost and latency optimization.

Integrate ML workflow orchestration for training, fine-tuning, and serving models at scale.

Core Infrastructure & Scalability

Design, deploy, and operate highly available Kubernetes clusters with multi-region redundancy. Utilize HPA/VPA and custom metrics for autoscaling.

Ensure resilience and fault tolerance via chaos engineering, DR testing, and multi-zone deployments.

Manage observability and monitoring: use Prometheus, Grafana, the ELK stack, logging, and tracing to monitor API latency, Flutter performance, GPU utilization, WebSocket connections, and Celery tasks.

Maintain cloud infrastructure using IaC (Terraform, CloudFormation) for repeatable, auditable provisioning.

High-Traffic Systems Management

Optimize FastAPI backend for high-concurrency scenarios, including WebSocket connections and Celery task queues.

Ensure efficient load balancing, caching, and request routing to reduce latency and handle spikes in traffic.

Implement rate-limiting, throttling, and backpressure strategies for LLM-heavy endpoints.

Security & Compliance

Enforce security best practices: data encryption at rest and in transit, secrets management, and DLP.

Ensure compliance with GDPR, HIPAA, and other relevant regulations.

Conduct penetration testing, vulnerability scans, and incident response preparation.

Web & Mobile Support

Maintain and contribute to the NextJS/Firebase web portal.

Collaborate with mobile teams to optimize app performance, crash monitoring, and release automation.

Required Skills & Qualifications

5+ years in a DevOps/SRE role managing full-stack, high-traffic production environments.

Expert in Kubernetes cluster design, deployment, and operations for mobile + ML workloads.

Proven experience building and automating CI/CD pipelines for Flutter (iOS/Android) and web/REST services.

Hands-on experience with high-quality gating systems and integrating automated testing at scale.

Experience with GPU resource management, orchestration, and optimization for ML workloads.

Proficient in Python/FastAPI backend services, WebSockets, and Celery task queues.

Strong skills in Infrastructure-as-Code (Terraform) and multi-cloud management (AWS/GCP/Azure).

Expertise in observability: Prometheus, Grafana, ELK, OpenTelemetry, and application performance monitoring.

Deep understanding of security, compliance, and disaster recovery in cloud-native systems.

Preferred Qualifications:

Experience with mobile CI/CD tools (Codemagic, Fastlane, Shorebird).

Familiarity with microservices communication patterns (gRPC, message queues).

Certified Kubernetes Administrator (CKA) or equivalent cloud certifications.

Experience scaling LLM endpoints and GPU-heavy ML services.

Familiarity with serverless deployments and event-driven architectures.

Why you'll love working here:

Highly competitive compensation package, including 100% salary during probation, designed to reward your talent and dedication.

13th-month salary and performance-based bonuses to celebrate your achievements and contributions.

Comprehensive Social Insurance calculated on gross salary, giving you peace of mind and full legal protection.

Generous and flexible time-off policy: 14 annual leave days + 6 sick leave days to recharge and maintain work-life balance.

Flexible working hours, allowing you to manage your schedule and work in a way that suits your lifestyle.

MacBook provided, ensuring you have the tools to perform at your best from day one.

Full compliance with the Vietnam Labor Code, so you can focus on work with confidence in a fair and lawful environment.

Opportunity to collaborate directly with global stakeholders, gaining exposure to international best practices and expanding your professional network.

A workplace that truly values your well-being: regular social events, sports clubs, gym activities, and team-building programs to foster connection and fun.

Vibrant, youthful, and international culture that encourages creativity, innovation, and continuous growth.

Fully stocked pantry with a variety of snacks, milk, and beverages to keep you energized throughout the day.

Recognition and rewards based on your experience, skills, and qualifications, ensuring your contributions are appreciated and fairly compensated.