PAVE is an innovative automotive technology company transforming the way the world inspects vehicles. Powered by Intelligent Damage Detection capabilities, PAVE enables anyone with a smartphone to complete a guided vehicle inspection simply by taking photos of their car.

Headquartered in Toronto, our team brings deep expertise from both the automotive and technology industries, blending the best of artificial intelligence and automotive intelligence.

For more information, visit pave.ai.

Position Overview

We're seeking a skilled Site Reliability Engineer to join our DevOps team and ensure the stability and reliability of our enterprise vehicle inspection platform. Reporting to the Lead DevOps Engineer, you'll play a critical role in our GCP to AWS migration while maintaining and improving system reliability. As an SRE at PAVE.ai, you'll implement best practices for monitoring, incident response, and automation to achieve 99.9%+ uptime. You'll work hands-on with AWS infrastructure to build resilient systems that process millions of vehicle inspections for dealerships, fleet operators, insurers, and vehicle marketplaces globally.

Key Responsibilities

System Reliability & Stability

Monitor and maintain production systems to ensure 99.9%+ uptime

Implement proactive monitoring and alerting to detect issues before they impact customers

Perform root cause analysis for incidents and implement permanent fixes

Create and maintain runbooks for common operational procedures

Participate in 24/7 on-call rotation and incident response

Conduct regular reliability reviews and implement improvements

AWS Infrastructure Management

Deploy and manage AWS services including EC2, ECS/EKS, RDS, S3, CloudFront

Optimize AWS infrastructure for performance, cost, and reliability

Implement AWS best practices for security, backup, and disaster recovery

Configure auto-scaling policies and load balancing for high availability

Manage AWS networking components (VPC, Security Groups, ALB/NLB)

Support migration efforts from GCP to AWS under Lead DevOps guidance

Monitoring & Observability

Design and implement comprehensive monitoring solutions using CloudWatch, Prometheus, Grafana

Set up distributed tracing and application performance monitoring

Create meaningful dashboards and alerts for service health

Define and track SLIs (Service Level Indicators) for critical services

Implement log aggregation and analysis using ELK stack or similar

Establish baseline metrics and identify performance anomalies

Automation & Infrastructure as Code

Develop automation scripts to reduce manual operations and toil

Implement Infrastructure as Code using Terraform and CloudFormation

Create CI/CD pipelines for reliable and repeatable deployments

Automate routine tasks such as backups, scaling, and maintenance

Build self-healing mechanisms for common failure scenarios

Develop tools to improve developer productivity and deployment velocity

Performance Optimization

Analyze system performance and identify bottlenecks

Optimize application and database performance

Implement caching strategies to reduce latency

Conduct load testing and capacity planning

Fine-tune resource allocation and utilization

Optimize cloud costs without compromising reliability

Incident Management

Respond to production incidents with urgency and professionalism

Follow incident management procedures and escalation protocols

Document incidents and contribute to post-mortem analysis

Implement preventive measures based on incident learnings

Improve MTTR (Mean Time To Recovery) through better tooling and processes

Maintain incident communication with stakeholders

Collaboration & Documentation

Work closely with development teams to improve application reliability

Provide guidance on reliability best practices during design phase

Document infrastructure, procedures, and troubleshooting guides

Share knowledge through team presentations and training sessions

Collaborate on capacity planning and scaling strategies

Support developers with production debugging and optimization

Required Qualifications

Experience

2-5 years of experience in DevOps, SRE, or Infrastructure Engineering

2+ years of hands-on AWS experience in production environments

Experience maintaining high-traffic, high-availability systems

Proven track record of improving system reliability and uptime

Experience with 24/7 on-call responsibilities and incident management

Technical Skills

AWS Expertise:

Strong proficiency with core AWS services (EC2, S3, RDS, VPC, IAM)

Experience with container services (ECS, EKS, ECR)

Knowledge of AWS monitoring and logging (CloudWatch, CloudTrail)

Understanding of AWS security best practices

Experience with AWS CLI and SDKs

Familiarity with AWS Well-Architected Framework

SRE & DevOps Tools:

Infrastructure as Code: Terraform, CloudFormation, or AWS CDK

Configuration management: Ansible, Chef, or Puppet

CI/CD tools: Jenkins, GitLab CI, GitHub Actions

Containerization: Docker, Kubernetes, Helm

Version control: Git, GitHub/GitLab

Scripting languages: Python, Bash, or Go

Monitoring & Observability:

Prometheus, Grafana, or similar metrics platforms

Log management: ELK Stack, OpenSearch, or CloudWatch Logs

APM tools: New Relic, Datadog, or OpenSearch

Distributed tracing: Jaeger, Zipkin, or AWS X-Ray

Alert management: PagerDuty, Opsgenie, or similar

Technical Fundamentals:

Strong Linux/Unix system administration skills

Networking concepts: TCP/IP, DNS, Load Balancing, CDN

Database administration: PostgreSQL, MySQL, Redis, MongoDB

Understanding of distributed systems and microservices

Knowledge of security principles and best practices

Experience with performance tuning and optimization

Soft Skills

Strong problem-solving and troubleshooting abilities

Excellent written and verbal communication skills in both English and Vietnamese

Ability to work effectively under pressure during incidents

Detail-oriented with strong documentation skills

Team player with collaborative mindset

Proactive approach to identifying and solving problems

Continuous learning mindset for new technologies

Preferred Qualifications

AWS certifications (SysOps Administrator, DevOps Engineer, or Solutions Architect)

Experience with GCP and cloud migration projects

Knowledge of SRE practices from Google's SRE book

Experience with AI/ML infrastructure and GPU workloads

Familiarity with automotive industry or vehicle inspection systems

Experience with chaos engineering and failure injection

Knowledge of compliance frameworks (SOC2, ISO 27001)

Experience with serverless architectures (Lambda, API Gateway)

Contributions to open-source DevOps/SRE projects

Experience with FinOps and cloud cost optimization

Success Metrics

Maintain 99.9%+ uptime for assigned services

Reduce incident MTTR by 30% within first year

Automate 50% of manual operational tasks

Zero critical security incidents

Achieve all SLO targets for assigned services

Complete AWS migration tasks on schedule

What We Offer

Competitive salary

Flexible work arrangements, including hybrid options

13th-month bonus in accordance with company policy

Comprehensive health, dental, and vision insurance for the employee and one dependent

Professional development budget for AWS certifications

On-call compensation and time-off policies

Opportunity to work with cutting-edge cloud technologies

Career growth path to Senior SRE or Lead positions

Collaborative and innovative work environment

Location

Hybrid, based in District 1, Ho Chi Minh City.

Site Reliability Engineer
