PAVE 🔧🚗 is an innovative automotive technology company transforming the way the world inspects vehicles. Powered by Intelligent Damage Detection capabilities, PAVE enables anyone with a smartphone to complete a guided vehicle inspection simply by taking photos of their car.
Headquartered in Toronto, our team brings deep expertise from both the automotive and technology industries, blending the best of artificial intelligence and automotive intelligence.
For more information, visit pave.ai.
Position Overview
We're seeking a skilled Site Reliability Engineer to join our DevOps team and ensure the stability and reliability of our enterprise vehicle inspection platform. Reporting to the Lead DevOps Engineer, you'll play a critical role in our GCP-to-AWS migration while maintaining and improving system reliability. As an SRE at PAVE, you'll implement best practices for monitoring, incident response, and automation to achieve 99.9%+ uptime. You'll work hands-on with AWS infrastructure to build resilient systems that process millions of vehicle inspections for dealerships, fleet operators, insurers, and vehicle marketplaces globally.
Key Responsibilities
System Reliability & Stability
Monitor and maintain production systems to ensure 99.9%+ uptime
Implement proactive monitoring and alerting to detect issues before they impact customers
Perform root cause analysis for incidents and implement permanent fixes
Create and maintain runbooks for common operational procedures
Participate in 24/7 on-call rotation and incident response
Conduct regular reliability reviews and implement improvements
AWS Infrastructure Management
Deploy and manage AWS services including EC2, ECS/EKS, RDS, S3, CloudFront
Optimize AWS infrastructure for performance, cost, and reliability
Implement AWS best practices for security, backup, and disaster recovery
Configure auto-scaling policies and load balancing for high availability
Manage AWS networking components (VPC, Security Groups, ALB/NLB)
Support migration efforts from GCP to AWS under the Lead DevOps Engineer's guidance
Monitoring & Observability
Design and implement comprehensive monitoring solutions using CloudWatch, Prometheus, Grafana
Set up distributed tracing and application performance monitoring
Create meaningful dashboards and alerts for service health
Define and track SLIs (Service Level Indicators) for critical services
Implement log aggregation and analysis using ELK stack or similar
Establish baseline metrics and identify performance anomalies
Automation & Infrastructure as Code
Develop automation scripts to reduce manual operations and toil
Implement Infrastructure as Code using Terraform and CloudFormation
Create CI/CD pipelines for reliable and repeatable deployments
Automate routine tasks such as backups, scaling, and maintenance
Build self-healing mechanisms for common failure scenarios
Develop tools to improve developer productivity and deployment velocity
Performance Optimization
Analyze system performance and identify bottlenecks
Optimize application and database performance
Implement caching strategies to reduce latency
Conduct load testing and capacity planning
Fine-tune resource allocation and utilization
Optimize cloud costs without compromising reliability
Incident Management
Respond to production incidents with urgency and professionalism
Follow incident management procedures and escalation protocols
Document incidents and contribute to post-mortem analysis
Implement preventive measures based on incident learnings
Improve MTTR (Mean Time To Recovery) through better tooling and processes
Maintain incident communication with stakeholders
Collaboration & Documentation
Work closely with development teams to improve application reliability
Provide guidance on reliability best practices during the design phase
Document infrastructure, procedures, and troubleshooting guides
Share knowledge through team presentations and training sessions
Collaborate on capacity planning and scaling strategies
Support developers with production debugging and optimization
Required Qualifications
Experience
2-5 years of experience in DevOps, SRE, or Infrastructure Engineering
2+ years of hands-on AWS experience in production environments
Experience maintaining high-traffic, high-availability systems
Proven track record of improving system reliability and uptime
Experience with 24/7 on-call responsibilities and incident management
Technical Skills
AWS Expertise:
Strong proficiency with core AWS services (EC2, S3, RDS, VPC, IAM)
Experience with container services (ECS, EKS, ECR)
Knowledge of AWS monitoring and logging (CloudWatch, CloudTrail)
Understanding of AWS security best practices
Experience with AWS CLI and SDKs
Familiarity with AWS Well-Architected Framework
SRE & DevOps Tools:
Infrastructure as Code: Terraform, CloudFormation, or AWS CDK
Configuration management: Ansible, Chef, or Puppet
CI/CD tools: Jenkins, GitLab CI, GitHub Actions
Containerization: Docker, Kubernetes, Helm
Version control: Git, GitHub/GitLab
Scripting languages: Python, Bash, or Go
Monitoring & Observability:
Prometheus, Grafana, or similar metrics platforms
Log management: ELK Stack, OpenSearch, or CloudWatch Logs
APM tools: New Relic, Datadog, or OpenSearch
Distributed tracing: Jaeger, Zipkin, or AWS X-Ray
Alert management: PagerDuty, Opsgenie, or similar
Technical Fundamentals:
Strong Linux/Unix system administration skills
Networking concepts: TCP/IP, DNS, Load Balancing, CDN
Database administration: PostgreSQL, MySQL, Redis, MongoDB
Understanding of distributed systems and microservices
Knowledge of security principles and best practices
Experience with performance tuning and optimization
Soft Skills
Strong problem-solving and troubleshooting abilities
Excellent written and verbal communication skills in both English and Vietnamese
Ability to work effectively under pressure during incidents
Detail-oriented with strong documentation skills
Team player with a collaborative mindset
Proactive approach to identifying and solving problems
Continuous learning mindset for new technologies
Preferred Qualifications
AWS certifications (SysOps Administrator, DevOps Engineer, or Solutions Architect)
Experience with GCP and cloud migration projects
Knowledge of SRE practices from Google's SRE book
Experience with AI/ML infrastructure and GPU workloads
Familiarity with automotive industry or vehicle inspection systems
Experience with chaos engineering and failure injection
Knowledge of compliance frameworks (SOC 2, ISO 27001)
Experience with serverless architectures (Lambda, API Gateway)
Contributions to open-source DevOps/SRE projects
Experience with FinOps and cloud cost optimization
Success Metrics
Maintain 99.9%+ uptime for assigned services
Reduce incident MTTR by 30% within the first year
Automate 50% of manual operational tasks
Zero critical security incidents
Achieve all SLO targets for assigned services
Complete AWS migration tasks on schedule
What We Offer
Competitive salary
Flexible work arrangements, including hybrid options
13th-month bonus in accordance with company policy
Comprehensive health, dental, and vision insurance for the employee and one dependent
Professional development budget for AWS certifications
On-call compensation and time-off policies
Opportunity to work with cutting-edge cloud technologies
Career growth path to Senior SRE or Lead positions
Collaborative and innovative work environment
Location
Hybrid, District 1, Ho Chi Minh City.