Lead DevOps Engineer — Tenant Inc. (Technical-Oriented)
Overview
Tenant Inc. is executing a multi-phase modernization of its cloud platform, infrastructure, and release engineering ecosystem. The Lead DevOps Engineer functions as the technical owner of the company’s cloud delivery stack, responsible for end-to-end reliability, automation, observability, and environment governance across a multi-tenant SaaS platform serving more than 1,700 properties.
This role drives architectural evolution, enforces engineering rigor, and partners with Engineering, QA, SRE, Security, and Product to harden release processes, standardize environments, and increase deployment throughput while improving system resilience and operational predictability.
Key Responsibilities
Platform Architecture & Infrastructure
Architect, scale, and optimize AWS-based infrastructure including EKS clusters, VPC topology, multi-AZ networking, service mesh patterns, and secure multi-tenant isolation models.
Modernize compute, storage, and networking layers using cloud-native primitives (EKS, ALB/NLB, ASG, EBS gp3/io2, ElastiCache, RDS/Aurora) to support high availability, horizontal scaling, and deterministic deployments.
Own the full environment lifecycle (Dev → QA → UAT → Stage → Prod), ensuring deterministic configuration, infrastructure immutability, and compliance with internal governance and audit requirements.
Define and enforce infrastructure standards, golden AMIs/base images, cluster add-ons, and environment-level configuration baselines.
CI/CD, Release Governance & Automation
Re-architect CI/CD pipelines to support microservices, multi-repo workflows, automated test orchestration, and progressive delivery strategies.
Implement blue/green, canary, and feature-flag-driven deployment models using GitOps, Argo CD, Helm, and workload identity.
Automate versioning, artifact promotion, rollback orchestration, and release validation workflows with strong auditability and traceability.
Partner with Engineering and QA to enforce quality gates, environment readiness checks, and predictable release cadences aligned with release governance frameworks.
Observability, Monitoring & Incident Readiness
Lead the deployment, tuning, and integration of observability platforms including OpenSearch, Prometheus, Grafana, New Relic, Catchpoint, CloudWatch, and distributed tracing frameworks.
Define and operationalize SLOs, SLIs, and error budgets for core platform services, integrating them into alerting and escalation workflows.
Build automated health checks, synthetic monitoring, and self-healing mechanisms (auto-remediation, circuit breakers, restart policies) to reduce MTTR.
Drive structured incident response, post-incident analysis, and continuous improvement cycles across engineering teams.
Security, Compliance & Governance
Integrate SAST, DAST, dependency scanning, and container image scanning into CI/CD pipelines to enforce secure-by-default delivery.
Implement IAM governance, workload identity, secrets management (AWS Secrets Manager / HashiCorp Vault), and hardened configuration baselines across all environments.
Partner with leadership to maintain SOC 2, PCI DSS, and internal compliance controls across infrastructure, pipelines, and release processes.
Leadership & Cross-Functional Collaboration
Serve as the technical lead for DevOps, platform reliability, and cloud modernization initiatives.
Mentor DevOps and SRE engineers globally, establishing engineering standards, documentation practices, and operational excellence.
Collaborate with Engineering, QA, Product, and Architecture to align infrastructure strategy with Tenant’s modernization roadmap and long-term platform evolution.
Drive cross-team initiatives including environment standardization, sandbox architecture, load-testing readiness, and operational maturity programs.
Operational Excellence & Cost Governance
Lead root cause analysis using structured methodologies (5 Whys, fishbone, fault-tree analysis) and drive systemic remediation across the platform.
Reduce manual toil through automation, job orchestration, and platform tooling improvements.
Partner with leadership to optimize AWS spend, implement cost-aware architectures, and improve cost visibility across compute, storage, and data services.
Required Qualifications
7+ years of experience in DevOps, SRE, or Platform Engineering roles supporting cloud-native, distributed systems.
Deep expertise with AWS, Kubernetes (EKS), Terraform, Helm, GitHub Actions, and GitOps workflows.
Strong scripting and automation skills (Python, Bash, Go, Node.js).
Hands-on experience with observability and APM stacks (OpenSearch, Prometheus, Grafana, New Relic, Catchpoint, CloudWatch, clickstream analytics).
Strong understanding of networking, distributed systems, and multi-tenant SaaS architectures.
Proven ability to lead technical initiatives and mentor globally distributed teams.
Preferred Qualifications
Experience operating in high-growth SaaS environments with complex release pipelines and multi-service architectures.
Familiarity with event-driven architectures (SNS/SQS, EventBridge), asynchronous processing, and message-driven workflows.
Experience with data pipelines, ETL orchestration, and analytics platforms.
Background in cloud cost optimization, FinOps practices, and cost-efficient architectural design.
Experience stabilizing or modernizing legacy systems and migrating them to cloud-native patterns.
Success Indicators at Tenant Inc.
Predictable, low-risk releases with measurable reductions in deployment failures and rollback frequency.
Fully standardized Dev → QA → UAT → Stage → Prod environments with consistent configuration and deployment behavior.
Improved observability maturity, reduced MTTR, and fewer customer-impacting incidents.
Automated, auditable release governance (ECRs) integrated into engineering and product workflows.
Strong documentation, cross-team alignment, and measurable improvements in engineering velocity.