Senior Site Reliability Engineer (SRE)
Full-Time | 100% Remote (USA)
Dive into a high-impact SRE role where you'll safeguard and supercharge a cutting-edge data platform. We're talking rock-solid reliability for apps that deliver real-time business wins. Join us to architect the invisible magic that keeps everything humming!
What You'll Rock
Own Reliability End-to-End: Jump into on-call rotations, nail incident response, and lead postmortems to make systems bulletproof.
Build Epic Infra: Design, deploy, and scale cloud setups with Terraform and IaC across AWS (must), GCP, and Azure.
Master Kubernetes: Run clusters on EKS with Bottlerocket OS and Cilium/eBPF for next-level networking and security.
Streamline Deploys: Roll out apps via Helm and FluxCD, while plotting upgrades to fully autonomous operators.
Amp Up Observability: Set up monitoring stacks with OpenTelemetry, Prometheus, and Grafana's LGTM stack (Loki for logs, Tempo for traces, Mimir for metrics) to spot issues before they bite.
Team Up for Wins: Partner with product and eng crews to bake reliability into every feature from day one.
Must-Have Superpowers
Battle-tested in high-stakes prod environments: On-call heroics, swift incident handling under tight SLAs, and crystal-clear comms for escalations.
Hands-on AWS wizardry with Terraform; bonus for GCP or Azure.
Deep Kubernetes know-how: Cluster ops, Helm charts, community operators (like CNPG), and GitOps tools like Flux.
Linux and networking ninja: TCP/IP mastery, plus security, compliance, and hot tech like eBPF/Cilium.
Comfort with OpenTelemetry and Prometheus for observability awesomeness.
If you're a reliability rockstar ready to tame chaos and build unbreakable systems, apply now—let's make downtime a myth!