đŸ‘šđŸ»â€đŸ’» postech.work

Founding Reliability & Performance Engineer

LiteLLM ‱ In Person

Posted 1 day, 10 hours ago

Job Description

LiteLLM is the world's most popular AI Gateway, used by some of the largest companies in the world (Adobe, Netflix, NASA, etc.) to give their developers access to LLMs and adjacent services (MCPs, vector stores, etc.).

When LiteLLM goes down, our customers' entire AI stack goes down. We need someone who makes sure that doesn't happen.

You'd be the first dedicated reliability hire. You'll own reliability, performance, and production stability end-to-end. Nobody will tell you how to do it.

What this job actually is

We'll be straight with you: this role is roughly 60% operational reliability and 40% deep performance engineering. On any given week you might be:

Hunting a memory leak in our async streaming handler that causes OOMs after 4 hours under load

Fixing a race condition where PodLockManager releases another pod's lock

Profiling why update_database() does 7 deep copies per request in the spend tracking hot path

Helping a Fortune 500 customer debug why their 20-pod deployment is exhausting Postgres connections

Building soak tests that catch degradation before a release goes out

Reviewing a PR that touches the request hot path and saying "this will add 50ms at P99, here's why"
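To make the memory-leak and soak-test items concrete, here's the rough shape of a tracemalloc-driven leak hunt inside a soak loop. The handler name below is a stand-in, not our actual code:

```python
# Hypothetical sketch: using tracemalloc to localize growth in a
# long-running async handler. `stream_completion` is a stand-in name.
import asyncio
import tracemalloc


async def stream_completion() -> None:
    # Placeholder for the async streaming work under test.
    await asyncio.sleep(0)


async def soak(iterations: int = 10_000) -> None:
    tracemalloc.start(25)  # keep 25 frames so diffs point at real call sites
    baseline = tracemalloc.take_snapshot()

    for i in range(iterations):
        await stream_completion()
        if i % 1_000 == 0:
            snapshot = tracemalloc.take_snapshot()
            # Top allocation growth since the baseline, grouped by traceback.
            for stat in snapshot.compare_to(baseline, "traceback")[:5]:
                print(stat)


asyncio.run(soak())
```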

If you're looking for a pure optimization role where you sit in a profiler all day — this isn't it. If you want to own production health for one of the most widely deployed AI infrastructure projects in the world — keep reading.

Why this matters

We route traffic for some of the largest AI deployments on the planet. One customer is scaling from 20M to 200M daily AI calls through our gateway. Another has 150K users hitting us daily. When we ship a bad release, it doesn't just break a dashboard — it breaks production AI systems at companies you've heard of.

The problems here are genuinely hard:

Memory management in long-running Python async services — our proxy handles thousands of concurrent streaming connections. HTTP client sessions, response iterators, and background tasks all need careful lifecycle management.

Database at scale — spend logging, auth, and rate limiting all interact with Postgres. At 100K+ requests/day, naive patterns fall apart.

100+ provider surface area — we translate between OpenAI, Anthropic, Bedrock, Vertex, and 100+ other APIs. Each has unique streaming behavior. A refactor that fixes one provider can break three others.
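For a flavor of the lifecycle discipline that first point describes, here's a rough sketch using httpx. Every name in it is illustrative, not LiteLLM internals:

```python
# Illustrative: HTTP client and background-task lifecycle in a
# long-running asyncio service. Not LiteLLM code.
import asyncio

import httpx


class Gateway:
    def __init__(self) -> None:
        # One shared client: per-request clients leak sockets under load.
        self._client = httpx.AsyncClient(
            limits=httpx.Limits(max_connections=100, max_keepalive_connections=20),
            timeout=httpx.Timeout(30.0),
        )
        # Keep strong references to background tasks; otherwise the event
        # loop may garbage-collect them mid-flight.
        self._tasks: set[asyncio.Task] = set()

    def spawn(self, coro) -> None:
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def stream(self, url: str):
        # `stream` is a context manager, so the response (and its socket)
        # is released even if the consumer stops iterating early.
        async with self._client.stream("GET", url) as resp:
            async for chunk in resp.aiter_bytes():
                yield chunk

    async def aclose(self) -> None:
        for task in self._tasks:
            task.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)
        await self._client.aclose()
```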

You won't run out of interesting problems.

What you'll own

Production reliability

On-call for critical issues (shared rotation with the team, not solo)

Incident response and blameless post-mortems

Customer escalation support for enterprise deployments

Making the proxy self-healing when DB/Redis is temporarily unavailable
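To give "self-healing" some shape, here's one common pattern sketched out: buffer writes while the DB is down, replay them on recovery. This is an illustration of the idea, not our actual implementation:

```python
# Hypothetical sketch: keep serving requests while the DB is briefly down
# by buffering writes and replaying them on recovery. Not LiteLLM code.
import asyncio
from collections import deque


class ResilientSpendLogger:
    def __init__(self, db_write, max_buffer: int = 10_000) -> None:
        self._db_write = db_write  # async callable that persists one record
        self._buffer: deque = deque(maxlen=max_buffer)  # bounded: don't OOM while degraded

    async def log(self, record: dict) -> None:
        try:
            await self._db_write(record)
        except ConnectionError:
            # DB is down: degrade instead of failing the user's request.
            self._buffer.append(record)

    async def flush_loop(self, interval: float = 5.0) -> None:
        # Background task: replay buffered records once the DB is back.
        while True:
            await asyncio.sleep(interval)
            while self._buffer:
                record = self._buffer[0]
                try:
                    await self._db_write(record)
                except ConnectionError:
                    break  # still down; try again next tick
                self._buffer.popleft()
```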

Performance engineering

Memory leak detection and prevention (soak tests, CI integration)

Hot path optimization — our target is <10ms overhead at 5K+ RPS

P50/P95/P99 latency benchmarks that block releases on regression

Profiling and fixing bottlenecks (Pydantic validation, connection pools, async task scheduling)
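As an example of the percentile-gate idea, a minimal sketch; the budgets below are invented numbers, not our published targets:

```python
# Illustrative P50/P95/P99 regression gate. Budgets are made-up examples.
import statistics


def check_latency_budget(samples_ms: list[float]) -> None:
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    q = statistics.quantiles(samples_ms, n=100)
    p50, p95, p99 = q[49], q[94], q[98]
    budgets = {"p50": (p50, 10.0), "p95": (p95, 25.0), "p99": (p99, 50.0)}
    failures = [f"{name}={value:.1f}ms > {limit}ms"
                for name, (value, limit) in budgets.items() if value > limit]
    if failures:
        # A CI job raising here blocks the release.
        raise SystemExit("latency regression: " + ", ".join(failures))
```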

Observability & release safety

Structured logging, distributed tracing, correlation IDs

Prometheus metrics that are actually accurate and actionable

Building toward canary deployments and automated rollback

SLO definition and tracking for enterprise customers
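For flavor, a minimal sketch of two of these primitives (a Prometheus histogram plus a contextvar-based correlation ID). Metric and label names are invented:

```python
# Illustrative only: a request-latency histogram plus a correlation ID
# stamped onto every log line. Metric/label names are invented.
import logging
import time
import uuid
from contextvars import ContextVar

from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "gateway_request_seconds", "Request latency by provider", ["provider"]
)
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Stamp each record with the current request's correlation ID."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter("%(asctime)s [%(correlation_id)s] %(message)s"))
logging.getLogger().addHandler(handler)


async def handle(provider: str, call):
    # One ID per request; it shows up in every log line emitted downstream.
    correlation_id.set(uuid.uuid4().hex[:8])
    start = time.perf_counter()
    try:
        return await call()
    finally:
        REQUEST_LATENCY.labels(provider=provider).observe(time.perf_counter() - start)
```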

Who you are

Must have:

2+ years of experience running Python services in production, with real exposure to debugging things that break at scale

Strong understanding of Python async internals — asyncio event loop, aiohttp/httpx session management, connection pooling

Experience debugging production memory leaks, OOMs, or latency degradation (bonus if you've used memray, py-spy, or tracemalloc)

Solid PostgreSQL knowledge — connection pool tuning, query optimization, understanding how DB operations on the request path degrade under load

Comfort with Kubernetes at an operational level — pod lifecycle, resource limits, health probes

You've been on-call before and you didn't hate it

Strong signals:

You've worked on a proxy, API gateway, load balancer, or middleware service where overhead itself is what you optimize

You've worked at Meta (Production Engineering), Cloudflare, Fastly, Datadog, Stripe, or a similar infrastructure company

You've been an early reliability/infra hire at a startup and built production practices from scratch

You've contributed to open-source infrastructure projects

You understand HTTP/2, streaming responses (SSE), and how async Python handles them under concurrency

Why LiteLLM

Scale & impact: Your work is in the critical path for hundreds of millions of AI API calls daily. NASA, Netflix, Adobe, Stripe depend on this.

Open source visibility: 36K GitHub stars. Your contributions are visible to the entire AI infrastructure community. Your GitHub profile will look incredible.

Ownership: First dedicated reliability hire. You define what reliability means here. No bureaucracy, no tickets — you see a problem, you fix it.

Trajectory: $7M ARR growing fast, 10-person team, YC W23. Meaningful equity at a stage where it can matter.

About LiteLLM

LiteLLM (https://github.com/BerriAI/litellm) is a Python SDK and Proxy Server (LLM Gateway) for calling 100+ LLM APIs in the OpenAI format (Bedrock, Azure, OpenAI, VertexAI, Cohere, and more). It is used by companies like Rocket Money, Adobe, Twilio, and Siemens.
