We are working with a decentralised exchange that aims to combine the best of CEXs and DEXs, focusing on building a safe, simple and scalable platform for trading. They differentiate themselves by offering institutional-level systems and support whilst remaining on-chain and decentralised.
We are seeking a Senior Site Reliability Engineer to join our team in ensuring the stability, scalability, and performance of a cutting-edge platform. You will balance production reliability with engineering-driven automation, reducing manual processes through innovative tooling and process improvements. This role requires a strong commitment to on-call ownership and a passion for building resilient, observable, and self-healing infrastructure.
Key Responsibilities
Design, implement, and maintain scalable infrastructure for a high-performance, low-latency trading platform.
Operate and enhance Kubernetes and Nomad-based environments to ensure system stability, scalability, and security.
Develop infrastructure automation and deployment pipelines using Terraform, Ansible, ArgoCD, and GitHub Actions.
Collaborate with engineering teams to streamline service onboarding, automate repetitive tasks, and improve deployment efficiency.
Enhance observability and reliability through improved logging, metrics, tracing, and alerting using the Grafana ecosystem.
Perform root cause analysis and postmortems for production incidents, driving continuous improvements in system resilience and incident response.
Work with security and compliance teams to ensure infrastructure meets regulatory and organizational standards.
Support multi-environment deployments (dev, staging, testnet, mainnet) with a focus on safe rollouts, rollbacks, and configuration management.
Contribute to capacity planning, cost optimization, and infrastructure scaling strategies to support platform growth.
Experience & Skills Requirements
5+ years of relevant experience as a DevOps Engineer or SRE.
Proven ability to participate in an on-call rotation, demonstrating ownership in incident response and a focus on long-term system stability.
Extensive experience operating and maintaining low-latency, distributed systems in production environments.
Proficiency with cloud-native platforms and container orchestration tools, including AWS, GCP, Kubernetes, and Nomad.
Strong knowledge of Linux/Unix internals and the TCP/IP networking stack.
Proficiency in one or more of: Bash, Go, or Python.
Expertise in root cause analysis, performance tuning, and system-level debugging in complex service architectures.
Experience building and managing end-to-end infrastructure, including infrastructure as code, CI/CD pipelines, and monitoring systems.
Familiarity with modern GitOps workflows and tools such as GitHub Actions, ArgoCD, Argo Workflows, and Argo Events.
Ability to own production systems end-to-end, from infrastructure as code to automated monitoring and deployment workflows.
A pragmatic approach that favours depth, ownership, and a bias for action over broad familiarity.
Bonus: experience with the Aeron messaging system is a strong advantage.