The Role
Luminance’s Site Reliability team combines strong problem solving, infrastructure tooling and wider DevOps practices to operate and deliver Luminance’s unique software applications as a service. The team plays a crucial role in incident response and issue resolution, swiftly addressing service interruptions to maintain the highest level of customer satisfaction. With a focus on automation, scalability, reliability and security, the team enables Luminance to ensure a performant, seamless experience for its users. The Site Reliability team is a small, dynamic group of creative engineers who work together to tackle some of Luminance’s greatest challenges, with new problems and technology areas to dig into on a regular basis.
Roles and Responsibilities
System Monitoring: Implement, manage, and develop internal monitoring tools to ensure system health and quickly detect anomalies. Respond to and resolve incidents efficiently to maintain uptime.
Automation: Develop automation solutions for infrastructure management, issue resolution and deployment processes, streamlining operations and reducing manual work.
Infrastructure Management: Manage cloud infrastructure to ensure reliability and scalability, collaborating with teams to design robust solutions.
Incident Management: Conduct post-incident analysis to identify root causes, implement preventive measures, and enhance system resilience.
Security and Compliance: Maintain security best practices and compliance standards, working with security teams to address vulnerabilities proactively.
Collaboration and Communication: Partner with development and operations teams, fostering communication and promoting reliability best practices across the organization.
Requirements
Master’s degree in Computer Science, Engineering or a related subject from a Go8 university.
Excellent problem-solving skills, including diagnosing issues within complex systems.
Ability and desire to identify root causes of issues, and propose and implement structural improvements.
Strong communication skills and the ability to perform well in urgent, high-pressure situations.
Knowledge of the design and operation of web-based software applications built on technologies such as Node.js, PostgreSQL or Elasticsearch.
Knowledge of modern infrastructure and operational tooling within cloud-based architectures, such as Linux, Python, AWS, Ansible and Prometheus.