Key Responsibilities
Cloud Infrastructure Operations
: Maintain and manage AWS services (Lambda, ECS, EKS, Redshift, Glue, SES, GuardDuty, etc.) in production, ensuring uptime, availability, and secure operations.
Incident Management
: Monitor infrastructure, manage alerts, and provide timely resolution of production incidents.
Infrastructure-as-Code (IaC)
: Design and maintain infrastructure deployment pipelines using tools like Terraform, CloudFormation, and Ansible.
Patch and Lifecycle Management
: Oversee patch management for RHEL and Windows environments using AWS Patch Manager, WSUS, and YUM/DNF, ensuring compliance with security standards.
SSL & EOL Management
: Track SSL certificate renewals and manage end-of-life components like OS versions and Lambda runtimes.
Tool Integration & Monitoring
: Integrate and optimize tools such as NGINX, and work with SRE teams to enhance infrastructure monitoring.
Documentation & Reporting
: Maintain accurate and up-to-date documentation (runbooks, change logs, post-mortems, and audit reports).
Collaboration & Mentorship
: Collaborate with cross-functional teams and mentor junior engineers in cloud operations and best practices.
Security & Compliance
: Ensure infrastructure adheres to strict security policies and to compliance and audit requirements.
Continuous Improvement
: Drive automation, performance optimizations, and proactive incident prevention to enhance overall cloud operations.
Key Requirements
Education
: Bachelor’s degree in Computer Science, Information Systems, or a related field.
Experience
: At least 6 years in DevOps/SRE roles, with a minimum of 4 years in public sector or regulated cloud environments.
Cloud Expertise
: Hands-on production experience with AWS services such as Lambda, ECS, and EKS.
IaC Skills
: Proficiency in Terraform, CloudFormation, and Ansible for infrastructure automation.
OS Administration
: Strong administration skills in RHEL (v8→v9) and Windows Server (2016→2025).
Patching Expertise
: Experience managing patches across multiple operating systems using AWS Patch Manager, WSUS, and YUM/DNF.
Security & Compliance
: Knowledge of SSL certificate management and end-of-life (EOL) remediation processes.
Incident Management \& Troubleshooting
: Strong problem-solving and incident management skills with the ability to troubleshoot complex systems.
Soft Skills
: Excellent communication, collaboration, adaptability, and time management, with a continuous-learning mindset.
To apply, please email your updated resume to
weizhe.teoh@tg-hr.com
We regret that only shortlisted candidates will be notified.
CEI: R25127749
EA License: 14C7275