Job Title: Site Reliability Engineer
Location: Toronto, ON (Hybrid - 4 Days Onsite per Week)
Employment Type: Contract Opportunity
Interview Type: Face-to-Face (Onsite Interview Only)
Job Description
What is the Opportunity?
We are seeking a Senior Site Reliability Engineer for our Application Maintenance and Transformation, Data Services and Integration team. As a Senior Site Reliability Engineer, you will bring an engineering mindset of bold ambition, curiosity, and outcome focus to ensuring the performance and reliability of our systems. This role calls for a dynamic individual who excels in a collaborative environment and works with cross-functional teams to establish best practices for observability, monitoring, logging, alerting, and automation.
What will you do?
Set the vision for the SRE product base (monitoring, alerting, self-healing, reliability testing).
Lead cross-functional collaborations to define and implement best practices for monitoring, logging, and incident response, driving a proactive stance on system health.
Function as portfolio SME (Subject Matter Expert) – understand and document the common components, core functionalities, and infrastructure of supported applications.
Actively participate in deploying software applications, automation tools, and IT infrastructure.
Work closely with development teams to understand code changes and their impact on the production environment, ensuring that new releases meet our reliability standards.
Drive transformation by continuously looking for ways to automate existing SRE processes and increase operational efficiency.
Guide the technical direction for future deployments, advocating for reliability and performance improvements based on industry trends and company objectives.
Lead incident management and problem management for in-scope applications, and own the fulfillment of RCA action items.
Debug production issues across services and levels of the stack and provide primary operational support.
Perform occasional off-hours support.
Must-have
Bachelor’s degree in Computer Science, Electrical or Electronics Engineering or related field or equivalent experience.
3+ years of IT experience in software development and/or maintenance, SRE, or DevOps engineering.
1+ years of experience building Java Spring Boot applications and developing REST APIs.
Experience working with relational databases (MS SQL Server, MySQL, MariaDB, or SingleStore) or in-memory distributed databases.
Experience working with containerization platforms such as Docker and container orchestration tools like Kubernetes (Azure Kubernetes Service or OpenShift preferred).
Solid Git skills and experience working with popular CI tools such as Jenkins or UCD.
Experience working on Windows- and Linux-based infrastructure.
1+ years of experience developing cloud-native applications using Java or Python.
Experience writing, tuning, and optimizing SQL queries.
Experience using centralized logging solutions (Splunk, ELK (preferred), etc.) and active monitoring systems (Dynatrace, etc.).
Experience deploying and operating cloud-native applications in a private cloud (OpenShift) or public cloud (Azure/AWS preferred).
Strong, proactive communication skills around the status of projects and production issues.
Must be a self-starter who is motivated, resourceful, and driven to work with cross-functional teams in large enterprises with complex org structures to meet business delivery timelines.
Financial services domain knowledge, preferably in Capital Markets and Wealth Management.
Nice-to-have
Experience implementing dashboards that help teams visualize logs, instrumentation, and other data to ensure optimal performance of the platform services, infrastructure, and deployed applications (Grafana preferred).
Exposure to data warehouse technologies like Informatica, Snowflake, or Databricks, and business intelligence tools like SAP BO or similar.
Experience creating runbooks, processes, and test plans around reliability, performance, etc. of infrastructure and applications.
Exposure to PagerDuty, Postman, ServiceNow, SonarQube, NexusIQ and vault tools.
Exposure to event brokers like Kafka or IBM MQ, and to mainframe tools and environments.
Exposure to industry disaster recovery test exercises.