At South Star, we are committed to empowering businesses in the telecom sector by providing comprehensive support and management solutions for their IT infrastructure. Based in Navi Mumbai, we specialize in offering a wide range of services, including application environment support, IT infrastructure environment management, network monitoring, end-user support, and service design & monitoring. Our focus on the telecom sector allows us to deliver exceptional results tailored to each client's needs.
Observability Engineer
Department: Corporate IT
Reports to: Director
South Star Software Private Limited is committed to fostering innovation, collaboration, and technical excellence in every aspect of its operations. As an Observability Engineer at South Star, you will join a dynamic team that values continuous learning and encourages initiative in driving transformative solutions across our platforms.
Position Summary
We are looking for an Observability Engineer to design, implement, and manage our enterprise-level monitoring and observability infrastructure. The successful candidate will be responsible for architecting robust observability solutions utilizing industry-leading platforms such as Grafana, Prometheus, AppDynamics, and Splunk Observability. This position supports engineering teams by providing advanced dashboards, effective alerting mechanisms, and comprehensive data correlation that deliver critical insights into system performance, reliability, and behavior.
Key Responsibilities
Architecture & Design
Design and implement scalable observability architectures that support monitoring across cloud, on-premises, and hybrid environments
Establish observability standards, patterns, and best practices across the organization
Evaluate and integrate new monitoring technologies and tools to enhance visibility capabilities
Design data retention, aggregation, and storage strategies for metrics, logs, and traces
Platform Management
Deploy, configure, and maintain enterprise monitoring platforms including Grafana, Prometheus, AppDynamics, and Splunk Observability
Ensure high availability, performance, and scalability of observability infrastructure
Manage platform upgrades, patches, and capacity planning
Integrate observability tools with existing CI/CD pipelines and infrastructure automation
Dashboard & Visualization Development
Create and maintain comprehensive dashboards that provide actionable insights for application and infrastructure teams
Build executive-level reporting dashboards for system health and performance metrics
Develop custom visualizations tailored to specific business and technical requirements
Implement role-based access and dashboard governance
Alerting & Incident Response
Design intelligent alerting strategies that minimize noise and prioritize critical issues
Configure multi-channel alert routing and escalation policies
Establish SLI/SLO/SLA frameworks and implement corresponding monitoring
Collaborate with incident response teams to improve detection and diagnosis capabilities
Conduct post-incident reviews to enhance monitoring coverage and alert accuracy
Collaboration & Enablement
Partner with development, operations, and security teams to instrument applications and infrastructure
Provide guidance on observability best practices, including logging standards, metrics collection, and distributed tracing
Conduct training sessions and create documentation for observability tools and practices
Act as the subject matter expert for monitoring-related questions and troubleshooting
Required Qualifications
3-5+ years of experience with enterprise monitoring and observability platforms
Hands-on expertise with Grafana, Prometheus, AppDynamics, and Splunk Observability (or similar tools)
Strong understanding of monitoring fundamentals: metrics, logs, traces, and events
Experience with containerized environments (Kubernetes, Docker)
Proficiency in scripting languages (Python, Bash, PowerShell) for automation
Knowledge of application performance monitoring (APM) concepts and practices
Experience with configuration management and infrastructure-as-code tools (Ansible, Terraform)
Understanding of networking, system administration, and distributed systems architecture
Preferred Qualifications
Experience with OpenTelemetry and distributed tracing implementations
Familiarity with PromQL, SPL (Splunk Processing Language), and other query languages
Knowledge of time-series databases (InfluxDB, TimescaleDB, Prometheus TSDB)
Experience implementing SRE practices and establishing SLI/SLO frameworks
Background in software development or DevOps engineering
Certifications in relevant monitoring platforms or cloud technologies
Experience in regulated industries with compliance monitoring requirements
Technical Skills
Monitoring Platforms: Grafana, Prometheus, AppDynamics, Splunk Observability
Scripting/Programming: Python, Bash, Go, PowerShell
Containers and Orchestration: Docker, Kubernetes, container monitoring best practices
Configuration Management and Infrastructure as Code: Ansible, Terraform, GitOps workflows
Data Formats: JSON, YAML, Prometheus exposition format
Version Control: Git, GitLab/GitHub
Personal Attributes
Strong analytical and problem-solving abilities
Excellent communication skills, with the ability to explain complex technical concepts
Self-motivated, with the ability to work independently and prioritize effectively
Detail-oriented, with a commitment to documentation and knowledge sharing
Collaborative mindset, with a focus on enabling team success
Shift Schedule
The standard in-office schedule is Monday to Friday, from 11:00 am to 8:00 pm IST. Remote work is permitted during maintenance windows.