About Maneva
Maneva builds and deploys edge AI solutions powering real-time intelligence for industrial environments. Our systems run on distributed edge compute devices (NVIDIA Jetson platforms), integrate with local network cameras, PLCs, sensors, and other on-premise equipment, and securely communicate with cloud services via client-based or site-to-site VPNs. Our customers rely on our systems around the clock, and we take reliability seriously.
Weʼre seeking a Site Reliability Engineer (SRE) who enjoys solving complex operational challenges, improving observability and automation, and supporting mission-critical workloads in production.
About The Role
As a Site Reliability Engineer at Maneva, you will ensure the reliability, availability, and performance of our edge AI deployments at customer sites. This includes gaining deep familiarity with Manevaʼs hardware platform, networking configurations, and application stack so that you can rapidly diagnose and resolve issues as they arise.
The role includes participating in an on-call rotation for 24/7 incident response, including off-hours coverage as part of a structured global support model. When not responding to incidents, you will contribute to long-term engineering initiatives around monitoring, automation, reliability, and documentation.
Responsibilities
Operational Support & Incident Response
Serve as a first responder for production issues, alarms, and system outages (24/7 rotation required)
Troubleshoot Linux system issues, hardware problems, networking connectivity, and edge-device performance
Perform root-cause analysis (RCA) and implement corrective and preventive solutions
Document incidents, contributing to a culture of transparency and process improvement
Proactive Monitoring & Observability
Build and maintain robust monitoring dashboards and alerts using Prometheus, Grafana, and similar tools
Continuously improve observability, including metrics, logs, traces, and health checks
Analyze trends to proactively identify reliability risks before incidents occur
Develop automation to reduce alert noise and make alerts more actionable
Systems Reliability & DevOps Engineering
Improve deployment workflows, CI/CD pipelines, configuration management, and automated provisioning
Create tools and scripts in Python/Bash to streamline operational processes
Contribute to load testing, system validation, and network health verification for edge deployments
Implement best practices for secure, scalable, and maintainable infrastructure
Infrastructure & Application Ownership
Understand and operate Manevaʼs end-to-end edge AI stack:
Jetson/embedded Linux systems
GPU-accelerated workloads for computer vision
Video pipelines (RTSP, camera interfaces, data ingestion)
Local integrations (PLCs, industrial hardware, APIs, network resources)
VPN-based connectivity (client-based or site-to-site)
Maintain visibility into device health and fleet-wide system performance
Documentation & Process Development
Create and maintain SOPs for on-site customer teams and internal engineering workflows
Produce detailed incident reports and reliability documentation
Maintain internal knowledge bases, troubleshooting guides, and playbooks
Requirements
Technical Skills
Strong Linux systems administration experience (Ubuntu, embedded Linux, ARM systems)
Proficiency in Python and/or Bash for scripting and operations automation
Solid networking fundamentals: TCP/IP, routing, DNS, DHCP, VPNs, VLANs, firewall rules
Familiarity with troubleshooting tools: tcpdump, nmap, iftop, netstat, etc.
Hands-on experience with Prometheus, Grafana, or similar monitoring/alerting platforms
Experience with logging/observability stacks (ELK/EFK, Loki, Fluentd, etc.) is a plus
Experience with Docker or containerized applications is desirable
Comfort supporting distributed or remote device fleets
Soft Skills
Excellent diagnostic and analytical abilities under pressure
Strong communication skills with both technical and non-technical stakeholders
High ownership mentality and ability to follow issues through to resolution
Comfortable working independently in a fully remote environment
Willingness to participate in on-call rotation, including off-hours and weekends
Preferred Qualifications
Experience supporting machine learning, computer vision, or GPU-accelerated systems
Familiarity with NVIDIA Jetson or other embedded AI hardware
Prior SRE/DevOps/Systems Engineer experience in a 24/7 operational environment
Exposure to industrial IoT, manufacturing systems, or operational technology (OT)
Experience writing customer-facing operational documentation or SOPs
Benefits
What We Offer
Fully remote work environment with flexibility (within on-call requirements)
Opportunities to work with cutting-edge edge compute and AI deployments
A high-impact role shaping reliability practices from early stages
Contract or full-time options, with competitive compensation
A collaborative team committed to transparency, improvement, and excellence