
Site Reliability Engineer
- On-site, Hybrid
- Tehran, Iran
- Tech
Job description
About Snapp
Snapp is the pioneering provider of ride-hailing mobile solutions in Iran, connecting smartphone owners in need of a ride with Snapp drivers who use their private cars to offer transportation services. We are ambitious, passionate, engaged, and excited about pushing the boundaries of the transportation industry to new frontiers and being the first choice of each user in Iran.
About the Role
We’re looking for a Site Reliability Engineer (SRE) to help scale and stabilize our systems as we grow. As part of the SRE team, you'll work on automating operations, managing incidents, and supporting the infrastructure that enables our developers and QA teams to build and release with confidence.
You'll be responsible for improving system reliability, monitoring, and observability while ensuring high availability across environments. This role includes participation in a 24/7 shift or on-call rotation.
What You’ll Do
Manage Incidents: Respond to incidents, perform root cause analysis, and help drive resolution and recovery.
Monitor & Alert: Improve and tune monitoring systems (Grafana, Prometheus) to ensure issues are detected early.
Participate in On-Call: Join a rotating on-call schedule to monitor systems and respond to critical alerts.
Collaborate Across Teams: Work closely with developers, QA, and product engineers to support releases and operational improvements.
Improve Stability: Proactively identify and fix reliability issues that could affect production uptime.
Automate Operations: Build scripts and tools to eliminate manual work and reduce operational overhead.
Deploy Services: Assist in deploying and maintaining services across staging and production environments.
Support Staging: Troubleshoot and resolve issues in pre-production environments to unblock QA and development teams.
Job requirements
At least 2 years of experience in a DevOps, SRE, or infrastructure engineering role
Solid understanding of SRE principles: SLIs, SLOs, SLAs, and error budgets
Experience with Python (or another scripting language)
Hands-on experience with CI/CD tools and pipelines
Comfortable with Linux systems administration
Experience with monitoring and observability tools: Prometheus, Grafana
Familiarity with logging stacks (e.g., ELK, Loki) and tracing tools (e.g., Jaeger, Tempo)
Knowledge of databases such as PostgreSQL/MySQL and Redis
Practical experience with Kubernetes, Docker, and Helm
Bonus Points
Experience working in a microservices or distributed system environment