About this role
An exciting opportunity is available for an experienced Site Reliability Engineer (SRE) to join a growing technology team in Nairobi. This full-time onsite position is ideal for professionals who are passionate about cloud infrastructure, automation, system reliability, and operational excellence.
The role focuses on ensuring the availability, scalability, and performance of cloud-based systems while helping build resilient infrastructure that supports business growth.
Job Overview
Position: Site Reliability Engineer (SRE)
Employment Type: Full-Time, Onsite
Location: Nairobi, Kenya
Industry: Information Technology & Engineering
Experience Required: 3–5 Years
Salary Range: KSh 50,000 – KSh 100,000 per Month
Education Level: Bachelor's Degree, Higher National Diploma (HND), or Equivalent Qualification
About the Role
As a Site Reliability Engineer, you will be responsible for maintaining highly available systems, improving deployment processes, automating operational tasks, and strengthening infrastructure reliability.
You will work closely with development teams to ensure applications are scalable, observable, and capable of delivering a seamless user experience.
Key Responsibilities
Build and Maintain CI/CD Pipelines
Design and implement reliable Continuous Integration and Continuous Deployment (CI/CD) pipelines.
Improve software delivery speed while maintaining system stability.
Support efficient and repeatable deployment processes.
Cloud Infrastructure Management
Manage cloud environments and services hosted on AWS.
Deploy and maintain applications using Amazon ECS, Amazon EC2, and Application Load Balancers (ALB).
Ensure infrastructure remains scalable, secure, and cost-efficient.
Monitoring and Observability
Monitor system performance and application health.
Create dashboards, alerts, and reporting mechanisms.
Analyze logs and metrics to identify and resolve issues proactively.
Automation and Scripting
Automate repetitive operational tasks using Python and Bash.
Improve operational efficiency through workflow automation.
Reduce manual intervention across infrastructure processes.
Incident Management
Lead incident response activities during system disruptions.
Conduct root cause analysis following incidents.
Implement corrective measures to prevent recurring issues.
Disaster Recovery and Resilience
Develop and maintain disaster recovery strategies.
Manage backups and failover mechanisms.
Conduct resilience testing to ensure business continuity.
Developer Collaboration
Partner with software engineering teams to optimize deployment workflows.
Improve application instrumentation and observability.
Support performance optimization initiatives.
Required Qualifications
Education
Applicants should possess:
A Bachelor's Degree in Computer Science, Information Technology, Engineering, or a related field; or
An equivalent Higher National Diploma (HND).
Professional Experience
Candidates should have:
3 to 5 years of experience in Site Reliability Engineering (SRE), DevOps engineering, cloud operations, or infrastructure engineering.
Proven experience managing production cloud environments.
Required Technical Skills
AWS Cloud Expertise
Strong hands-on experience with:
Amazon ECS
Amazon EC2
Application Load Balancers (ALB)
Cloud infrastructure management
Monitoring and Observability Tools
Experience working with:
Prometheus
Grafana
Loki
ELK Stack
Containerization
Advanced knowledge of Docker.
Experience deploying and managing containerized applications.
CI/CD Development
Experience building and maintaining deployment pipelines.
Experience automating software release processes.
Programming and Automation
Proficiency in:
Python
Bash scripting
Troubleshooting
Strong analytical and problem-solving skills.
Ability to investigate and resolve complex production issues efficiently.
Preferred Qualifications
Candidates with the following additional skills will have an advantage:
Infrastructure as Code (IaC)
Experience using Terraform for infrastructure automation.
Kubernetes Experience
Familiarity with Kubernetes environments, especially Amazon EKS.
Database Operations
Knowledge of MongoDB Atlas administration and monitoring.
Cost Optimization
Experience improving cloud resource utilization and reducing infrastructure costs.
What Success Looks Like
Successful performance in this role will result in:
Reliable Systems
Highly available and scalable infrastructure.
Reduced downtime and improved service reliability.
Enhanced Visibility
Clear monitoring and observability across all applications and services.
Faster issue detection and response.
Improved Incident Management
Reduced frequency of critical incidents.
Faster recovery times when issues occur.
Increased Automation
Efficient deployment and operational workflows.
Reduced manual processes and improved productivity.
Why Join This Opportunity?
This position offers the chance to work with modern cloud technologies, automation tools, and scalable infrastructure solutions. It is an excellent opportunity for professionals looking to deepen their expertise in Site Reliability Engineering while contributing to high-impact technology projects.
Conclusion
If you have experience in DevOps, cloud infrastructure, or Site Reliability Engineering and are passionate about building reliable and scalable systems, this Nairobi-based opportunity could be the next step in your technology career.
Application Deadline
18 July 2026

