A growing technology team in Nairobi is hiring an experienced Site Reliability Engineer (SRE). This full-time, onsite position is ideal for professionals who are passionate about cloud infrastructure, automation, system reliability, and operational excellence.

The role focuses on ensuring the availability, scalability, and performance of cloud-based systems while helping build resilient infrastructure that supports business growth.

Job Overview

Position: Site Reliability Engineer (SRE)

Employment Type: Full-Time, Onsite

Location: Nairobi, Kenya

Industry: Information Technology & Engineering

Experience Required: 3–5 Years

Salary Range: KSh 50,000 – KSh 100,000 per Month

Education Level: Bachelor's Degree, Higher National Diploma (HND), or Equivalent Qualification

About the Role

As a Site Reliability Engineer, you will be responsible for maintaining highly available systems, improving deployment processes, automating operational tasks, and strengthening infrastructure reliability.

You will work closely with development teams to ensure applications are scalable, observable, and capable of delivering a seamless user experience.

Key Responsibilities

Build and Maintain CI/CD Pipelines

Design and implement reliable Continuous Integration and Continuous Deployment (CI/CD) pipelines.
Improve software delivery speed while maintaining system stability.
Support efficient and repeatable deployment processes.

Cloud Infrastructure Management

Manage cloud environments and services hosted on AWS.
Deploy and maintain applications using:
- Amazon ECS
- Amazon EC2
- Application Load Balancers (ALB)
Ensure infrastructure remains scalable, secure, and cost-efficient.

Monitoring and Observability

Monitor system performance and application health.
Create dashboards, alerts, and reporting mechanisms.
Analyze logs and metrics to identify and resolve issues proactively.

Automation and Scripting

Automate repetitive operational tasks using Python and Bash.
Improve operational efficiency through workflow automation.
Reduce manual intervention across infrastructure processes.

Incident Management

Lead incident response activities during system disruptions.
Conduct root cause analysis following incidents.
Implement corrective measures to prevent recurring issues.

Disaster Recovery and Resilience

Develop and maintain disaster recovery strategies.
Manage backups and failover mechanisms.
Conduct resilience testing to ensure business continuity.

Developer Collaboration

Partner with software engineering teams to optimize deployment workflows.
Improve application instrumentation and observability.
Support performance optimization initiatives.

Required Qualifications

Education

Applicants should possess:

A Bachelor's Degree in Computer Science, Information Technology, Engineering, or a related field; or
An equivalent Higher National Diploma (HND).

Professional Experience

Candidates should have:

3 to 5 years of experience in:
- Site Reliability Engineering (SRE)
- DevOps Engineering
- Cloud Operations
- Infrastructure Engineering
Proven experience managing production cloud environments.

Required Technical Skills

AWS Cloud Expertise

Strong hands-on experience with:

Amazon ECS
Amazon EC2
Application Load Balancer (ALB)
Cloud infrastructure management

Monitoring and Observability Tools

Experience working with:

Prometheus
Grafana
Loki
ELK Stack

Containerization

Advanced knowledge of Docker.
Experience deploying and managing containerized applications.

CI/CD Development

Building and maintaining deployment pipelines.
Automating software release processes.

Programming and Automation

Proficiency in:

Python
Bash scripting

Troubleshooting

Strong analytical and problem-solving skills.
Ability to investigate and resolve complex production issues efficiently.

Preferred Qualifications

Candidates with the following additional skills will have an advantage:

Infrastructure as Code (IaC)

Experience using Terraform for infrastructure automation.

Kubernetes Experience

Familiarity with Kubernetes environments, especially Amazon EKS.

Database Operations

Knowledge of MongoDB Atlas administration and monitoring.

Cost Optimization

Experience improving cloud resource utilization and reducing infrastructure costs.

What Success Looks Like

Successful performance in this role will result in:

Reliable Systems

Highly available and scalable infrastructure.
Reduced downtime and improved service reliability.

Enhanced Visibility

Clear monitoring and observability across all applications and services.
Faster issue detection and response.

Improved Incident Management

Reduced frequency of critical incidents.
Faster recovery times when issues occur.

Increased Automation

Efficient deployment and operational workflows.
Reduced manual processes and improved productivity.

Why Join This Opportunity?

This position offers the chance to work with modern cloud technologies, automation tools, and scalable infrastructure solutions. It is an excellent opportunity for professionals looking to deepen their expertise in Site Reliability Engineering while contributing to high-impact technology projects.

Conclusion

If you have experience in DevOps, cloud infrastructure, or Site Reliability Engineering and are passionate about building reliable and scalable systems, this Nairobi-based opportunity could be the next step in your technology career.
