About Us:

Sokowatch is transforming communities across Africa by revolutionizing access to essential goods and services. By connecting small merchants to the digital economy, we fix inefficient supply chains and provide services previously unavailable to informal businesses. Sokowatch aims to provide everything a retailer needs, with no distributors or banks necessary. Thousands of retailers across Kenya, Tanzania, Rwanda, and Uganda use Sokowatch's mobile ordering and delivery platform to receive the goods they need as quickly and cheaply as possible while also accessing growth financing for the first time. We’re looking to grow our team with highly talented and motivated employees who are excited to work in a fast-paced and dynamic startup environment.

Position: Site Reliability Engineer (SRE - L1/L2) - Reporting to SRE Manager

The Site Reliability Engineering (SRE) team at Sokowatch fills the mission-critical role of ensuring that our complex, large-scale systems are healthy, monitored, automated, and designed to scale. The team consists of highly skilled engineers, and our prime responsibility is to bring reliability to Sokowatch production systems, applications & infrastructure. We automate application, system and infrastructure monitoring, cloud security, alerting, and troubleshooting. To accomplish these tasks we use and maintain many commercial and open-source tools and languages, such as Java, MuleSoft, SAP, Magento, Python, New Relic, GCP, and Tableau.

Locations: Bangalore, India or Nairobi, Kenya

Duties & Responsibilities:

  • Act as a point of escalation for incidents and other issues arising within the production systems, applications & infrastructure.
  • Apply excellent troubleshooting, debugging and incident resolution skills to the various applications, systems & infrastructure platforms across the organization.
  • Ensure timely resolution and documentation of incidents through root cause analysis and preemptive measures to avoid similar incidents.
  • Monitor organization-wide applications, systems & infrastructure for faults, alarms, and other errors, and inform internal teams as required by process and procedure.
  • Provide or develop stop-gap scripts, code, solutions & measures to limit system outages and operations/business disruptions.
  • Excellent written and oral communication skills.

Requirements:

  • 5+ years of prior experience in company-wide site reliability engineering and management at a large company.
  • Expert in cloud-native infrastructure, Linux, shell scripting, networking, Java/Python & databases.
  • Expert programmer and problem solver with TechOps, DevOps, database & infrastructure engineering skills.
  • Bachelor's or Master's degree in a quantitative field from a premier institute.
  • Sound understanding of areas in Computer Science such as Algorithms, Data Structures, Object-Oriented Design, and Databases. Proficiency in at least one modern programming language such as Java, JavaScript or Python.