Senior Staff Site Reliability Engineer

 

Description:

The SRE org at Matillion is made up of multiple teams which, combined, own the operation and efficiency of our cloud platforms and services. It covers everything from the build, provisioning and maintenance of our cloud infrastructure to the reliability, capability management, observability, monitoring and metrics of our SaaS platform.

Reporting to the Director of SRE and Observability, you will use your experience across all pillars of Site Reliability Engineering to drive best practice aimed at enhancing our ability to build truly reliable, observable and performant infrastructure for all our core services. Your experience building modern, multi-cloud platforms will play a pivotal role as we continue to modernise our stack and implement a wide range of new tools around logging, monitoring, metrics and alerting.

This role can be fully remote in Ireland, with travel to our UK office required on a quarterly basis.

Technologies You’ll Use: Kubernetes, AWS, ArgoCD, Terraform, Datadog, Prometheus, Golang/Python.

What You’ll Be Doing:
 

  • Leading the design of major software components, systems, and features to improve the availability, scalability, latency, and efficiency of Matillion’s SaaS services
  • Driving the design, implementation and management of our expanding observability infrastructure, keeping up to date with new tools and technologies and being a recognised member of the broader Observability community
  • Leading sustainable incident response, blameless postmortems, and production improvements that result in direct business opportunities for Matillion
  • Defining and documenting best practices across all pillars of SRE
  • Providing guidance and mentorship to other team members on managing the end-to-end availability and performance of critical services, and on design techniques and coding standards, to cultivate innovation and collaboration across the business
  • Balancing competing priorities as you manage a range of individual projects, deadlines, and deliverables
     

What We’re Looking For:
 

  • A passion for everything performance, observability, availability, scalability and security
  • Extensive experience with Kubernetes and the surrounding ecosystem with tools such as Linkerd, Traefik and ArgoCD
  • An adopter and champion of core SRE principles including SLAs, SLOs, automation, proactive monitoring, release and deployment
  • Exposure to working with high-traffic, large-scale web operations in AWS
  • Ability to manage and provision infrastructure as code with Terraform or CloudFormation, and to build internal tooling in languages such as Go or Python
  • A solid understanding of networking systems and protocols

Organization Matillion
Industry Engineering
Occupational Category Senior Staff Site Reliability Engineer
Job Location Dublin, Ireland
Shift Type Morning
Job Type Full Time
Gender No Preference
Career Level Intermediate
Experience 2 Years
Posted at 2024-01-01 4:45 pm
Expires on 2024-05-27