Description:
As a Manager of the Site Reliability Engineering Tooling team, you will provide technical leadership and contribute hands-on code. You will apply reliability best practices across programming and scripting, observability, production triage, incident resolution, and retrospective/root cause analysis to maintain the world-class reliability and uptime of our platform.
About this roll* (Responsibilities)
- Enable a geographically distributed team of talented engineers to continue performing at a high level and help increase the impact of their work
- Drive day-to-day operations of the team and contribute to the development and prioritization of the SRE roadmap for major initiatives
- Create and drive strategic organization-wide scalability, observability, and reliability initiatives in collaboration with technical leadership and Product Management
- Influence architecture decisions for your team and for individual services to optimize resilience and scalability
- Guide teams to build and maintain systems that are reliable and available for Toast customers
- Facilitate professional growth by mentoring engineers on your team
Do you have the right ingredients*? (Requirements)
- Hands-on experience managing or leading an internal tools team, including hiring, mentoring, and cross-functional collaboration
- Hands-on coding experience in multiple programming languages: Java/JVM required, plus one or more of Kotlin, Go, Python, etc.
- Background in leading complex engineering projects in a Scrum environment
- Experience building and running distributed systems
- Exposure to networking, cloud architectures, and patterns
- Deep understanding of systems, networking, and scaling issues
- Direct exposure to cloud infrastructure and SaaS solutions