Description:
Tenable's cloud-based exposure management platform helps organizations see, understand, and reduce cyber risk across their entire attack surface. Our SRE teams keep that platform reliable, scalable, and secure — and we're building the next generation of tooling to do it smarter.
This role sits within our SRE Infrastructure Management organization on a focused team dedicated to reducing operational toil through AI-powered automation. You'll build intelligent systems that replace manual workflows — from incident diagnostics to infrastructure provisioning to upgrade automation — using LLMs, agentic architectures, and deep SRE domain knowledge.
This isn't an operations role with some AI on the side. You'll spend most of your time writing production code: designing and building agentic workflows, integrating across observability and infrastructure platforms, and measuring the impact of what you ship against real toil data.
Your Opportunity
- Design and build AI-powered agentic workflows that automate complex SRE operations — incident investigation, infrastructure provisioning, deployment reliability, and more.
- Improve the accuracy, reliability, and observability of agent pipelines through evaluation frameworks, prompt engineering, retrieval strategies, and structured output validation.
- Build developer tools and internal platforms — CLI tools, IDE plugins, and workflow automation — that engineers across the organization use daily.
- Build tooling that connects across the SRE tech stack — Kubernetes, Terraform, Helm, CI/CD pipelines, observability platforms, and cloud infrastructure APIs.
- Work on a focused team where everyone writes code, owns what they ship, and drives prioritization from measured toil data.
- Participate in the SRE on-call rotation; we use on-call as a direct input into what we build, not just a firefighting duty.
- Collaborate with SRE teams across the organization to identify automation opportunities and deliver tooling that gives engineers hours back.
What Success Looks Like
- Within your first few months, you've shipped an agentic workflow that automates a real SRE toil category — and engineers are using it.
- Within six months, you're independently designing and building AI-powered pipelines and contributing to the team's evaluation and accuracy practices, and your work is driving measurable toil reduction.
- Within a year, you've become a go-to contributor on the team — shaping the roadmap, mentoring others on AI + SRE patterns, and building systems that scale the team's impact across engineering.
What You'll Need
- 5+ years of SRE, platform engineering, or infrastructure engineering experience.
- Strong software engineering skills — you write production-quality code, not just scripts. Python is the primary language for our tooling stack.
- Experience building with LLMs and AI in production or infrastructure contexts — integrating models into real systems, not just experimentation.
- Experience building developer tools or internal platforms — CLI tools, IDE plugins, or workflow automation that other engineers use daily.
- Deep experience with Kubernetes (EKS preferred): deployment, troubleshooting, Helm chart management, and cluster operations.
- Experience with Infrastructure as Code (Terraform preferred) and CI/CD pipeline development.
- Strong experience with AWS services and APIs.
- Experience with observability platforms (Datadog, Coralogix, or similar) — both as a user during incidents and as an integration target for tooling.
- Solid background in Bash scripting and Linux systems.
- Comfortable working on a distributed team with an emphasis on asynchronous collaboration and documented decision-making.
- Bachelor's or Master's degree in Computer Science or Engineering, or equivalent experience.