Job Summary
We are seeking a Director of Site Reliability Engineering (SRE) to lead and evolve our reliability, observability, and automation initiatives across cloud-native, multi-tenant systems. This role is critical in driving uptime, performance, and efficiency for our production and customer-facing environments while fostering a culture of continuous improvement and operational excellence.
Essential Functions
- Evolve a high-performing SRE team into a strategic, forward-leaning engineering force focused on innovation, automation, and measurable business impact
- Define and drive an advanced SRE roadmap centered on self-healing systems, adaptive scaling, and platform resilience
- Advance existing SLAs, SLOs, and SLIs into predictive, business-aligned reliability models; formalize executive-level SLO reporting
- Lead efforts to evolve observability into a proactive, AI/ML-driven capability for anomaly detection, early warning, and service health forecasting
- Strengthen incident response by integrating intelligent automation, enhancing runbooks, and refining on-call strategies for faster mitigation
- Expand chaos engineering and resilience testing practices across critical systems; institutionalize capacity stress testing and failover validation
- Refine CI/CD pipelines to support safe, high-frequency deployments with zero-touch rollback and dynamic environment provisioning
- Institutionalize Infrastructure as Code (IaC) patterns to drive repeatable, auditable infrastructure operations at scale
- Optimize FinOps practices with actionable insights into cost vs. performance tradeoffs and service-level ROI
- Drive deeper integration between SRE, Security, and Compliance for faster detection, triage, and resolution of security incidents
- Balance system reliability and deployment velocity by analyzing error rates and stability indicators
- Conduct Blameless Postmortems (BPM) for priority 1 incidents
- Provide go-live leadership for high-stakes brand launches and system expansions on the NextGen platform
- Partner with architecture and product teams to embed observability, scalability, and cost awareness into solution design
- Modernize disaster recovery operations to meet aggressive RTO/RPO objectives with fully automated failover mechanisms
- Resolve existing technical debt and avoid creating new technical debt
- Oversee vendor performance, contract renewals, and third-party compliance across tooling and infrastructure partnerships
- Ensure quarterly contractor audits, identity governance, and system access reviews are thorough and timely
- Cultivate a culture of continuous learning, experimentation, and innovation through coaching, advanced training, and stretch assignments
- Develop a continuous improvement framework based on agile retrospectives, SLIs, and service reviews
- Elevate the team's visibility and influence across the organization by aligning technical outcomes with business value
Education
- Bachelor’s degree in Information Systems or a related discipline (required)
Work Experience
- Minimum 10 years of experience in software development or information technology
- Minimum 5 years working with cloud-native solutions, preferably with Azure
- Minimum 5 years of experience in DevOps and/or Site Reliability Engineering
- Minimum 4 years of people management (hiring, mentoring, and managing engineering staff)
- Strong knowledge of Infrastructure as Code (IaC)
- Experience with pipeline-based SDLC CI/CD automation
- Experience working on a Scrum team
Skills
- Ability to communicate complex technical concepts to the executive team, business leaders, and franchisees.
- Ability to develop and maintain positive business relationships and foster an environment of mutual respect, understanding, trust, and support.
- Ability to coach employees in a positive manner.
- Ability to facilitate the resolution of different views.
- Ability to collect information from others without putting them in a defensive posture.
- Ability to adapt and adjust planned work by analyzing work demands, competing priorities, and tight deadlines, and to understand the most effective and efficient means to accomplish tasks within the parameters of the organizational structure, processes, systems, and policies.
- Ability to exercise judgment and discretion in dealing with matters of a significant and sensitive nature.
- Excellent organizational, communication, and leadership skills.
- Excellent analytical and problem-solving skills.
- Ability to develop, communicate, and implement strategies and tactics.
- Strong business acumen and sense of urgency to achieve business results.
Certifications
Travel Requirement