Director, Site Reliability Engineering

GoTo Foods

Atlanta, GA

Job Summary

We are seeking a Director of Site Reliability Engineering (SRE) to lead and evolve our reliability, observability, and automation initiatives across cloud-native, multi-tenant systems. This role is critical in driving uptime, performance, and efficiency for our production and customer-facing environments while fostering a culture of continuous improvement and operational excellence.

Essential Functions

  • Evolve a high-performing SRE team into a strategic, forward-leaning engineering force focused on innovation, automation, and measurable business impact
  • Define and drive an advanced SRE roadmap centered on self-healing systems, adaptive scaling, and platform resilience
  • Advance existing SLAs, SLOs, and SLIs into predictive, business-aligned reliability models; formalize executive-level SLO reporting
  • Lead efforts to evolve observability into a proactive, AI/ML-driven capability for anomaly detection, early warning, and service health forecasting
  • Strengthen incident response by integrating intelligent automation, enhancing runbooks, and refining on-call strategies for faster mitigation
  • Expand chaos engineering and resilience testing practices across critical systems; institutionalize capacity stress testing and failover validation
  • Refine CI/CD pipelines to support safe, high-frequency deployments with zero-touch rollback and dynamic environment provisioning
  • Institutionalize Infrastructure as Code (IaC) patterns to drive repeatable, auditable infrastructure operations at scale
  • Optimize FinOps practices with actionable insights into cost vs. performance tradeoffs and service-level ROI
  • Drive deeper integration between SRE, Security, and Compliance for faster detection, triage, and resolution of security incidents
  • Balance system reliability and deployment velocity by analyzing error rates and stability indicators
  • Conduct Blameless Postmortems (BPM) for priority 1 incidents
  • Provide go-live leadership for high-stakes brand launches and system expansions on the NextGen platform
  • Partner with architecture and product teams to embed observability, scalability, and cost awareness into solution design
  • Modernize disaster recovery operations to meet aggressive RTO/RPO objectives with fully automated failover mechanisms
  • Resolve existing technical debt and avoid introducing new debt
  • Oversee vendor performance, contract renewals, and third-party compliance across tooling and infrastructure partnerships
  • Ensure quarterly contractor audits, identity governance, and system access reviews are thorough and timely
  • Cultivate a culture of continuous learning, experimentation, and innovation through coaching, advanced training, and stretch assignments
  • Develop a continuous improvement framework based on agile retrospectives, SLIs, and service reviews
  • Elevate the team's visibility and influence across the organization by aligning technical outcomes with business value

Education

  • Bachelor’s degree in Information Systems or a related discipline (required)

Work Experience

  • Minimum 10 years of experience in software development or information technology
  • Minimum 5 years working with cloud-native solutions, preferably with Azure
  • Minimum 5 years of experience in DevOps and/or Site Reliability Engineering
  • Minimum 4 years of people management (hiring, mentoring, and managing engineering staff)
  • Strong knowledge of Infrastructure as Code (IaC)
  • Experience with pipeline-based CI/CD automation across the SDLC
  • Experience working on a scrum team

Skills

  • Ability to communicate complex technical concepts to the executive team, business leaders, and franchisees.
  • Ability to develop and maintain positive business relationships and foster an environment of mutual respect, understanding, trust, and support.
  • Ability to coach employees in a positive manner.
  • Ability to facilitate the resolution of different views.
  • Ability to collect information from others without putting them on the defensive.
  • Ability to adapt planned work by analyzing work demands, competing priorities, and tight deadlines, and to identify the most effective and efficient means of accomplishing tasks within the organization's structure, processes, systems, and policies.
  • Ability to exercise judgment and discretion in dealing with matters of a significant and sensitive nature.
  • Excellent organizational, communication, and leadership skills.
  • Excellent analytical and problem-solving skills.
  • Ability to develop, communicate and implement strategies and tactics.
  • Strong business acumen and sense of urgency to achieve business results.

Certifications

Travel Requirement

  • None
