Manager, Cloud Platforms
Site Reliability Engineering
Information Technology/Infrastructure
Remote: CST or EST time zone highly preferred!
OVERVIEW
As FTD’s Cloud Platform and SRE Manager, you will champion a technological and cultural transformation toward DevOps and SRE practices, enabling efficient delivery and operation of high-quality, reliable, secure software at scale. As a hands-on leader, you will architect, engineer, optimize, and operate our Google Cloud Platform environment, including Google Kubernetes Engine; re-envision and innovate Continuous Integration & Continuous Delivery and Infrastructure as Code solutions; and incubate and proliferate Site Reliability Engineering principles and practices to ensure the stability and reliability of our commerce platforms.
KEY RESPONSIBILITIES
- Provide thought leadership and strategic guidance to FTD’s technology division in cloud architectures, as well as DevOps and SRE principles and practices
- Lead and develop a team of engineers engaged in Google Cloud architecture and engineering, CI/CD, Site Reliability Engineering, Kubernetes administration, and related operational support
- Drive adoption of SRE principles including SLOs and SLIs, error budgets, metrics-driven observability and decision-making, automating repetitive tasks, chaos engineering, and incident and problem management processes
- Collaborate with technology teams to streamline, document, and support CI/CD automation leveraging Jenkins, Bitbucket, and other tools, with an eye toward modernization and innovation
- Provision and manage cloud resources and configuration using Terraform, Google Cloud SDK, kubectl, Google Cloud Console, and other tools, and drive adoption of consistent provisioning practices
- Promote development and security best practices and implement supporting automation, with a “shift left” mentality
- Implement and maintain effective infrastructure and application observability solutions to improve visibility and streamline incident detection, response, and prevention
- Troubleshoot and resolve a wide range of issues in CI/CD, Google Cloud Platform (GCP), Google Kubernetes Engine (GKE), and other technologies
- Provide leadership in incident response and problem management activities to rapidly restore service and subsequently prevent recurrence
- Perform continuous cloud cost analysis, attribution, and optimization
- Maintain compliance with relevant security frameworks (e.g. SOC 2, CIS), standards (e.g. PCI-DSS), and regulations (e.g. CCPA), including participation in audits and assessments
- Promote and practice agile workflows and processes within your team (Kanban preferred)
- Create and maintain technical, procedural, and educational documentation and diagrams related to FTD’s network ecosystem
- Embrace a culture of collaboration, enablement, customer service, continuous improvement, transparency, and financial responsibility
- Perform other duties as directed
KNOWLEDGE, SKILLS AND ABILITIES
- Bachelor's or advanced degree in Computer Science, Information Systems, or a related field, or equivalent experience
- 5+ years architecting, delivering, and operating scalable, reliable, high-performance, and secure infrastructure and applications in on-prem and cloud environments (Google Cloud Platform or similar)
- 2+ years managing one or more high-performing teams in close collaboration with colleagues across various technical disciplines
- 2+ years in software engineering with languages such as Java, C#, Python, JavaScript, and related frameworks, ideally in a fast-paced 24x7 e-commerce environment
- Google Professional Cloud Architect or similar certification desired
- Broad experience in infrastructure technologies including networking, systems engineering, databases, information security, virtualization, backup and restore, observability, etc.
- Advanced experience with CI/CD methodologies and technologies, including Jenkins implementations leveraging various plug-ins and Groovy-based customizations
- Proficiency with microservices principles and orchestration, including containerization (e.g. Docker) and Kubernetes (e.g. Google Kubernetes Engine)
- Experience with rapid detection and troubleshooting of technical issues using various observability and Application Performance Monitoring (APM) tools
- Strong experience leveraging Infrastructure as Code (e.g. Terraform) and related tools for infrastructure provisioning and configuration
- Excellence in navigating and prioritizing multiple simultaneous responsibilities of varying scope and complexity
- Ability to effectively articulate technical concepts through oral, written, and non-verbal communication to audiences at varying levels of proficiency
- Demonstrated desire and ability to be self-directed, take ownership of issues, share knowledge, and establish a high level of credibility
- Ability to operate effectively under pressure, both independently and in collaboration with others across multiple disciplines
DIRECT REPORTS
- Senior DevOps Engineer (US)
- Senior Cloud Platform Engineer (India)
- Cloud Platform Engineer (India)
- Site Reliability Engineer (India)
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by applicable laws, regulations and ordinances.