Bachelor's degree in Computer Science, Engineering, or related field
5+ years of experience in Site Reliability Engineering or similar roles
Strong experience with cloud platforms (AWS/Azure/GCP) and infrastructure-as-code
Extensive knowledge of monitoring tools (e.g., Prometheus, Grafana, ELK Stack)
Proficiency in at least one programming language (Python, Go, or Java preferred)
Experience with containerization and orchestration (Docker, Kubernetes)
Strong understanding of networking, system design, and distributed systems
Key Responsibilities
Command Center Design & Implementation:
Architect and implement a centralized command center that provides comprehensive visibility into both infrastructure and application layers
Establish standardized operational procedures, runbooks, and escalation protocols for incident management
Design and implement monitoring solutions that provide real-time insights into system health, performance metrics, and business KPIs
Operations Management:
Lead the development of automated remediation solutions for common operational issues
Implement and maintain SLOs/SLIs across critical services and applications
Drive continuous improvement in incident response times and system reliability metrics
Collaborate with development teams to ensure applications are designed with operational excellence in mind
Tool Development & Integration:
Develop and maintain monitoring dashboards that provide actionable insights for both technical and non-technical stakeholders
Implement and customize monitoring tools for infrastructure and application performance monitoring
Create automation scripts and tools to streamline operational processes
Integrate various monitoring and alerting systems to provide a unified view of system health
Leadership & Collaboration:
Mentor junior engineers in SRE practices and command center operations
Collaborate with security, development, and infrastructure teams to ensure comprehensive monitoring coverage
Partner with business stakeholders to align monitoring strategies with business objectives
Lead post-incident reviews and drive implementation of the resulting improvements
Preferred Qualifications:
Experience in designing and implementing enterprise-scale command centers
Knowledge of AIOps and machine learning for IT operations
Certification in relevant cloud platforms or technologies
Experience with chaos engineering and resilience testing
Background in implementing ITIL practices across IT services
Excellent problem-solving and analytical abilities
Strong communication skills and ability to work with cross-functional teams
Experience in incident management and on-call rotations
Proven track record of improving system reliability and performance
Ability to handle high-pressure situations and make quick decisions