Reviewing system performance metrics and addressing any anomalies.
Leading incident response calls and coordinating with relevant teams.
Meeting with stakeholders to discuss reliability goals and progress.
Developing scripts and automation tools for system maintenance tasks.
Conducting training sessions for team members on best practices.
Planning and executing system upgrades and infrastructure improvements.
Detailed Job Description
Monitoring and Performance: Setting up and maintaining monitoring tools and dashboards to track system performance and detect issues proactively.
Team Leadership: Leading and mentoring the SRE team, ensuring they have the resources and guidance needed to perform their roles effectively.
System Design and Architecture: Overseeing the design and architecture of reliable systems, ensuring scalability, fault tolerance, and high availability.
Incident Management: Coordinating the response to incidents, conducting post-mortems, and implementing measures to prevent recurrence.
Automation: Developing and promoting automation for repetitive tasks to reduce human error and improve efficiency.
Stakeholder Management: Meeting regularly with stakeholders to discuss reliability goals, project progress, and obstacles to achieving high system reliability.
Training & Development: Conducting training sessions for team members on best practices, new tools, and techniques to enhance their skill sets.
Skills: architecture, management, automation, training, training and development, dashboards, system design, automation tools, monitoring tools, incident response, stakeholder management, reliability engineering, reliability