I. MAJOR RESPONSIBILITIES AND DUTIES:
- Configure and maintain the enterprise monitoring tool to provide real-time visibility into the state of health across the technology stack
- Design and create dashboards that provide multi-level views based on functional requirements, such as executive and tactical views
- Create and maintain key thresholds across all monitoring elements to ensure proactive, early detection of impending incidents or problems
- Analyze events and correlate them across all observability and monitoring tools to capture trends and behavior patterns that support proactive courses of action
- Design, develop, and use automation tools and scripts to handle repetitive actions and, where possible, create corrective actions that prevent or shorten prolonged outages
- Work closely with the operations team during incident and problem management to respond quickly to issues identified by the monitoring tools
- Regularly review and optimize infrastructure performance using logs, metrics, and traces as part of continuous improvement, adjusting thresholds and monitoring requirements as the environment changes
- Develop and maintain a robust alerting strategy, including integration with on-call tools to ensure timely escalation and resolution of critical issues.
- Implement and manage end-to-end event lifecycle processes to ensure accurate incident detection and efficient response.
II. JOB SPECIFICATIONS:
Educational Requirement:
- Bachelor’s degree in Computer Science, Information Technology, or a related field; or equivalent work experience.
Experience Requirement:
- 2–5+ years of extensive experience as a systems and network administrator
- Hands-on experience managing monitoring tools such as, but not limited to, SolarWinds and Nagios
- Demonstrated understanding of what observability is and what it does
Skills and Attributes:
- Proficient with major cloud platforms such as AWS, GCP, Azure and Alibaba Cloud
- Hands-on experience with SNMP-based monitoring tools such as SolarWinds, Nagios, and Checkmk
- Good grasp of observability platforms such as Splunk and Dynatrace
- Experience with containerization platforms such as Docker and Kubernetes
- Extensive experience with virtualization technologies such as VMware
- Strong knowledge of networking based on a collapsed architecture or similar enterprise networking technology
- Knowledgeable in scripting languages such as Python, Bash, or PowerShell.
- AWS Certified Solutions Architect, Azure Solutions Architect, or equivalent certification.
- Certified Kubernetes Administrator (CKA)
- Solid understanding of disaster recovery and business continuity practices.
Other Qualifications:
- Strong analytical skills to identify, troubleshoot, and resolve complex technical issues.
- Excellent verbal and written communication skills for interacting with team members, stakeholders, and end-users. Ability to explain technical concepts to non-technical audiences.
- Ability to work effectively in a team environment and collaborate with other IT groups
- Effective prioritization and management of multiple tasks and projects.
- Flexibility to adapt to changing technologies, tools, and business requirements.
- Proactive in identifying areas for improvement and suggesting enhancements.
- Ability to train junior team members
- Ability to work under pressure and remain decisive