Unlock Your Future with Nexaminds!

At Nexaminds, we're on a mission to redefine industries with AI. We're passionate about the limitless potential of artificial intelligence to transform businesses, streamline processes, and drive growth.

Join us on our visionary journey. We're leading the way in AI solutions, and we're committed to innovation, collaboration, and ethical practices. Become a part of our team and shape the future powered by intelligent machines. If you're driven by ambition, success, fun, and learning, Nexaminds is where you belong.

Nexaminds is seeking a proactive, detail-oriented Jr Site Reliability Engineer to join our team. This person will play a key role in monitoring our infrastructure and applications, identifying potential incidents or outages, and coordinating with internal teams to ensure a fast and effective response.

This role is ideal for someone who is highly organized, is comfortable working with tools like Slack, Opsgenie, and Datadog, and has the strong communication skills needed to triage, document, and escalate issues.

Qualifications we are looking for:

2+ years of experience in a NOC, SRE, DevOps Engineer, or related role.
Proficiency in AWS or another cloud platform.
Hands-on experience with Datadog or other tools for monitoring and observability.
Proficiency in scripting languages (Python, Bash, etc.).
Experience with CI/CD tools, e.g., Jenkins (preferred), GitHub Actions, or Azure DevOps.
Hands-on experience with containers.
Strong problem-solving skills and a proactive mindset.
Experience working in a technical support, NOC, or L1 operations role.

Preferred Qualifications:

Experience with AWS services (e.g., EC2, EKS, VPC, IAM).
Experience with Kubernetes.
Familiarity with monitoring concepts (e.g., uptime, latency, alerts, dashboards).
Experience with other monitoring tools and logging systems.
Deep knowledge of SRE practices, including SLOs, SLIs, and KPIs, to measure reliability and performance across multiple platforms and business services.
Knowledge of networking and security best practices.
Basic experience creating, maintaining, and troubleshooting Terraform modules and deployments.

Job duties:

Monitoring and Alerting: Watch various dashboards, Slack channels, and alerting systems (Datadog, Opsgenie, etc.) for signs of issues or system degradation.
Triage & Escalation: Identify potential incidents, gather basic context, and escalate to the appropriate team(s) based on runbooks or ownership documentation.
Communication: Act as a first point of contact for internal stakeholders during incidents, ensuring clear and timely communication.
Incident Coordination: Open and manage incident reports, track issue progress, and support during follow-ups or postmortems.
Runbook Execution and Critical Analysis: Follow predefined runbook steps to validate alerts and collect information before escalation, while also applying critical analysis and debugging skills to independently identify and troubleshoot new, undocumented issues.
Continuous Improvement: Suggest and implement improvements to monitoring dashboards, runbooks, and alerting thresholds based on patterns seen.
SLO/SLI/KPI Implementation: Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure system reliability. Establish Key Performance Indicators (KPIs).
Collaboration: Work closely with development, QA, and SRE teams to ensure smooth delivery and high availability.
Documentation: Create and maintain technical documentation for infrastructure, processes, and best practices.
Solid understanding of AWS and Netlify services.
Able to read, maintain, and modify Terraform scripts and repositories to support business needs (e.g., incident action items like monitoring improvements).
Experience with Datadog: troubleshooting APM traces, host metrics, and log queries.
Skilled in building dashboards, configuring monitoring metrics, collaborating with business stakeholders, and reviewing SLA/SLI principles.
Familiar with production support processes, including incident management, CAB, and change requests.
Capable of leading P3–P5 incidents and driving RCA to closure.
Strong team player with an open mindset and excellent communication skills (this is a must): someone who can clearly convey business requirements to both IPSY and Nexaminds stakeholders.
Be online and responsive, have a clear understanding of what's going on, engage the right people, and keep the margin of error low.

What you can expect from us:

Here at Nexaminds, we're not your typical workplace. We're all about creating a friendly and trusting environment where you can thrive. Why does this matter? Well, trust and openness lead to better quality, innovation, commitment to getting the job done, efficiency, and cost-effectiveness.

Stock options 📈
Remote work options 🏠
Flexible working hours 🕜
Benefits above the legal minimum
But it's not just about the work; it's about the people too. You'll be collaborating with some seriously awesome IT pros.
You'll have access to mentorship and tons of opportunities to learn and level up.

Ready to embark on this journey with us? 🚀🎉 If you're feeling the excitement, go ahead and apply!
