KEY RESPONSIBILITIES

The Site Reliability Engineer (SRE) will assist the SRE team in supporting the Royal Caribbean website, using application and user performance data to guide informed decision-making. The SRE will use site performance metrics collected from various sources and tools to support the following tasks: initial triage of critical production incidents, analysis of bugs, implementation of site reliability engineering best practices, optimization of infrastructure, and seamless collaboration between internal teams and external service providers, among other operational initiatives.

Critical Incident Support

Responsible for the initial response, triage, and communication of key (customer-impacting) production incidents that occur on the site.
Performs analysis of incident impact on the site to determine the root cause by reviewing performance data, including end-user experience, application metrics, and infrastructure metrics.
Supports product team initiatives and releases.
Synthesizes and communicates incident details to the production team and stakeholders, including executive-level stakeholders.
Documents incidents, performs postmortems, and defines next steps (as needed).

Monitor and Optimize Systems

Provides insight into application performance metrics (errors, exceptions, baseline violations, etc.) to identify technical impacts of bugs and enhancements.
Understands key performance metrics (traffic volumes, booking volumes, response times, etc.) to identify business value of bug fixes and enhancements.

Ensure System Reliability and Performance

Understands website operations at a high level to identify performance trends across business processes.
Performs daily governance of application monitoring software.

Collaboration with Cross-Functional Teams

Establish and maintain clear communication channels (e.g., Slack, Teams) with the scrum and marketing teams.

Experience

Minimum Years of Experience: 3-6 years in Site Reliability Engineering (SRE), DevOps, QA, or a related IT operations role.

Skills and Abilities

Technical Expertise:

Proficiency in cloud platforms and services such as AWS, including AWS Elastic Beanstalk.
Understanding of API design principles: REST, SOAP, and GraphQL.
Advanced knowledge of monitoring and logging tools (AppDynamics, Datadog, Splunk, New Relic, etc.).
Familiarity with Adobe AEM Cloud is preferred, as it supports work on system performance and reliability.

Problem-Solving Skills:

Strong analytical and troubleshooting skills to diagnose and resolve complex production issues swiftly.
Ability to develop and implement effective incident response plans.

Communication and Collaboration:

Excellent written and verbal communication skills for effective interaction with cross-functional teams and documentation.
Ability to collaborate with Development, QA, IT, and external managed service providers to ensure seamless operations.

Education

Bachelor’s Degree: In Computer Science, Information Technology, Engineering, or a related field.

Certifications

Preferred Certifications:
Any certification in a public cloud platform (Google Cloud, AWS, Azure) is required; foundational or intermediate-level credentials are acceptable.
Any certification, or equivalent, in monitoring and alerting tools.
Any certification or equivalent knowledge of IT service management.

Internal Relationships

Scrum Teams

Purpose of Interaction: Collaboration on technical requirements, code reviews, and sprint planning.
Frequency: Daily stand-ups, bi-weekly sprint planning and reviews, and as needed for specific technical discussions.

IT Teams, Platform Team, and eCom Operations Team

Purpose of Interaction: Ensuring successful deployment, monitoring, and maintenance of applications.
Frequency: Weekly coordination meetings, daily operational check-ins, and as needed for deployment schedules and incident management.

External Relationships

Managed Service Providers

Purpose of Interaction: Hold regular meetings with the managed service provider to ensure service quality, address issues, and align on strategic goals.
Frequency: Daily operational check-ins, weekly status meetings, and monthly reviews.
