The site reliability engineer (SRE) plays a key role in maintaining and improving the operational efficiency of software systems. This role focuses on deployment automation and system optimization to ensure consistent performance and reliability.

The ideal candidate will have robust problem-solving skills and a strong desire to implement scalable and sustainable technological solutions. Some projects this role will work on include:

· Infrastructure scalability projects: Designing and implementing scalable, highly available system architectures to handle increasing loads and user demands without compromising performance.

· Continuous integration/continuous deployment (CI/CD) pipelines: Creating and optimizing CI/CD pipelines to automate testing and deployment processes, reducing the time from development to production and ensuring consistent quality control.

· Disaster recovery planning: Developing and testing disaster recovery plans to guarantee data integrity, system resilience, and swift restoration of services in case of critical incidents.

Objectives of this role:

· Run the production environment by monitoring availability and taking a holistic view of system health

· Build software and systems to manage platform infrastructure and applications

· Improve reliability, quality, and time-to-market of our suite of software solutions

· Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement

· Provide primary operational support and engineering for multiple large-scale distributed software applications

Responsibilities:

· Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding

· Partner with development teams to improve services through rigorous testing and release procedures

· Participate in system design consulting, platform management, and capacity planning

· Create sustainable systems and services through automation and uplifts

· Balance feature development speed and reliability with well-defined service-level objectives

· Monitor system performance, identify bottlenecks, and optimize pipelines

· Implement comprehensive service metrics to track and report on system reliability, performance, and efficiency

· Develop and maintain CI/CD pipelines to improve the consistency and speed of software deployment

· Automate routine tasks and create tools to improve team efficiency and system robustness

· Collaborate with development teams to integrate operational considerations into the software development life cycle

· Conduct post-incident reviews to prevent recurrence and refine the system reliability framework

· Contribute to disaster recovery plans and ensure robust backup systems are in place

· Develop and provide operational support for full-stack software applications.

· Collaborate with development operations staff to create, monitor, and troubleshoot the system infrastructure.

· Increase system resilience and serve larger customer volumes through expert-level coding and rigorous release and change management.

· Improve automation and increase the system’s self-healing capability.

· Collect operating system data and report performance metrics to stakeholders.

· Manage cloud and database system maintenance, debugging production issues as they arise.

· Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website or application.

· Collaborate with cross-functional teams to define and establish service-level objectives (SLOs) and service-level agreements (SLAs) for critical systems.

· Monitor systems and applications, proactively identifying and resolving any performance bottlenecks or availability issues.

· Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance.

· Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents.

· Create and maintain documentation for system architecture, configuration, and troubleshooting procedures.

· Perform capacity planning and resource allocation to ensure optimal system performance and scalability.

· Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards.

· Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering.

Required skills and qualifications:

· Bachelor’s degree (or equivalent) in computer science or related discipline

· Ability to program (structured and OOP) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript

· Experience with distributed storage technologies such as NFS, HDFS, Ceph, and Amazon S3, as well as dynamic resource management frameworks (Apache Mesos, Kubernetes, YARN)

· Familiarity with DevOps culture and practices and experience with CI/CD toolchains

· Industry certifications in cloud services, networking, or systems administration

· Proactive approach to identifying problems, performance bottlenecks, and areas for improvement

Soft skills:

· Communication: Articulate complex technical issues and solutions to technical and non-technical team members

· Problem-solving: Analyze challenges and implement effective, long-term solutions under pressure

· Adaptability: Adjust to evolving technologies and changing organizational needs

Hard skills:

· Systems architecture: In-depth knowledge of system design and experience with scalable and reliable infrastructure

· Networking and security: Understanding of network protocols, security best practices, and ability to implement secure and robust solutions

· Cloud platforms: Competence in using cloud services such as AWS, GCP, or Azure for deploying, scaling, and managing applications and infrastructure

Technical skills:

· Scripting and coding: Proficiency in scripting languages like Python or Bash and coding with languages like Go or Java

· Containerization and orchestration: Familiarity with Docker and Kubernetes for container management and deployment

· Networking fundamentals: Understanding network protocols, load balancing, and firewall management for secure and efficient network operations

· Strong knowledge of Linux/Unix systems and command line tools.

· Proficiency in scripting languages such as Python, Shell, or Perl.

· Experience with configuration management tools like Ansible, Puppet, or Chef.

· Familiarity with cloud platforms like AWS, Azure, or Google Cloud.

· Understanding of networking principles and protocols (TCP/IP, HTTP, DNS, etc.).

· Knowledge of containerization technologies (Docker, Kubernetes) and orchestration tools.

· Expertise in monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Splunk.

· Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues.

· Excellent communication and collaboration skills to work effectively with cross-functional teams.

· Strong attention to detail and ability to work in a fast-paced, dynamic environment.
