JustinBradley’s client, a leading source of mortgage financing, is seeking a highly experienced Site Reliability Engineer (SRE) to design, implement, and maintain secure, scalable, and resilient cloud infrastructure. The ideal candidate brings a deep understanding of cloud platforms, DevOps practices, and software development methodologies to ensure operational excellence, high availability, and system reliability.

Responsibilities:

Cloud Infrastructure Management: Design, build, and maintain secure, scalable, and resilient cloud-based infrastructure on AWS, Azure, or GCP, with a strong focus on AWS (ECS, Lambda, RDS) for production environments.
Automation and Optimization: Automate deployments and configurations using Infrastructure-as-Code tools like Terraform, CloudFormation, and Ansible. Drive automation for anomaly detection, self-healing systems, and recovery workflows to reduce operational toil and optimize cloud costs.
CI/CD Pipeline Development: Develop and manage CI/CD pipelines using tools such as Jenkins, GitLab, SonarQube, Docker, and Nexus/Artifactory to streamline deployment processes.
DevSecOps Implementation: Implement DevSecOps best practices, including IAM roles, RBAC, SAST/DAST/SCA tooling, and vulnerability remediation, ensuring secure cloud operations.
Observability & Monitoring: Build observability solutions with monitoring, logging, and tracing tools like AWS CloudWatch, Splunk, Dynatrace, and OpenTelemetry to ensure the reliability and performance of cloud applications.
Reliability Metrics: Define and track reliability metrics such as SLOs, SLIs, error budgets, MTTR, and MTTD to ensure operational excellence and system uptime.
Microservices & Serverless Architecture: Architect and support microservices, serverless applications, and RESTful APIs with resilience patterns like Circuit Breaker, Retry, and Timeout.
Chaos Engineering: Conduct chaos engineering experiments using AWS FIS, Chaos Toolkit, and AWS Resilience Hub to improve system resilience and reliability under failure conditions.
Database Management: Manage and optimize various databases including PostgreSQL, MongoDB, DynamoDB, Oracle, and Redshift, ensuring high availability and performance.
Incident Management & Support: Lead production support efforts, including incident response, problem management, and runbook creation as part of on-call rotations, ensuring swift resolution of critical production issues.
Cross-Functional Collaboration: Collaborate closely with cross-functional teams to embed shift-left testing strategies (e.g., BDD, TDD, unit testing, regression testing) into the development pipeline.
Documentation & Knowledge Sharing: Maintain architecture documentation, disaster recovery plans, and internal knowledge articles, fostering a culture of knowledge sharing and continuous improvement.

Requirements:

8+ years of experience in Site Reliability Engineering or a related field, with a proven track record of leading complex projects and ensuring system reliability at scale.
Expertise in cloud platforms (AWS, Azure, or GCP), with hands-on experience in container orchestration, microservices, and serverless architectures.
Proficiency in scripting/programming languages such as Python, Java, Bash, Node.js, and PowerShell for automation and system integration.
Extensive experience with DevOps and observability tools (e.g., Jenkins, Docker, Splunk, Dynatrace, OpenTelemetry), ensuring a seamless development and monitoring workflow.
Deep knowledge of database management, including PostgreSQL, MongoDB, DynamoDB, Oracle, and Redshift, with the ability to optimize for performance and availability.
Familiarity with event-driven architecture, distributed systems, and integrating AI/ML frameworks and services into the infrastructure.
Strong understanding of security best practices, compliance frameworks, and experience in incident management.
Hands-on experience with chaos engineering, resiliency testing, and performance testing (e.g., JMeter, LoadRunner) to ensure systems are fault-tolerant and performant under load.
Excellent communication and collaboration skills, with experience leading cross-functional teams and influencing stakeholders to implement best practices.
AWS Solutions Architect or related cloud certification; Agile Certified Practitioner (ACP) is a plus.
Experience with developer tools such as AWS CLI, Postman, and curl.
Experience with production support and on-call rotations (flexible hours, typically twice a month).

JustinBradley is an EO employer – Veterans/Disabled and other protected employees.
