Site Reliability Engineer (SRE)

JustinBradley

Reston, VA

JustinBradley’s client, a leading source of mortgage financing, is seeking a highly experienced Site Reliability Engineer (SRE) to design, implement, and maintain secure, scalable, and resilient cloud infrastructure. The ideal candidate brings a deep understanding of cloud platforms, DevOps practices, and software development methodologies to ensure operational excellence, high availability, and system reliability.


Responsibilities:

  • Cloud Infrastructure Management: Design, build, and maintain secure, scalable, and resilient cloud-based infrastructure on AWS, Azure, or GCP, with a strong focus on AWS (ECS, Lambda, RDS) for production environments.
  • Automation and Optimization: Automate deployments and configurations using Infrastructure-as-Code tools like Terraform, CloudFormation, and Ansible, while driving automation for anomaly detection, self-healing systems, and recovery workflows to reduce operational toil and optimize cloud costs.
  • CI/CD Pipeline Development: Develop and manage CI/CD pipelines using tools such as Jenkins, GitLab, SonarQube, Docker, and Nexus/Artifactory to streamline deployment processes.
  • DevSecOps Implementation: Implement DevSecOps best practices, including IAM roles, RBAC, SAST/DAST/SCA tooling, and vulnerability remediation, ensuring secure cloud operations.
  • Observability & Monitoring: Build observability solutions with monitoring, logging, and tracing tools like AWS CloudWatch, Splunk, Dynatrace, and OpenTelemetry to ensure the reliability and performance of cloud applications.
  • Reliability Metrics: Define and track reliability metrics such as SLOs, SLIs, error budgets, MTTR, and MTTD to ensure operational excellence and system uptime.
  • Microservices & Serverless Architecture: Architect and support microservices, serverless applications, and RESTful APIs with resilience patterns like Circuit Breaker, Retry, and Timeout.
  • Chaos Engineering: Conduct chaos engineering experiments using AWS FIS, Chaos Toolkit, and AWS Resilience Hub to improve system resilience and reliability under failure conditions.
  • Database Management: Manage and optimize various databases including PostgreSQL, MongoDB, DynamoDB, Oracle, and Redshift, ensuring high availability and performance.
  • Incident Management & Support: Lead production support efforts, including incident response, problem management, and runbook creation as part of on-call rotations, ensuring swift resolution of critical production issues.
  • Cross-Functional Collaboration: Collaborate closely with cross-functional teams to embed shift-left testing strategies (e.g., BDD, TDD, unit testing, regression testing) into the development pipeline.
  • Documentation & Knowledge Sharing: Maintain architecture documentation, disaster recovery plans, and internal knowledge articles, fostering a culture of knowledge sharing and continuous improvement.


Requirements:

  • 8+ years of experience in Site Reliability Engineering or a related field, with a proven track record of leading complex projects and ensuring system reliability at scale.
  • Expertise in Cloud Platforms (AWS, Azure, or GCP), with hands-on experience in container orchestration, microservices, and serverless architectures.
  • Proficiency in scripting/programming languages such as Python, Java, Bash, Node.js, and PowerShell for automation and system integration.
  • Extensive experience with DevOps and observability tools (e.g., Jenkins, Docker, Splunk, Dynatrace, OpenTelemetry) to support a seamless development and monitoring workflow.
  • Deep knowledge of database management, including PostgreSQL, MongoDB, DynamoDB, Oracle, and Redshift, with the ability to optimize for performance and availability.
  • Familiarity with event-driven architecture, distributed systems, and the integration of AI/ML frameworks and services into the infrastructure.
  • Strong understanding of security best practices and compliance frameworks, along with experience in incident management.
  • Hands-on experience with chaos engineering, resiliency testing, and performance testing (e.g., JMeter, LoadRunner) to ensure systems are fault-tolerant and performant under load.
  • Excellent communication and collaboration skills, with experience leading cross-functional teams and influencing stakeholders to implement best practices.
  • AWS Solutions Architect or related cloud certification; Agile Certified Practitioner (ACP) is a plus.
  • Experience with developer tools such as AWS CLI, Postman, and curl.
  • Experience with production support and on-call rotations (flexible hours, typically twice a month).


JustinBradley is an EO employer – Veterans/Disabled and other protected employees.
