Site Reliability Engineer (SRE)

JustinBradley

Reston, VA

JustinBradley’s client, a leading source of mortgage financing, is seeking a highly experienced Site Reliability Engineer (SRE) to design, implement, and maintain secure, scalable, and resilient cloud infrastructure. The ideal candidate brings a deep understanding of cloud platforms, DevOps practices, and software development methodologies to ensure operational excellence, high availability, and system reliability.


Responsibilities:

  • Design, build, and manage cloud-based infrastructure on AWS, Azure, or GCP.
  • Automate deployments and configurations using Infrastructure-as-Code tools like Terraform, CloudFormation, and Ansible.
  • Create automation for anomaly detection, self-healing systems, and recovery workflows to reduce toil and optimize cloud costs.
  • Develop and manage CI/CD pipelines using tools such as Jenkins, GitLab, SonarQube, Docker, and Nexus/Artifactory.
  • Implement DevSecOps best practices, including IAM roles, RBAC, SAST/DAST/SCA tooling, and vulnerability remediation.
  • Build observability solutions with monitoring, logging, and tracing tools like AWS CloudWatch, Splunk, SignalFx, Dynatrace, and OpenTelemetry.
  • Define and track reliability metrics including SLOs, SLIs, error budgets, MTTR, and MTTD.
  • Architect and support microservices, serverless applications, and RESTful APIs with resilience patterns like Circuit Breaker, Retry, and Timeout.
  • Conduct chaos engineering experiments using AWS FIS and Chaos Toolkit, and perform resiliency testing via AWS Resilience Hub.
  • Manage and optimize various databases including PostgreSQL, MongoDB, DynamoDB, Oracle, and Redshift.
  • Support production systems including incident response, problem management, and runbook creation as part of on-call rotations.
  • Collaborate with cross-functional teams to embed shift-left testing strategies (e.g., BDD, TDD, unit, regression).
  • Maintain architecture documentation, disaster recovery plans, and internal knowledge articles.


Requirements:

  • 8+ years of experience in site reliability engineering or a related field with demonstrated leadership in complex projects.
  • Strong expertise in cloud platforms (AWS, Azure, or GCP), container orchestration, and infrastructure automation.
  • Proficiency in scripting/programming languages such as Python, Java, Bash, Node.js, and PowerShell.
  • Experience with DevOps and observability tools (e.g., Jenkins, Docker, Splunk, Dynatrace, OpenTelemetry).
  • Deep knowledge of databases including PostgreSQL, MongoDB, DynamoDB, Oracle, and Redshift.
  • Familiarity with event-driven architecture, distributed systems, and AI/ML integrations.
  • Strong understanding of security best practices, compliance frameworks, and incident management.
  • Hands-on experience with chaos engineering, resiliency assessments, and performance testing (e.g., JMeter, LoadRunner).
  • Excellent communication and collaboration skills.
  • AWS Solutions Architect or related cloud certification; Agile Certified Practitioner (ACP) a plus.
  • Experience with AI/ML frameworks such as spaCy, Transformers, and SciPy, and with tools like SageMaker and GenAI.
  • Familiarity with project management and ITSM tools (e.g., JIRA, Confluence, ServiceNow).
  • Experience with utilities and developer tools like AWS CLI, Postman, and curl.


JustinBradley is an EO employer – Veterans/Disabled and other protected categories.
