JustinBradley’s client, a leading source of mortgage financing, is seeking a highly experienced Site Reliability Engineer (SRE) to design, implement, and maintain secure, scalable, and resilient cloud infrastructure. The ideal candidate brings a deep understanding of cloud platforms, DevOps practices, and software development methodologies to ensure operational excellence, high availability, and system reliability.
Responsibilities:
Design, build, and manage cloud-based infrastructure on AWS, Azure, or GCP.
Automate deployments and configurations using Infrastructure-as-Code tools like Terraform, CloudFormation, and Ansible.
Create automation for anomaly detection, self-healing systems, and recovery workflows to reduce toil and optimize cloud costs.
Develop and manage CI/CD pipelines using tools such as Jenkins, GitLab, SonarQube, Docker, and Nexus/Artifactory.
Implement DevSecOps best practices, including IAM roles, RBAC, SAST/DAST/SCA tooling, and vulnerability remediation.
Build observability solutions with monitoring, logging, and tracing tools like AWS CloudWatch, Splunk, SignalFx, Dynatrace, and OpenTelemetry.
Define and track reliability metrics including SLOs, SLIs, error budgets, MTTR, and MTTD.
Architect and support microservices, serverless applications, and RESTful APIs with resilience patterns like Circuit Breaker, Retry, and Timeout.
Conduct chaos engineering experiments using AWS FIS and Chaos Toolkit, and perform resiliency testing via AWS Resilience Hub.
Manage and optimize various databases including PostgreSQL, MongoDB, DynamoDB, Oracle, and Redshift.
Support production systems including incident response, problem management, and runbook creation as part of on-call rotations.
Collaborate with cross-functional teams to embed shift-left testing strategies (e.g., BDD, TDD, unit and regression testing).
Maintain architecture documentation, disaster recovery plans, and internal knowledge articles.
Requirements:
8+ years of experience in site reliability engineering or a related field with demonstrated leadership in complex projects.
Strong expertise in cloud platforms (AWS, Azure, or GCP), container orchestration, and infrastructure automation.
Proficiency in scripting/programming languages such as Python, Java, Bash, Node.js, and PowerShell.
Experience with DevOps and observability tools (e.g., Jenkins, Docker, Splunk, Dynatrace, OpenTelemetry).
Deep knowledge of databases including PostgreSQL, MongoDB, DynamoDB, Oracle, and Redshift.
Familiarity with event-driven architecture, distributed systems, and AI/ML integrations.
Strong understanding of security best practices, compliance frameworks, and incident management.
Hands-on experience with chaos engineering, resiliency assessments, and performance testing (e.g., JMeter, LoadRunner).
Excellent communication and collaboration skills.
AWS Solutions Architect or related cloud certification; Agile Certified Practitioner (ACP) a plus.
Experience with AI/ML frameworks such as spaCy, Transformers, and SciPy, and with tools like SageMaker and GenAI.
Familiarity with project management and ITSM tools (e.g., JIRA, Confluence, ServiceNow).
Experience with utilities and developer tools like AWS CLI, Postman, and curl.
JustinBradley is an EO employer – Veterans/Disabled and other protected employees.