JustinBradley’s client, a leading source of mortgage financing, is seeking a highly experienced Site Reliability Engineer (SRE) to design, implement, and maintain secure, scalable, and resilient cloud infrastructure. The ideal candidate brings a deep understanding of cloud platforms, DevOps practices, and software development methodologies to ensure operational excellence, high availability, and system reliability.
Responsibilities:
- Cloud Infrastructure Management: Design, build, and maintain secure, scalable, and resilient cloud-based infrastructure on AWS, Azure, or GCP, with a strong focus on AWS (ECS, Lambda, RDS) for production environments.
- Automation and Optimization: Automate deployments and configurations using Infrastructure-as-Code tools such as Terraform, CloudFormation, and Ansible. Drive automation for anomaly detection, self-healing systems, and recovery workflows to reduce operational toil and optimize cloud costs.
- CI/CD Pipeline Development: Develop and manage CI/CD pipelines using tools such as Jenkins, GitLab, SonarQube, Docker, and Nexus/Artifactory to streamline deployment processes.
- DevSecOps Implementation: Implement DevSecOps best practices, including IAM roles, RBAC, SAST/DAST/SCA tooling, and vulnerability remediation, ensuring secure cloud operations.
- Observability & Monitoring: Build observability solutions with monitoring, logging, and tracing tools like AWS CloudWatch, Splunk, Dynatrace, and OpenTelemetry to ensure the reliability and performance of cloud applications.
- Reliability Metrics: Define and track reliability metrics such as SLOs, SLIs, error budgets, MTTR, and MTTD to ensure operational excellence and system uptime.
- Microservices & Serverless Architecture: Architect and support microservices, serverless applications, and RESTful APIs with resilience patterns like Circuit Breaker, Retry, and Timeout.
- Chaos Engineering: Conduct chaos engineering experiments using AWS FIS, Chaos Toolkit, and AWS Resilience Hub to improve system resilience and reliability under failure conditions.
- Database Management: Manage and optimize various databases including PostgreSQL, MongoDB, DynamoDB, Oracle, and Redshift, ensuring high availability and performance.
- Incident Management & Support: Lead production support efforts, including incident response, problem management, and runbook creation as part of on-call rotations, ensuring swift resolution of critical production issues.
- Cross-Functional Collaboration: Collaborate closely with cross-functional teams to embed shift-left testing strategies (e.g., BDD, TDD, unit testing, regression testing) into the development pipeline.
- Documentation & Knowledge Sharing: Maintain architecture documentation, disaster recovery plans, and internal knowledge articles, fostering a culture of knowledge sharing and continuous improvement.
Requirements:
- 8+ years of experience in Site Reliability Engineering or a related field, with a proven track record of leading complex projects and ensuring system reliability at scale.
- Expertise in cloud platforms (AWS, Azure, or GCP), with hands-on experience in container orchestration, microservices, and serverless architectures.
- Proficiency in scripting/programming languages such as Python, Java, Bash, Node.js, and PowerShell for automation and system integration.
- Extensive experience with DevOps and observability tools (e.g., Jenkins, Docker, Splunk, Dynatrace, OpenTelemetry) to support a seamless development and monitoring workflow.
- Deep knowledge of database management, including PostgreSQL, MongoDB, DynamoDB, Oracle, and Redshift, with the ability to optimize for performance and availability.
- Familiarity with event-driven architecture, distributed systems, and the integration of AI/ML frameworks and services into the infrastructure.
- Strong understanding of security best practices and compliance frameworks, plus experience in incident management.
- Hands-on experience with chaos engineering, resiliency testing, and performance testing (e.g., JMeter, LoadRunner) to ensure systems are fault-tolerant and performant under load.
- Excellent communication and collaboration skills, with experience leading cross-functional teams and influencing stakeholders to implement best practices.
- AWS Solutions Architect or related cloud certification; Agile Certified Practitioner (ACP) is a plus.
- Experience with developer tools such as AWS CLI, Postman, and curl.
- Experience with production support and on-call rotations (flexible hours, typically twice a month).
JustinBradley is an EO employer – Veterans/Disabled and other protected employees.