Staffing Lab is looking for a Senior Site Reliability Engineer for one of its clients. This position could be contract-to-hire or full-time. No C2C, recruiters, or third parties.
We are seeking a Senior Site Reliability Engineer (SRE) with deep expertise in AWS networking, infrastructure automation, and production system reliability. This role demands a strong grasp of observability, operational excellence, and the ability to drive the adoption of DevOps/SRE best practices across engineering teams. You will be instrumental in shaping SLIs/SLOs, defining our DevOps maturity roadmap, and building robust, scalable infrastructure using Terraform, Lambda, Step Functions, and more.
You’ll be leading a team of SREs and collaborating closely with DevOps, Security, and Application teams to ensure reliable delivery and availability of services.
Key Responsibilities:
- Lead and mentor a team of SREs in developing scalable infrastructure and operational processes.
- Design and implement SLIs, SLOs, and error budgets for critical services, and evangelize them across product teams.
- Architect and manage AWS networking environments including VPCs, Transit Gateways, Route 53, VPNs, NACLs, and Security Groups.
- Manage and monitor Palo Alto and FortiGate firewalls, and integrate them with cloud environments for hybrid network visibility.
- Define and evolve DevOps maturity models, guiding teams toward higher automation and reliability.
- Build and manage observability dashboards using Grafana, CloudWatch, and Datadog to track application and infrastructure health.
- Implement and maintain Infrastructure as Code (IaC) using Terraform to automate cloud deployments across environments.
- Develop and maintain serverless applications using AWS Lambda and Step Functions to support platform automation and operations.
- Collaborate with developers to define GitLab CI/CD pipelines and streamline the build, test, and deployment lifecycle.
- Champion incident response, blameless postmortems, and continuous improvement initiatives.
- Write scripts in Python or Bash to automate tasks and integrate systems.
Required Qualifications:
- 7+ years in SRE, DevOps, or Systems Engineering roles with increasing responsibility.
- Proven experience managing AWS production environments with a focus on networking.
- In-depth knowledge of Palo Alto and/or FortiGate firewall management and troubleshooting.
- Expertise in monitoring and observability tools, including Grafana and Datadog.
- Hands-on experience with Terraform in managing cloud infrastructure at scale.
- Experience building and deploying serverless architectures using Lambda and Step Functions.
- Demonstrated understanding of SLI/SLO design, error budgets, and reliability metrics.
- Strong understanding of CI/CD principles and tools such as GitLab CI/CD.
- Proficiency in scripting using Python or Bash.
Preferred Qualifications:
- AWS certifications (e.g., Solutions Architect, Advanced Networking, DevOps Engineer).
- Familiarity with DevOps/SRE maturity models and implementing organizational transformation.
- Experience with compliance frameworks (SOC 2, ISO 27001, etc.) as they pertain to infrastructure reliability.
- Familiarity with container orchestration is a plus.
Soft Skills:
- Strong leadership and mentoring capabilities.
- Ability to translate complex technical problems into actionable initiatives.
- Excellent communication and cross-functional collaboration skills.
- Bias for automation and continuous improvement.