Senior Site Reliability Engineer

Unisys

Reston, VA

Overall years of experience:

  • 8+ years of related experience in the relevant area, including experience leading teams on projects of similar scope and complexity.
  • Bachelor’s or Master’s degree in computer science or equivalent.
  • Certifications: AWS Solutions Architect, Agile Certified Practitioner (ACP), or relevant cloud certifications.


Job Description:

  • We are seeking a highly skilled and experienced Site Reliability Engineer (SRE) to join our team.
  • The ideal candidate will have a strong background in AWS cloud platforms, DevOps practices, and modern software development frameworks.
  • The Site Reliability Engineer (SRE) will play a critical role in designing, building, and maintaining highly scalable, fault-tolerant, and secure cloud infrastructure while ensuring operational excellence, high availability, and reliability.
  • Programming/development knowledge in either Java or Python is needed; a good understanding of one or both is ideal.


Key Responsibilities:

1. Cloud Infrastructure & Automation:

  • Design, implement, and manage cloud-based infrastructure on AWS.
  • Utilize Infrastructure-as-Code (IaC) tools such as Terraform, CloudFormation, and Ansible to automate deployments and configurations.
  • Build robust automation for anomaly detection, toil reduction, recovery processes, and self-healing mechanisms, and optimize cloud costs.


2. DevSecOps & CI/CD:

  • Apply a deep understanding of DevSecOps principles to build and maintain CI/CD pipelines using tools like GitLab, Jenkins, SonarQube, Nexus/Artifactory, and Docker.
  • Implement security best practices, including IAM roles, RBAC, vulnerability remediation, and SAST/DAST/SCA tools.


3. Observability & Incident Management:

  • Design and implement monitoring, logging, and distributed tracing solutions using tools like AWS CloudWatch, Splunk/SignalFx, Dynatrace, and OpenTelemetry.
  • Lead root cause analysis, blameless postmortems, and proactive incident management to minimize MTTR and MTTD.
  • Define and monitor SLOs, SLIs, and error budgets to ensure system reliability.


4. Microservices & API Management:

  • Architect and manage microservices, serverless computing, and RESTful APIs.
  • Ensure fault tolerance and resilience using design patterns like Circuit Breaker, Retry, Timeout, and Bulkhead.


5. Chaos Engineering & Resiliency:

  • Conduct chaos engineering experiments using tools like AWS FIS and Chaos Toolkit.
  • Perform resiliency assessments using Resilience Hub and implement self-healing solutions.


6. Database & Application Support:

  • Manage and optimize database technologies such as PostgreSQL, MongoDB, DynamoDB, Oracle, and Redshift.
  • Provide production support, including incident response, problem management, and runbook creation. Participate in on-call rotations.


7. Collaboration & Communication:

  • Collaborate with cross-functional teams to implement shift-left testing practices (BDD, TDD, Unit, Regression).
  • Create and maintain architecture diagrams, knowledge articles, and disaster recovery plans.
  • Communicate effectively with stakeholders and demonstrate strong relationship management skills.


Required Skills & Qualifications:

  • Expertise in cloud platforms (AWS) and container orchestration.
  • Proficiency in programming/scripting languages such as Python, Java, Node.js, Bash, and PowerShell.
  • Strong knowledge of database technologies (e.g., PostgreSQL, MongoDB, DynamoDB, Oracle, Redshift).
  • Experience with DevOps tools (Jenkins, Docker, Nexus/Artifactory) and build tools (Maven, Gradle).
  • Familiarity with AI/ML integrations, event-driven architectures, and distributed systems.
  • Expertise in observability, logging, and monitoring tools (AWS CloudWatch, Splunk, Dynatrace, OpenTelemetry).
  • Strong understanding of security practices, including IAM, RBAC, and vulnerability management.
  • Experience with chaos engineering, resiliency assessments, and disaster recovery planning.
  • Proficiency in performance testing tools (JMeter, LoadRunner) and capacity planning.
  • Excellent verbal and written communication skills, with the ability to collaborate across teams.


Preferred Qualifications:

  • Experience with AI/ML libraries (e.g., NLTK, Transformers, spaCy, SciPy), Amazon SageMaker, and GenAI tools.
  • Familiarity with project management tools like Jira, Confluence, and ServiceNow.
  • Knowledge of utilities like AWS CLI, Postman, and curl.

