At the intersection of machine learning and large-scale infrastructure, the SRE team for our Applied Machine Learning group is redefining how intelligent systems operate at global scale. We blend the principles of software engineering with systems reliability to keep our AI and recommendation systems resilient, high-performing, and ever-evolving.
As a Site Reliability Engineer on this team, you'll be hands-on with some of the most advanced AI technologies, helping architect, maintain, and scale machine learning platforms that serve millions, if not billions, of users. You'll also play a critical role in optimizing system performance, making hardware and capacity recommendations, and automating everything possible.
Ensure our ML systems run smoothly, efficiently, and reliably, no matter how complex or large they get.
Dive deep into the internals of distributed systems to identify and resolve bottlenecks before they become outages.
Contribute to, and lead, the automation of infrastructure, pipelines, and operational routines.
Collaborate with engineering and hardware teams on capacity planning, architecture choices, and performance tuning.
Deep knowledge of distributed systems and the experience to troubleshoot them with precision.
A Bachelor's or Master's in Computer Science or a closely related field focused on software development or systems engineering.
Solid programming chops in at least one of the following: Python, C/C++, or Go.
Strong foundation in algorithms, data structures, and computer science fundamentals.
Experience designing and operating high-scale, high-availability systems.
Passion for writing clean, optimized code and automating away manual tasks.
Prior SRE experience in large distributed production environments.