Staff Site Reliability Engineer, Storage
Company: Crusoe Energy Systems LLC
Location: San Francisco
Posted on: May 18, 2025
Job Description:
Crusoe is building the World's Favorite AI-first Cloud
infrastructure company. We're pioneering vertically integrated,
purpose-built AI infrastructure solutions trusted by Fortune 500
companies to power their most advanced AI applications. Crusoe is
redefining AI cloud infrastructure, with a mission to align the
future of computing with the future of the climate. Our AI platform
is recognized as the "gold standard" for reliability and
performance. Our data centers are optimized for AI workloads and
are powered by clean, renewable energy. Be part of the AI revolution
with sustainable technology at Crusoe. Here, you'll drive
meaningful innovation, make a tangible impact, and join a team
that's setting the pace for responsible, transformative cloud
infrastructure.
About This Role:
At Crusoe Energy Systems, our Site
Reliability Engineering (SRE) team plays a mission-critical role in
maintaining the performance and reliability of our AI-optimized
cloud infrastructure. The Storage-focused SRE role is responsible
for ensuring the availability, performance, and scalability of
Crusoe's cloud storage products and services, which power
compute-intensive, latency-sensitive workloads for AI and HPC use
cases. This role directly supports our vertically integrated,
sustainable cloud platform by building and optimizing distributed,
fault-tolerant storage systems at scale.
What You'll Be Working On:
In this role, you will build automation and self-healing tools
to monitor and maintain Crusoe's distributed cloud storage
infrastructure, which includes block, file, and object storage
systems. You will drive reliability initiatives focused on data
replication, encryption, backup and restore strategies, and robust
failover mechanisms. Collaborating closely with storage engineers,
you will help implement and maintain high-performance NVMe- and
SSD-backed volumes that support large-scale AI compute clusters.
Your responsibilities will also include supporting user-facing
storage services with a focus on availability, performance tuning,
and adherence to error budgets. You'll investigate and resolve
storage-related incidents using deep telemetry, logs, and
performance profiling, while also partnering with hardware and
kernel teams to diagnose low-level I/O issues and optimize I/O
paths, cache policies, and file systems. Additionally, you will
contribute to the architecture of fault-tolerant, scalable storage
backends tailored for AI-first cloud environments.
What You'll Bring to the Team: