Principal Engineer - High-Performance AI Infrastructure
Company: Diversity Talent Scouts
Location: San Jose
Posted on: February 15, 2026
Job Description:
As a Principal Engineer for HPC and AI Infrastructure, you’ll take a lead role in designing the low-level systems that maximize GPU utilization across large, mission-critical workloads. Working within our GPU Runtime & Systems team, you’ll focus on device drivers, kernel-level optimizations, and runtime performance to ensure GPU clusters deliver the highest throughput, lowest latency, and greatest reliability possible. Your work will directly accelerate workloads across deep learning, high-performance computing, and real-time simulation. This position sits at the intersection of systems programming, GPU architecture, and HPC-scale computing: a unique opportunity to shape infrastructure used by developers and enterprises worldwide.

Key Responsibilities:
- Build and optimize device drivers and runtime components for GPUs and high-speed interconnects.
- Collaborate with kernel and platform teams to design efficient memory pathways (pinned memory, peer-to-peer, unified memory); a pinned-memory sketch follows this list.
- Improve data transfers across NVLink, InfiniBand, PCIe, and RDMA to reduce latency and boost throughput.
- Enhance GPU memory operations with NUMA-aware strategies and hardware-coherent optimizations.
- Implement telemetry and observability tools to monitor GPU performance with minimal runtime overhead; a telemetry sketch also follows this list.
- Contribute to internal debugging/profiling tools for GPU workloads.
- Mentor engineers on best practices for GPU systems development and participate in peer design/code reviews.
- Stay ahead of evolving GPU and interconnect architectures to influence future infrastructure design.
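To give a flavor of the memory-pathway work above, here is a minimal sketch of staging data through pinned (page-locked) host memory so a host-to-device copy can run asynchronously on a CUDA stream. This is illustrative only, not code from this role: the CUDA_CHECK macro is a hypothetical error-check helper, and the real work on the stream is elided.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical error-check helper; a standard CUDA pattern.
#define CUDA_CHECK(call)                                            \
  do {                                                              \
    cudaError_t err_ = (call);                                      \
    if (err_ != cudaSuccess) {                                      \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
              cudaGetErrorString(err_), __FILE__, __LINE__);        \
      return 1;                                                     \
    }                                                               \
  } while (0)

int main() {
  const size_t n = 1 << 20;
  const size_t bytes = n * sizeof(float);

  // Pinned (page-locked) host buffer: the DMA engine can read it
  // directly, so the copy below can be truly asynchronous.
  float* h_buf = nullptr;
  CUDA_CHECK(cudaMallocHost((void**)&h_buf, bytes));
  for (size_t i = 0; i < n; ++i) h_buf[i] = 1.0f;

  float* d_buf = nullptr;
  CUDA_CHECK(cudaMalloc((void**)&d_buf, bytes));

  // A dedicated stream lets the transfer overlap with other work.
  cudaStream_t stream;
  CUDA_CHECK(cudaStreamCreate(&stream));
  CUDA_CHECK(cudaMemcpyAsync(d_buf, h_buf, bytes,
                             cudaMemcpyHostToDevice, stream));
  // ... kernels that consume d_buf would be enqueued here ...
  CUDA_CHECK(cudaStreamSynchronize(stream));

  CUDA_CHECK(cudaStreamDestroy(stream));
  CUDA_CHECK(cudaFree(d_buf));
  CUDA_CHECK(cudaFreeHost(h_buf));
  return 0;
}
```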
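And for the telemetry responsibility, a minimal sketch of low-overhead GPU monitoring. The posting does not name specific tooling; NVML is assumed here purely for illustration (link with -lnvidia-ml).

```cuda
#include <cstdio>
#include <nvml.h>

int main() {
  if (nvmlInit_v2() != NVML_SUCCESS) {
    fprintf(stderr, "failed to initialize NVML\n");
    return 1;
  }

  unsigned int count = 0;
  nvmlDeviceGetCount_v2(&count);

  for (unsigned int i = 0; i < count; ++i) {
    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(i, &dev) != NVML_SUCCESS) continue;

    char name[NVML_DEVICE_NAME_BUFFER_SIZE];
    nvmlDeviceGetName(dev, name, sizeof(name));

    // GPU and memory utilization over the last sampling window.
    nvmlUtilization_t util;
    nvmlDeviceGetUtilizationRates(dev, &util);

    nvmlMemory_t mem;
    nvmlDeviceGetMemoryInfo(dev, &mem);

    printf("GPU %u (%s): %u%% busy, %llu/%llu MiB used\n",
           i, name, util.gpu,
           (unsigned long long)(mem.used >> 20),
           (unsigned long long)(mem.total >> 20));
  }

  nvmlShutdown();
  return 0;
}
```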
Minimum Qualifications:
- Bachelor’s degree in a technical field (STEM), with 10 years of experience in systems programming, including 5 years in GPU runtime or driver development.
- Experience developing kernel-space modules or runtime libraries (CUDA, ROCm, OpenCL).
- Deep familiarity with NVIDIA GPUs, CUDA toolchains, and profiling tools (Nsight, CUPTI, etc.).
- Proven ability to optimize workloads across NVLink, PCIe, Unified Memory, and NUMA systems; a peer-to-peer sketch follows this list.
- Hands-on background in RDMA, InfiniBand, GPUDirect, and related communication frameworks (UCX).
- Strong C/C++ programming skills with systems-level expertise (memory management, synchronization, cache coherency).
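As a concrete instance of NVLink/PCIe-level optimization, here is a minimal sketch of enabling peer-to-peer access between two GPUs so device-to-device copies bypass host staging. It assumes a two-GPU machine with a P2P-capable topology, and error handling is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  if (count < 2) { printf("need two GPUs for P2P\n"); return 0; }

  // Check whether each device can directly access the other's memory
  // (true over NVLink or a PCIe topology that supports P2P).
  int can01 = 0, can10 = 0;
  cudaDeviceCanAccessPeer(&can01, 0, 1);
  cudaDeviceCanAccessPeer(&can10, 1, 0);
  if (!can01 || !can10) { printf("P2P unsupported\n"); return 0; }

  const size_t bytes = 1 << 20;
  cudaSetDevice(0);
  cudaDeviceEnablePeerAccess(1, 0);  // flags must be 0
  float* d0 = nullptr;
  cudaMalloc((void**)&d0, bytes);

  cudaSetDevice(1);
  cudaDeviceEnablePeerAccess(0, 0);
  float* d1 = nullptr;
  cudaMalloc((void**)&d1, bytes);

  // Direct device-to-device copy; with peer access enabled this
  // avoids staging through host memory.
  cudaMemcpyPeer(d1, 1, d0, 0, bytes);
  cudaDeviceSynchronize();

  printf("peer-to-peer copy complete\n");
  cudaFree(d1);
  cudaSetDevice(0);
  cudaFree(d0);
  return 0;
}
```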
Preferred Qualifications:
- Expertise in HPC workload optimization and GPU compute/memory tradeoffs.
- Knowledge of pinned memory, peer-to-peer transfers, zero-copy, and GPU memory lifetimes; a unified-memory sketch follows this list.
- Strong grasp of multithreaded and asynchronous programming patterns.
- Familiarity with AI frameworks (PyTorch, TensorFlow) and Python scripting.
- Understanding of low-level CUDA/PTX assembly for debugging or performance tuning.
- Experience with storage offloads (NVMe, IOAT, DPDK) or DMA-based acceleration.
- Proficiency with system profiling/debugging tools (Valgrind, cuda-memcheck, gdb, Nsight Compute/Systems, perf, eBPF).
- An advanced degree (PhD) with research in GPU systems, compilers, or HPC is a plus.
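On GPU memory lifetimes: a minimal sketch of unified (managed) memory with explicit prefetching, one common way to control where pages live across the host/device boundary. It assumes a single Pascal-or-newer GPU; the sizes and the kernel are illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, size_t n, float s) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= s;
}

int main() {
  const size_t n = 1 << 20;
  const size_t bytes = n * sizeof(float);

  // Unified (managed) memory: one pointer valid on host and device.
  float* data = nullptr;
  cudaMallocManaged((void**)&data, bytes);
  for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

  int dev = 0;
  cudaGetDevice(&dev);

  // Prefetch pages to the GPU before launch to avoid on-demand
  // page faults during kernel execution.
  cudaMemPrefetchAsync(data, bytes, dev);
  scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
  cudaDeviceSynchronize();

  // Prefetch back to the host before CPU access.
  cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId);
  cudaDeviceSynchronize();

  printf("data[0] = %f\n", data[0]);
  cudaFree(data);
  return 0;
}
```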