We’re hiring an HPC Platform Engineer to own the on-prem GPU and HPC platform as an integrated whole. Where our Network, Storage, and Systems engineers go deep in their domains, this role owns the layers that hold the platform together—provisioning, GPU orchestration, and the cross-domain integration work that turns a stack of components into a dependable service.
You’ll work hands-on across bare-metal lifecycle (PXE/iPXE, MAAS, BMC/Redfish), Kubernetes with the NVIDIA GPU Operator (MIG/MPS, device plugins, topology-aware scheduling), the GPU stack itself (drivers, CUDA, NCCL, container runtimes), and the integration glue that makes scheduling, networking, storage, and compute behave predictably under real workloads.
You’ll partner closely with the Network, Storage, and Systems engineers: aligning on architecture, escalating cross-domain issues, and making sure each layer’s behavior contributes to a platform users can trust. Success looks like fast, boring bring-ups, calm upgrades, and a platform that holds together under real GPU workloads (training, fine-tuning, and inference).
In short, success in this role means the on-prem HPC and GPU platform is delivered as a coherent, dependable service across provisioning, scheduling, networking, storage, and GPU operations, rather than as a stack of disconnected components.
Get hands-on with the existing environment: review provisioning, scheduler configuration, GPU stack versions, networking, and storage. Validate observability and incident history, then prioritize the highest-impact reliability and automation work.

Submit your CV, LinkedIn, and GitHub via the form, and we’ll review your profile.
If your skills align, we’ll reach out for a quick conversation about your experience and project preferences.
Once selected, we’ll match you with a client project that fits your expertise. A brief onboarding gets you set up with our tools and ready to start.