HPC Platform Engineer

Remote
Full Time
Own on-prem GPU/HPC platforms end-to-end: bare-metal, Slurm/Kubernetes, networking, storage, and GPU orchestration

Job Overview

We’re hiring an HPC Platform Engineer to own the on-prem GPU and HPC platform as an integrated whole. Where our Network, Storage, and Systems engineers go deep in their domains, this role owns the layers that hold the platform together—provisioning, GPU orchestration, and the cross-domain integration work that turns a stack of components into a dependable service.

You’ll work hands-on across bare-metal lifecycle (PXE/iPXE, MAAS, BMC/Redfish), Kubernetes with the NVIDIA GPU Operator (MIG/MPS, device plugins, topology-aware scheduling), the GPU stack itself (drivers, CUDA, NCCL, container runtimes), and the integration glue that makes scheduling, networking, storage, and compute behave predictably under real workloads.

What you’ll focus on

  • Provisioning & lifecycle: bare-metal automation, OS imaging, firmware/driver management, and predictable bring-up
  • GPU orchestration: Kubernetes with the NVIDIA GPU Operator, MIG/MPS, and workload scheduling
  • Cross-domain integration: stitching together scheduling, networking, storage, and compute into a coherent platform
  • Operational excellence: upgrades, capacity planning, runbooks, and incident response with measurable improvements

You’ll partner closely with the Network, Storage, and Systems engineers—aligning on architecture, escalating cross-domain issues, and making sure each layer’s behavior contributes to a platform users can trust. Success looks like fast, boring bring-ups, calm upgrades, and a platform that holds together under real GPU workloads—training, fine-tuning, and inference.

Responsibilities

Success in this role means the on-prem HPC and GPU platform is delivered as a coherent, dependable service—across provisioning, scheduling, networking, storage, and GPU operations—rather than a stack of disconnected components.

  1. Keep the platform available and predictable for compute-intensive workloads at scale.
  2. Reduce operational toil through automation, repeatable processes, and clear standards.
  3. Shorten time-to-resolution for incidents with strong observability and disciplined root-cause analysis.
  4. Ship changes safely with tested procedures, validation, and clean rollback plans.

From day one

Get hands-on with the existing environment: review provisioning, scheduler configuration, GPU stack versions, networking, and storage. Validate observability and incident history, then prioritize the highest-impact reliability and automation work.
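To give a flavor of that day-one audit (an illustration, not a prescribed tool), a first pass over GPU stack health might parse the CSV output of `nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv,noheader`; the temperature threshold below is an assumption, not a vendor limit:

```python
import csv
import io

def parse_gpu_query(csv_text: str) -> list[dict]:
    # Parse `nvidia-smi --query-gpu=index,name,temperature.gpu
    # --format=csv,noheader` output into one dict per GPU.
    rows = []
    for fields in csv.reader(io.StringIO(csv_text)):
        if not fields:
            continue
        idx, name, temp = (f.strip() for f in fields)
        rows.append({"index": int(idx), "name": name, "temp_c": int(temp)})
    return rows

def hot_gpus(rows: list[dict], limit_c: int = 85) -> list[int]:
    # Indices of GPUs at or above a temperature threshold
    # (85 C here is an illustrative cutoff, not a vendor spec).
    return [r["index"] for r in rows if r["temp_c"] >= limit_c]
```

In practice this kind of check would be fed by DCGM telemetry and wired into alerting rather than run ad hoc, but it shows the shape of the audit work.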

What you’ll own

  • Bare-metal provisioning and lifecycle: PXE/iPXE, MAAS or equivalent, golden images, and BMC/Redfish-based automation across heterogeneous hardware.
  • Linux cluster operations: OS, kernel, drivers, systemd, security hardening, and configuration management at scale.
  • Scheduling and orchestration: Slurm (partitions/QoS, fairshare, accounting) and/or Kubernetes (GPU Operator, device plugins, MIG/MPS, topology-aware scheduling).
  • GPU stack health: firmware, CUDA, NVIDIA Container Toolkit, DCGM telemetry, NCCL validation, and known-good state across the fleet.
  • High-performance networking awareness: InfiniBand/RoCE behavior, MTU/PFC/ECN/QoS impact on workloads, and partnering with network engineers on fabric design.
  • Storage integration: parallel filesystems (Lustre/GPFS/BeeGFS) and shared filesystems tuned for HPC I/O patterns.
  • Observability: metrics, logs, alerts, and SLOs tied to availability, utilization, job throughput, and time-to-provision.
  • Upgrades and migrations (Kubernetes, Slurm, OS, drivers, firmware) with tested rollbacks and minimal user impact.
  • Documentation: architecture decisions, runbooks, change records, and post-incident reviews with concrete follow-ups.
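For context on the fairshare and QoS knobs mentioned above: Slurm's multifactor plugin computes job priority as a weighted sum of normalized factors. A simplified sketch of that formula (real Slurm adds job-size, TRES, partition, and site factors):

```python
def multifactor_priority(factors: dict[str, float],
                         weights: dict[str, int]) -> int:
    # Simplified Slurm multifactor priority: each factor (fairshare, age,
    # QOS, ...) is normalized to [0, 1], scaled by its site-configured
    # PriorityWeight* value, then summed into an integer priority.
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"factor {name!r} must lie in [0, 1]")
    return sum(int(weights[name] * factors[name]) for name in factors)
```

For example, a job with half its fairshare remaining and maximum age, under weights of 10000 (fairshare) and 1000 (age), scores 6000; tuning those weights is exactly the kind of scheduling work this role owns.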

Requirements and Skills

Requirements

  • 5+ years operating production Linux infrastructure at scale, including hands-on HPC, GPU, or performance-sensitive environments.
  • Demonstrated breadth: comfortable working across provisioning, scheduling, networking, storage, and GPU operations rather than a single silo.
  • Strong Linux fundamentals: systemd, networking, storage, kernel/driver troubleshooting, and performance debugging.
  • Comfortable with on-call rotations, change windows, and disciplined incident response.

Technical

  • Bare-metal provisioning and lifecycle automation: PXE/iPXE, MAAS or similar, image build pipelines, BMC/IPMI/Redfish, firmware/driver management.
  • Working experience with at least one HPC scheduler (e.g., Slurm) and/or Kubernetes with GPU workloads (NVIDIA GPU Operator, device plugins, MIG/MPS).
  • GPU operations: drivers, CUDA, NVIDIA Container Toolkit, DCGM-based observability, and NCCL validation/troubleshooting.
  • High-performance networking awareness: InfiniBand and/or RoCE fundamentals and how fabric behavior affects real workloads.
  • Parallel storage exposure (e.g., Lustre, GPFS, BeeGFS) and how I/O patterns interact with compute performance.
  • Infrastructure-as-code and configuration management (Ansible, Terraform) plus scripting/automation in Python and/or Go.

Experience

  • Operating GPU or HPC clusters supporting real workloads (training, fine-tuning, inference, MPI/OpenMP, scientific computing).
  • Driving upgrades and migrations safely, with measurable outcomes and clear stakeholder communication.
  • Building automation that makes the platform easier to operate over time—not just one-off scripts.
  • Fluent English for documentation, change planning, and cross-team coordination.

Bonus Points

  • Experience designing or operating multi-tenant GPU platforms or GPU-as-a-Service environments.
  • Familiarity with hybrid Slurm + Kubernetes patterns and converged HPC/AI workflows.
  • Low-level diagnostics across NUMA, IRQ affinity, PCIe topology, and NVIDIA tools (nvidia-smi, nvbandwidth, DCGM, NCCL tests).
  • Contributions to open-source HPC, Kubernetes, or GPU tooling.

Application Process

1. Apply

Submit your CV, LinkedIn, and GitHub via the form. We’ll review your profile.

2. Screening

If your skills align, we’ll reach out for a quick conversation to understand your experience and project preferences.

3. Get Matched

Once selected, we’ll match you with a client project that fits your expertise. A brief onboarding ensures you’re set up with our tools and ready to start.
