GPU Performance Engineer

Remote
Full Time
Drive CUDA/NCCL performance by profiling GPU and multi-node workloads, optimizing kernels and comms, and stabilizing training/inference ops

Job Overview

We’re looking for a GPU Performance Engineer to help teams turn “it runs” into “it scales.” This role is for someone who can move confidently between GPU kernels, distributed training communication, and system-level bottlenecks—then translate findings into practical fixes that improve throughput, latency, and cost.

You’ll focus on performance work across CUDA code paths, NCCL-backed multi-GPU/multi-node communication, and end-to-end profiling. The goal is not theoretical optimization—it’s measurable wins: faster training steps, better GPU utilization, stable scaling efficiency, and clear evidence for what changed and why.

What you’ll actually do

You’ll profile real workloads, form hypotheses, run controlled experiments, and deliver improvements that hold up in production. You’ll work across the stack—from kernel-level tuning and memory behavior to overlap of compute/communication and network-aware scaling.

  • Profiling & root cause analysis: use tools like Nsight Systems/Compute and CUDA profiling to pinpoint bottlenecks
  • Distributed performance: analyze NCCL collectives, topology effects, and scaling limits across nodes
  • Optimization delivery: implement and validate changes with benchmarks, regression guards, and clear documentation

How success is measured

Success looks like sustained performance gains that are easy to verify: improved step time, higher effective TFLOP/s, better scaling efficiency, fewer performance regressions, and a repeatable methodology the team can keep using after your engagement.
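As a concrete illustration of one metric above, scaling efficiency is the fraction of ideal linear speedup actually achieved when moving from one GPU to many. A minimal sketch (function and variable names are illustrative, not part of any specific tooling):

```python
def scaling_efficiency(throughput_1gpu: float, throughput_ngpu: float, n_gpus: int) -> float:
    """Fraction of ideal linear speedup achieved when scaling from 1 to n GPUs."""
    return throughput_ngpu / (n_gpus * throughput_1gpu)

# Example: 1 GPU sustains 1,000 samples/s; 8 GPUs sustain 7,200 samples/s.
eff = scaling_efficiency(1000.0, 7200.0, 8)
print(f"{eff:.0%}")  # 90% of ideal linear scaling
```

Tracking this number per run makes "stable scaling efficiency" a verifiable claim rather than an impression.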

How you’ll work with the team

You’ll collaborate closely with ML engineers, systems/platform teams, and anyone touching the training stack. You’ll be expected to communicate clearly—sharing traces, explaining tradeoffs, and recommending the next highest-leverage change instead of chasing micro-optimizations.

Responsibilities

You will make GPU and multi-node training/inference faster, more stable, and easier to operate by turning profiling data into concrete fixes across kernels, communication, and system configuration.

  1. Deliver measurable throughput/latency gains on real workloads (not microbenchmarks alone).
  2. Reduce performance variance and tail latency across runs, nodes, and clusters.
  3. Provide clear, reproducible evidence for bottlenecks and the impact of each change.
  4. Improve developer feedback loops with repeatable benchmarks and automated regression checks.
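Item 4 above can be sketched as a simple regression guard: compare the median step time of a candidate run against a stored baseline and fail when the slowdown exceeds a threshold. This is a hedged sketch with illustrative names and thresholds, not a prescribed implementation:

```python
import statistics

def check_regression(baseline_ms: list[float], candidate_ms: list[float],
                     max_slowdown: float = 0.03) -> bool:
    """Return True if the candidate's median step time is within
    `max_slowdown` (here 3%) of the baseline median."""
    base = statistics.median(baseline_ms)
    cand = statistics.median(candidate_ms)
    return (cand - base) / base <= max_slowdown

baseline = [102.1, 101.8, 102.4, 101.9, 102.0]
ok = check_regression(baseline, [103.0, 102.7, 103.2, 102.9, 103.1])   # ~1% slower: passes
bad = check_regression(baseline, [110.5, 110.9, 110.2, 111.0, 110.7])  # ~8% slower: fails
```

Medians are used rather than means so a single noisy step does not trip the guard; in CI this check would gate merges on the workloads the team cares about.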

From day one

Start by reproducing current performance baselines, validating profiling methodology, and mapping the top bottlenecks across compute, memory, and communication. You’ll work closely with ML engineers and systems teams to prioritize fixes and define success metrics.

What you’ll own

  • Profile and optimize CUDA kernels using Nsight Systems/Compute, CUPTI, and targeted microbenchmarks.
  • Tune collective communication with NCCL (topology-aware settings, channel configuration, IB/RoCE parameters) and validate with multi-node runs.
  • Identify and fix memory bottlenecks (HBM utilization, L2 behavior, allocation patterns, fragmentation) and improve overlap of compute/comm.
  • Drive end-to-end performance improvements across frameworks (e.g., PyTorch/XLA) and custom extensions.
  • Build and maintain benchmarking suites: representative workloads, controlled environments, and statistically sound comparisons.
  • Implement performance regression detection in CI and set actionable thresholds tied to business-critical metrics.
  • Partner with infrastructure/ops to tune drivers, CUDA/NCCL versions, container images, and cluster settings for stability and repeatability.
  • Produce concise performance reports and root-cause analyses with recommended changes and verified outcomes.
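For the "statistically sound comparisons" bullet above, one possible approach is a permutation test on median step times, which makes no normality assumptions and suits noisy GPU benchmarks. A minimal sketch (all names illustrative):

```python
import random
import statistics

def permutation_p_value(a: list[float], b: list[float],
                        n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of medians.
    A small p-value suggests the two benchmark runs differ beyond noise."""
    rng = random.Random(seed)
    observed = abs(statistics.median(a) - statistics.median(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.median(pooled[:len(a)]) - statistics.median(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_perm

fast = [100.1, 99.8, 100.3, 100.0, 99.9]
slow = [108.9, 109.4, 109.1, 108.7, 109.2]
p = permutation_p_value(fast, slow)  # small p: the slowdown is real, not noise
```

A test like this turns "run A is faster than run B" into a defensible claim before an optimization is declared a win.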

Requirements and Skills

Requirements

  • 3–7+ years in GPU/HPC performance engineering, CUDA optimization, or systems-level performance work.
  • Strong proficiency in C/C++ and CUDA; comfortable reading and modifying performance-critical code.
  • Hands-on experience profiling, diagnosing, and optimizing GPU kernels and end-to-end workloads.
  • Solid understanding of GPU architecture fundamentals (memory hierarchy, occupancy, warp execution, synchronization, atomics).

Technical

  • Deep experience with CUDA profiling and analysis tools (e.g., Nsight Systems, Nsight Compute, CUPTI) and turning findings into measurable speedups.
  • Working knowledge of multi-GPU communication and collectives using NCCL; ability to troubleshoot scaling and communication bottlenecks.
  • Experience optimizing memory movement (H2D/D2H transfers, pinned memory, streams, overlap, GPUDirect where applicable) and reducing synchronization overhead.
  • Comfortable with Linux performance tooling and debugging (perf, gdb, strace, flame graphs) and CI-friendly benchmarking practices.

Experience

  • Track record of delivering performance improvements with clear baselines, metrics, and regression prevention.
  • Experience working with distributed training or HPC workloads (single-node and multi-node), including performance at scale.
  • Ability to reproduce and isolate performance issues across hardware/software configurations (driver, CUDA toolkit, kernel, interconnect).

Skills & Qualities

  • Strong problem decomposition: can move from symptom to root cause using experiments, instrumentation, and data.
  • Clear written and verbal communication in English, including documenting findings and proposing actionable fixes.
  • Collaborates effectively with ML engineers, systems engineers, and researchers; can review code and provide pragmatic optimization guidance.

Bonus Points

  • Experience with frameworks and runtimes such as PyTorch, TensorFlow, JAX, Triton, or custom CUDA extensions.
  • Familiarity with RDMA/Infiniband, NCCL tuning, topology-aware communication, and cluster-level performance diagnostics.
  • Experience with performance portability and mixed precision (Tensor Cores, FP16/BF16/FP8) and numerical stability trade-offs.
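The mixed-precision bullet can be made concrete: bfloat16 keeps float32's exponent range but only 7 mantissa bits, which is where the numerical-stability trade-offs come from. A minimal pure-Python sketch that simulates bf16 truncation (real workloads use hardware Tensor Cores; rounding here is simplified to truncation):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Approximate bfloat16 by zeroing the low 16 bits of the float32
    representation, leaving the sign, 8 exponent bits, and 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# bf16 preserves exactly representable values but has ~3 decimal digits
# of precision, so small increments are lost:
print(to_bfloat16(1.5))    # 1.5 survives
print(to_bfloat16(1.001))  # collapses to 1.0
```

This loss of small increments is exactly why techniques such as FP32 accumulation and loss scaling appear alongside FP16/BF16/FP8 in practice.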

Application Process

1. Apply
Submit your CV, LinkedIn, and GitHub via the form. We’ll review your profile.

2. Screening
If your skills align, we’ll reach out for a quick conversation to understand your experience and project preferences.

3. Get Matched
Once selected, we’ll match you with a client project that fits your expertise. A brief onboarding ensures you’re set up with our tools and ready to start.
