As an HPC Network Engineer, you’ll own the design, tuning, and operational reliability of high-performance network fabrics where microseconds matter. This role is for someone who can translate application and cluster demands into stable, measurable network performance—then keep it that way under real workloads.
You’ll work hands-on with InfiniBand and Ethernet-based RDMA (RoCE) environments, focusing on lossless behavior, congestion control, and end-to-end latency. The work spans architecture decisions, build and validation, and pragmatic troubleshooting across switches, HCAs/NICs, cabling/optics, and host configuration. You’ll be expected to use data—telemetry, counters, packet captures, and benchmarks—to pinpoint bottlenecks and drive improvements.
You’ll collaborate closely with HPC/cluster engineers and application teams to align fabric behavior with real job profiles (MPI collectives, storage traffic, east-west patterns). Success looks like predictable latency, high utilization without instability, and faster time-to-diagnosis when issues occur—documented in runbooks and reflected in measurable improvements over time.
Success in this role means delivering a low-latency, high-throughput fabric that stays stable under real HPC and AI workloads—and making it easier to operate over time through automation, observability, and clear runbooks.
Start by learning the current topology, failure modes, and operational cadence. Validate baseline performance, review recent incidents, and identify the highest-impact improvements across configuration, monitoring, and change management.




%20(2).avif)
.avif)









.avif)
Submit your CV, LinkedIn, and GitHub via the form. We’ll review your profile.
If your skills align, we'll reach out for a quick conversation to understand your experience and project preferences.
Once selected, we’ll match you with a client project that fits your expertise. A brief onboarding ensures you're set up with our tools and ready to start.