HPC Network Engineer

Remote
Full Time
Own design, tuning, and reliability of InfiniBand/RDMA and RoCE fabrics, driving automation, observability, and incident response

Job Overview

As an HPC Network Engineer, you’ll own the design, tuning, and operational reliability of high-performance network fabrics where microseconds matter. This role is for someone who can translate application and cluster demands into stable, measurable network performance—then keep it that way under real workloads.

You’ll work hands-on with InfiniBand and Ethernet-based RDMA (RoCE) environments, focusing on lossless behavior, congestion control, and end-to-end latency. The work spans architecture decisions, build and validation, and pragmatic troubleshooting across switches, HCAs/NICs, cabling/optics, and host configuration. You’ll be expected to use data—telemetry, counters, packet captures, and benchmarks—to pinpoint bottlenecks and drive improvements.

What you’ll focus on

  • Low-latency fabric design: topology selection, oversubscription tradeoffs, and scalable growth planning
  • RDMA performance engineering: MTU, PFC/ECN, QoS, congestion control, and kernel/driver tuning
  • Operational excellence: repeatable build standards, change control, upgrades, and incident response with clear root-cause analysis
  • Validation and observability: performance baselines, regression testing, and actionable monitoring for fabric health

You’ll collaborate closely with HPC/cluster engineers and application teams to align fabric behavior with real job profiles (MPI collectives, storage traffic, east-west patterns). Success looks like predictable latency, high utilization without instability, and faster time-to-diagnosis when issues occur—documented in runbooks and reflected in measurable improvements over time.

Responsibilities

Success in this role means delivering a low-latency, high-throughput fabric that stays stable under real HPC and AI workloads—and making it easier to operate over time through automation, observability, and clear runbooks.

  1. Keep InfiniBand/RDMA and RoCE fabrics meeting agreed latency, loss, and throughput targets.
  2. Reduce incident frequency and mean time to recovery through proactive design and tooling.
  3. Make changes safely with repeatable procedures, testing, and clear rollback plans.

From day one

Start by learning the current topology, failure modes, and operational cadence. Validate baseline performance, review recent incidents, and identify the highest-impact improvements across configuration, monitoring, and change management.

What you’ll own

  • Design and evolve InfiniBand and Ethernet/RoCE architectures (spine-leaf, rail-optimized fabrics, multi-tenant segmentation where applicable).
  • Configure and operate switches, HCAs, and fabric services (e.g., SM, partitioning, ECN/PFC/DCB, congestion control) to meet workload requirements.
  • Lead performance validation: benchmarking, packet/flow analysis, microburst detection, and root-cause analysis for latency, drops, and congestion.
  • Drive reliability practices: redundancy planning, firmware lifecycle management, maintenance windows, and post-incident reviews with actionable follow-ups.
  • Build automation for provisioning, configuration drift detection, and compliance checks (e.g., Ansible, Python, Git-based workflows).
  • Implement observability: telemetry, logs, and alerting tied to SLOs (latency, loss, link health, buffer utilization, error counters).
  • Partner with platform, systems, and workload teams to align fabric behavior with scheduler, storage, and GPU cluster needs.
  • Maintain clear documentation and runbooks for standard operations, troubleshooting, and escalation paths.

Requirements and Skills

Requirements

  • 3+ years (typically 3–7 or more) of hands-on network engineering experience in HPC, low-latency trading, or large-scale compute environments.
  • Strong Linux networking fundamentals (routing, bridging, VLANs, bonding/LACP, MTU/jumbo frames, IRQ/CPU affinity basics).
  • Comfortable working in on-call/incident response rotations and performing maintenance in scheduled change windows.

Low-Latency Fabric (InfiniBand / RDMA / RoCE)

  • Production experience designing, deploying, and operating InfiniBand fabrics (topologies, subnet management, partitioning, link bring-up, health checks).
  • Practical RDMA knowledge: verbs concepts, queue pairs, congestion behavior, and how application requirements translate to fabric configuration.
  • RoCE (v1/v2) implementation experience, including DCB/PFC/ECN configuration and validation to prevent loss and manage congestion.
  • Ability to troubleshoot latency, packet loss, and microbursts using counter-based analysis and targeted tests (e.g., per-port counters, congestion telemetry, synthetic traffic).

Operations, Automation & Tooling

  • Proficiency with common fabric and NIC tools (e.g., Mellanox/NVIDIA utilities, firmware management, link diagnostics, cable/optics validation).
  • Experience building repeatable operations via automation (Python and/or Bash; Ansible preferred) for config rollout, validation, and drift detection.
  • Strong debugging approach: clear incident timelines, root-cause analysis, and actionable postmortems with preventative follow-ups.

Collaboration & Communication

  • Ability to partner closely with compute/storage teams to align network design with workload needs (MPI, distributed training, parallel file systems).
  • Fluent English for technical documentation, change plans, and cross-team coordination.

Bonus Points

  • Experience with GPU clusters and performance-sensitive workloads (MPI tuning considerations, NCCL/RDMA path awareness).
  • Familiarity with parallel storage networking patterns (e.g., Lustre/GPFS access networks) and their latency/throughput trade-offs.
  • Exposure to telemetry/monitoring stacks (Prometheus/Grafana, ELK/OpenSearch) and building meaningful SLOs for fabric health.

Application Process

  1. Apply: Submit your CV, LinkedIn, and GitHub via the form. We’ll review your profile.
  2. Screening: If your skills align, we’ll reach out for a quick conversation to understand your experience and project preferences.
  3. Get Matched: Once selected, we’ll match you with a client project that fits your expertise. A brief onboarding ensures you’re set up with our tools and ready to start.
