HPC Systems Engineer

Remote
Full Time
Own Slurm-based Linux HPC clusters: operate them, tune performance, automate operations, and ensure reliable compute access.

Job Overview

We’re hiring an HPC Systems Engineer to own the reliability, performance, and day-to-day operability of high-performance computing environments. This role is for someone who’s comfortable deep in Linux, understands how researchers and engineers actually use clusters, and can translate that into stable scheduling, predictable throughput, and clean automation.

You’ll focus on building and operating Slurm-based clusters end-to-end: provisioning and configuration, user onboarding, queue and partition design, fair-share policies, and troubleshooting jobs from “why is my node down?” to “why is this MPI run stalling?”. You’ll also help standardize how environments are delivered—modules, containers, images, and repeatable configuration—so the cluster stays maintainable as demand grows.

What success looks like

Clusters are available, observable, and predictable: users can submit jobs with confidence, scheduling behavior matches policy, and incidents are resolved quickly with clear root cause and follow-up improvements. You’ll reduce toil through automation, keep upgrades and changes low-risk, and ensure capacity is used efficiently without sacrificing fairness.

How you’ll work

You’ll collaborate closely with infrastructure and platform stakeholders, partnering with researchers, data/ML teams, and application owners to understand workload patterns and remove bottlenecks. Expect hands-on work, pragmatic trade-offs, and a strong bias toward documentation and operational clarity.

  • Slurm operations: partitions, QoS, accounting, fair-share, and troubleshooting
  • Linux cluster engineering: provisioning, configuration management, patching, and hardening
  • Performance & reliability: monitoring, capacity planning, and incident response
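The Slurm operations work described above centers on scheduler configuration. As a rough illustration (a minimal sketch; the partition names, node ranges, and limits here are hypothetical, not an actual cluster's config):

```
# slurm.conf excerpt -- hypothetical partitions and node names
# Fair-share scheduling via the multifactor priority plugin
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
PriorityWeightAge=1000

# Accounting through slurmdbd, which fair-share and usage reporting depend on
AccountingStorageType=accounting_storage/slurmdbd

# Partitions: a short debug queue and a long-running batch queue
PartitionName=debug Nodes=node[01-04] MaxTime=01:00:00 Default=YES
PartitionName=batch Nodes=node[05-64] MaxTime=7-00:00:00
```

Queue and partition design in this role means making trade-offs like these explicit: how much weight fair-share carries versus job age, and which limits keep interactive and batch workloads from starving each other.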

Responsibilities

Success in this role means researchers and engineers get reliable, fast access to compute—without having to think about the cluster.

  1. Keep Slurm-backed compute capacity available, predictable, and right-sized for demand.
  2. Reduce time-to-resolution for incidents through clear runbooks, metrics, and automation.
  3. Improve job throughput and user experience by tuning scheduling, storage, and network performance.
  4. Ship secure, repeatable cluster changes with minimal downtime.

From day one

Get hands-on with existing Linux clusters, Slurm configuration, and operational tooling. Triage current pain points, validate monitoring and alerting, and establish a clear change process for maintenance windows, upgrades, and user-impacting work.

What you’ll own

  • Operate and evolve Slurm (partitions/QoS, fairshare, job accounting, reservations, preemption) to balance throughput, priority, and cost.
  • Administer Linux cluster nodes: provisioning, patching, kernel/driver management, and lifecycle maintenance across heterogeneous hardware.
  • Automate infrastructure and configuration management (e.g., Ansible) and standardize golden images and node bring-up/replace workflows.
  • Troubleshoot performance and reliability issues across compute, storage, and networking; drive root-cause analysis and corrective actions.
  • Own monitoring, logging, and alerting (Prometheus/Grafana, syslog, Slurm metrics); define SLOs and on-call playbooks.
  • Manage shared filesystems and data paths (e.g., Lustre/GPFS/NFS) and tune for parallel I/O workloads.
  • Implement security hardening: least privilege, SSH/key management, audit trails, vulnerability remediation, and segmentation.
  • Collaborate with research teams to translate workload needs into capacity plans, queue policies, and documentation.
  • Plan and execute upgrades (Slurm, OS, drivers, firmware) with tested rollbacks and clear stakeholder communication.
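The automation and golden-image work above typically looks like configuration management applied to node bring-up. A sketch of that style, assuming Ansible as named in this posting (the role layout, file paths, and group names are hypothetical):

```yaml
# Hypothetical Ansible play for Slurm compute-node bring-up
- name: Configure Slurm compute nodes
  hosts: compute
  become: true
  tasks:
    - name: Install Slurm packages
      ansible.builtin.package:
        name: [slurm, slurm-slurmd]
        state: present

    - name: Deploy cluster-wide slurm.conf
      ansible.builtin.copy:
        src: files/slurm.conf
        dest: /etc/slurm/slurm.conf
        mode: "0644"
      notify: Restart slurmd

  handlers:
    - name: Restart slurmd
      ansible.builtin.service:
        name: slurmd
        state: restarted
```

The point of standardizing on plays like this is that replacing or re-imaging a node becomes a routine, repeatable operation rather than hand-configured work.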

Requirements and Skills

Requirements

  • 3–7 years of experience administering Linux systems in production, including multi-node environments.
  • Hands-on experience operating and supporting Slurm in an HPC or research/engineering compute setting.
  • Comfort working in a ticketed/on-call environment and owning incidents through root-cause analysis and follow-up actions.

Technical

  • Strong Linux fundamentals: systemd, networking, storage, kernel/user limits, performance troubleshooting, and security hardening.
  • Cluster operations: provisioning, configuration management, patching, lifecycle management, and documentation for repeatable builds.
  • Slurm administration: partitions/QoS, accounting, fairshare, job priorities, reservations, node states, and troubleshooting scheduling issues.
  • Scripting and automation using Bash and/or Python; ability to build reliable tooling for day-to-day operations.
  • Monitoring and observability: deploying/maintaining metrics, logging, and alerting to support capacity planning and uptime.
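The scripting expectation above is about small, reliable operational tools. As a hedged example of the genre, a helper that tallies node states from `sinfo` output (the function and sample data are hypothetical, written for illustration; `sinfo -h -o "%P %D %T"` is assumed as the input format):

```python
from collections import Counter

def summarize_node_states(sinfo_output: str) -> Counter:
    """Tally node counts by state from `sinfo -h -o "%P %D %T"` output.

    Each line looks like: "batch 12 idle" (partition, node count, state).
    Hypothetical helper for illustration, not part of Slurm itself.
    """
    states: Counter = Counter()
    for line in sinfo_output.strip().splitlines():
        _partition, count, state = line.split()
        # Strip the trailing "*" Slurm appends to some states (e.g. "down*")
        states[state.rstrip("*")] += int(count)
    return states

# Sample output in the assumed sinfo format
sample = """\
debug 4 idle
batch 40 allocated
batch 12 idle
batch 8 drained
"""
print(summarize_node_states(sample))
```

Tools like this feed directly into the monitoring and capacity-planning work listed here, for example by exporting the tallies as metrics.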

Experience

  • Supporting heterogeneous workloads (MPI, GPU, and high-throughput batch) and working with users to optimize job submissions.
  • Collaborating with researchers/engineers to translate workload needs into scheduler and cluster configuration changes.
  • Clear written documentation for runbooks, change records, and post-incident reviews.

Bonus Points

  • Experience with InfiniBand/RDMA, parallel filesystems (e.g., Lustre/BeeGFS), and performance tuning for low-latency interconnects.
  • Experience with containers in HPC (e.g., Apptainer/Singularity) and integrating container workflows with Slurm.
  • Infrastructure-as-code and CI practices (e.g., Ansible, Terraform, Git-based change management).

Application Process

1. Apply

Submit your CV, LinkedIn, and GitHub via the form. We’ll review your profile.

2. Screening

If your skills align, we’ll reach out for a quick conversation to understand your experience and project preferences.

3. Get Matched

Once selected, we’ll match you with a client project that fits your expertise. A brief onboarding ensures you’re set up with our tools and ready to start.
