HPC Storage Engineer

Remote
Full Time
Design, operate, and optimize Lustre/GPFS/BeeGFS HPC storage for high performance, reliability, scalability, and predictable I/O

Job Overview

We’re looking for an HPC Storage Engineer to design, operate, and improve high-throughput storage platforms that power demanding compute workloads. This role is for someone who can move comfortably between architecture and hands-on operations—building reliable parallel filesystems, tuning performance under real load, and keeping availability high when the cluster is busy.

You’ll work across the full lifecycle: deploying and upgrading Lustre, IBM Spectrum Scale (GPFS), and/or BeeGFS; integrating with compute schedulers and network fabrics; and establishing clean operational practices around monitoring, capacity planning, and incident response. You’ll be expected to diagnose complex I/O and metadata bottlenecks, validate changes with meaningful benchmarks, and translate findings into practical improvements.

What success looks like

You leave storage in a clearly better state than you found it—faster, more stable, and easier to operate. Stakeholders trust the platform because performance is measured, issues are understood, and changes are executed with discipline.

  • Predictable performance for mixed HPC/AI workloads through tuning and evidence-based benchmarking
  • Operational resilience via sensible automation, monitoring, and upgrade/rollback planning
  • Clear ownership of capacity trends, failure modes, and documentation that other engineers can use

This is a collaborative engineering role: you’ll partner closely with HPC, Linux, networking, and platform teams to ensure storage design matches workload reality—and you’ll communicate tradeoffs clearly when reliability, cost, and performance pull in different directions.

Responsibilities

You will keep large-scale HPC storage fast, reliable, and predictable—so compute users can focus on science and engineering, not I/O bottlenecks.

  1. Deliver stable, high-performance parallel filesystem services (Lustre, IBM Spectrum Scale/GPFS, BeeGFS) with clear SLAs/SLOs.
  2. Reduce incident frequency and time-to-recovery through better observability, runbooks, and automation.
  3. Improve throughput, latency, and metadata performance through tuning and data-path optimization.
  4. Enable secure, scalable access for diverse workloads (MPI, AI/ML, analytics) across clusters and networks.

From day one

Get hands-on with the existing storage estate: review architecture, failure history, and performance baselines; validate monitoring and alerting; and take ownership of the highest-impact reliability and performance improvements.

What you’ll own

  • Design and evolve filesystem architectures: MDS/MDT, OSS/OST, NSDs, targets, pools, quotas, snapshots, and tiering.
  • Operate and troubleshoot production incidents across storage, networking (InfiniBand/Ethernet), and Linux kernel/user-space components.
  • Performance engineering: benchmark (IOR, mdtest, fio), analyze bottlenecks, tune clients/servers, and validate changes with measurable results.
  • Capacity planning and lifecycle management: growth forecasting, rebalancing, upgrades, migrations, and decommissioning with minimal downtime.
  • Automation and configuration management (e.g., Ansible, scripting): repeatable builds, patching, health checks, and self-service workflows.
  • Data integrity and security: access controls, encryption where applicable, auditability, and safe operational practices.
  • Collaborate with HPC admins, network engineers, and application teams to align storage behavior with real workload patterns.
  • Maintain operational documentation: runbooks, change plans, post-incident reports, and standard operating procedures.

Requirements and Skills

Requirements

  • 3 to 7 or more years of hands-on experience operating and supporting HPC storage platforms and parallel filesystems in production.
  • Strong Linux systems administration skills (RHEL/Rocky/Ubuntu), including performance tuning, troubleshooting, and automation.
  • Fluent English (written and spoken) for cross-team collaboration and clear operational documentation.

Technical

  • Proven expertise with at least one parallel filesystem: Lustre, IBM Spectrum Scale (GPFS), or BeeGFS (installation, upgrades, configuration, and day-2 operations).
  • Deep understanding of storage and networking fundamentals: RAID, multipathing, NVMe/SAS, TCP/IP, RDMA/InfiniBand, and common bottlenecks in HPC I/O paths.
  • Experience with monitoring and observability for storage systems (metrics, logs, alerting) and systematic incident response/root-cause analysis.
  • Ability to benchmark and tune I/O performance using relevant tools (e.g., fio, IOR, mdtest) and translate results into actionable changes.

Experience

  • Operating clustered environments at scale: capacity planning, lifecycle management, patching windows, and high-availability considerations.
  • Automation/scripting for repeatable operations (Bash and/or Python), including configuration management and safe change workflows.
  • Working closely with compute, networking, and platform teams to plan changes, coordinate maintenance, and resolve cross-domain issues.

Bonus Points

  • Experience with HPC schedulers and user workflows (e.g., Slurm) and how storage performance impacts job throughput.
  • Familiarity with object storage or tiering concepts (e.g., S3, HSM) and integrating them with parallel filesystem environments.
  • Exposure to containers in HPC (Apptainer/Singularity, Docker) and best practices for storage access patterns.

Application Process

  1. Apply: Submit your CV, LinkedIn, and GitHub via the form. We’ll review your profile.
  2. Screening: If your skills align, we’ll reach out for a quick conversation to understand your experience and project preferences.
  3. Get Matched: Once selected, we’ll match you with a client project that fits your expertise. A brief onboarding ensures you’re set up with our tools and ready to start.
