Kafka Expert

Engineering

Remote

Part Time

Freelance Kafka specialist needed to stabilize and modernize a 6–7 year old on-prem VMware Kafka cluster.

Job Overview

MeteorOps is looking for a freelance Kafka troubleshooting & modernization specialist to step into an older on-prem Kafka environment that currently has no dedicated Kafka owner and limited observability. The cluster supports real-time market quote / HFT tick data at very high throughput (potentially millions of messages/sec) and feeds downstream systems including downsampling services and a SQL Server writer, eventually supporting trading execution workflows.

The Kafka setup is 6–7 years old, deployed on VMware on-prem VMs with 10 Kafka brokers and 5 ZooKeeper nodes, running Kafka 3.0.0 (the Scala 2.13 build, i.e. the kafka_2.13-3.0.0 distribution). Each broker has multiple data disks (currently stated as 7 disks of ~1TB each; prior notes mention higher disk counts, so part of the engagement will be to verify the actual layout). Historically, disk usage has sat at roughly 10%, but recently one or more brokers spiked toward 100%, coinciding with application Kafka errors and broker/topic instability (e.g., missing leader, invalid partition, impaired topic failover).

You’ll diagnose the incident and underlying risks, produce a clear findings + recommendations report, and help the engineering team implement pragmatic improvements: monitoring, tooling, operational runbooks, resilience/failover improvements, and an assessment of upgrade options (including a path away from ZooKeeper).

Responsibilities

Rapid triage & incident diagnosis
- Confirm scope of the recent issue (disk saturation, broker health, controller/ZK health, partition leadership, ISR, replication, rebalances).
- Determine why disk utilization jumped from ~10% to near 100% (retention changes, log segment growth, stuck cleanup, partition skew, under-replication, data directory imbalance, etc.).
- Identify root causes of missing leader, topic access failures, and invalid partition behavior.
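
As an illustration of the first triage pass, the sketch below scans the text printed by `kafka-topics.sh --describe` and flags partitions with no leader or with fewer in-sync replicas than assigned replicas. It is a minimal sketch, not a prescribed method: the topic name `quotes` and the sample lines are hypothetical, and the parser assumes the usual `Topic: / Partition: / Leader: / Replicas: / Isr:` line layout, treating both `none` and `-1` as "no leader".

```python
import re
from typing import NamedTuple


class PartitionState(NamedTuple):
    topic: str
    partition: int
    leader: str          # broker id as text, or "none" / "-1" when leaderless
    replicas: list[str]
    isr: list[str]


# Matches per-partition lines only; topic summary lines do not match.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]*)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def _ids(field: str) -> list[str]:
    """Split a comma-separated broker id list, dropping empty entries."""
    return [part for part in field.split(",") if part]


def parse_describe(output: str) -> list[PartitionState]:
    """Extract per-partition state from `kafka-topics.sh --describe` text."""
    states = []
    for line in output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue  # summary line, blank line, or unrelated output
        states.append(PartitionState(
            topic=match["topic"],
            partition=int(match["partition"]),
            leader=match["leader"],
            replicas=_ids(match["replicas"]),
            isr=_ids(match["isr"]),
        ))
    return states


def find_unhealthy(states):
    """Return (leaderless, under_replicated) partition lists."""
    leaderless = [s for s in states if s.leader in ("none", "-1")]
    under_replicated = [s for s in states if len(s.isr) < len(s.replicas)]
    return leaderless, under_replicated


# Hypothetical sample output for a topic named "quotes".
SAMPLE_OUTPUT = "\n".join([
    "Topic: quotes\tPartitionCount: 3\tReplicationFactor: 3\tConfigs: ",
    "\tTopic: quotes\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3",
    "\tTopic: quotes\tPartition: 1\tLeader: none\tReplicas: 2,3,4\tIsr: ",
    "\tTopic: quotes\tPartition: 2\tLeader: 3\tReplicas: 3,4,5\tIsr: 3",
])

if __name__ == "__main__":
    leaderless, under = find_unhealthy(parse_describe(SAMPLE_OUTPUT))
    for s in leaderless:
        print(f"NO LEADER: {s.topic}-{s.partition}")
    for s in under:
        print(f"UNDER-REPLICATED: {s.topic}-{s.partition} isr={s.isr}")
```

In practice the input would be the captured output of the describe command against the real cluster rather than the sample string.
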
Cluster assessment & hardening plan
- Review broker configuration, topic settings (replication factor, min.insync.replicas, retention), partition distribution, rack awareness (if any), and failover behavior.
- Evaluate ZooKeeper reliability and operational risks; document current failure domains and bottlenecks.
Observability & operations uplift
- Propose and/or implement proper Kafka monitoring (broker + ZK + OS/disk), dashboards, and alerting (lag, under-replication, disk, controller events, request latency, GC, network, etc.).
- Recommend stronger GUI/management tooling beyond read-only usage (currently Kafdrop and Zabbix are used but limited).
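
To make the alerting items above concrete, here is a minimal sketch of a rule table evaluated against already-collected metric samples. The three MBean names are standard Kafka broker metrics; everything else is an assumption for illustration: the `disk_used_percent` key, the 80% threshold, and the idea that samples arrive as cluster-wide aggregates (sum for the counts, maximum for disk) from whatever collector is chosen.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    metric: str
    description: str
    is_healthy: Callable[[float], bool]


UNDER_REPLICATED = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"
OFFLINE_PARTITIONS = "kafka.controller:type=KafkaController,name=OfflinePartitionsCount"
ACTIVE_CONTROLLERS = "kafka.controller:type=KafkaController,name=ActiveControllerCount"
DISK_USED_PERCENT = "disk_used_percent"  # hypothetical key, not a Kafka metric

RULES = [
    Rule(UNDER_REPLICATED, "no under-replicated partitions", lambda v: v == 0),
    Rule(OFFLINE_PARTITIONS, "no offline partitions", lambda v: v == 0),
    # Summed across brokers, exactly one broker should be the active controller.
    Rule(ACTIVE_CONTROLLERS, "exactly one active controller", lambda v: v == 1),
    # Threshold is an assumed starting point, to be tuned per environment.
    Rule(DISK_USED_PERCENT, "fullest data disk below 80%", lambda v: v < 80),
]


def evaluate(samples: dict[str, float]) -> list[str]:
    """Return one alert message per violated or missing metric."""
    alerts = []
    for rule in RULES:
        if rule.metric not in samples:
            alerts.append(f"NO DATA: {rule.metric}")
        elif not rule.is_healthy(samples[rule.metric]):
            alerts.append(
                f"ALERT: expected {rule.description}, "
                f"got {samples[rule.metric]} for {rule.metric}"
            )
    return alerts


if __name__ == "__main__":
    # Hypothetical snapshot resembling the incident described above.
    snapshot = {
        UNDER_REPLICATED: 12,
        OFFLINE_PARTITIONS: 0,
        ACTIVE_CONTROLLERS: 1,
        DISK_USED_PERCENT: 97.0,
    }
    for message in evaluate(snapshot):
        print(message)
```

Treating a missing metric as an alert is a deliberate choice in this sketch: with limited observability, silence from a broker is itself a signal worth surfacing.
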
Engineering enablement
- Produce a Findings Report and Recommendations / Roadmap (quick wins + medium/long-term).
- Create runbooks for safe operations: broker restart procedure, partition reassignments, capacity checks, config backups, upgrades with minimal disruption, and recovery steps.
- Coach the engineering team and IT on day-to-day Kafka ops and troubleshooting patterns.
Optional improvement work
- Execute selected remediations (e.g., storage rebalancing, retention tuning, partition reassignment, leader imbalance fixes).
- Assess and plan Kafka upgrade strategy, including ZooKeeper removal (KRaft migration path) if appropriate for their risk tolerance and timelines.
- Improve resilience posture toward minimized RTO/RPO (goal: “as low as practical,” possibly ~1 minute max data loss tolerance).
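
Before any storage rebalancing, it helps to quantify how unevenly a broker's data directories are filled. The sketch below is one possible way to do that under stated assumptions: the mount paths, the 80% high-water mark, and the 20-point spread limit are all hypothetical and would need to match the broker's real `log.dirs` layout and capacity targets.

```python
import shutil
from typing import NamedTuple


class SkewReport(NamedTuple):
    hot_dirs: list[str]   # directories at or above the high-water mark
    spread: float         # fullest minus emptiest, as a fraction of capacity
    imbalanced: bool      # True when spread exceeds the allowed limit


def used_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def assess_skew(fractions: dict[str, float],
                high_water: float = 0.80,
                max_spread: float = 0.20) -> SkewReport:
    """Summarize imbalance across one broker's data directories."""
    if not fractions:
        return SkewReport([], 0.0, False)
    spread = max(fractions.values()) - min(fractions.values())
    hot = sorted(d for d, f in fractions.items() if f >= high_water)
    return SkewReport(hot, spread, spread > max_spread)


if __name__ == "__main__":
    # Hypothetical mount points; replace with the broker's actual log.dirs.
    data_dirs = ["/data/kafka1", "/data/kafka2", "/data/kafka3"]
    measured = {}
    for d in data_dirs:
        try:
            measured[d] = used_fraction(d)
        except OSError:
            print(f"cannot stat {d}, skipping")
    print(assess_skew(measured))
```

A report like this only shows filesystem-level skew; Kafka's own `kafka-log-dirs.sh` tool can then break usage down per partition to show which replicas are responsible.
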

Requirements and Skills

Must-have

Proven hands-on experience operating Kafka in production, including high-throughput clusters.
Strong troubleshooting skills, covering:
- Partition leadership issues, missing leaders, ISR shrinkage, under-replicated partitions
- Broker restarts and safe recovery without “sledgehammer” approaches
- Storage/disk issues on multi-disk broker layouts (JBOD patterns, partition skew, log retention/cleanup behavior)
Linux systems competence: disk/IO analysis, filesystem saturation, process/resource analysis, networking basics.
Experience with ZooKeeper-based Kafka clusters and operational best practices.
Ability to deliver clear, actionable documentation: findings, recommendations, and runbooks.
Strong communication skills for working with a mixed team (engineering + IT unfamiliar with Kafka).

Nice-to-have

Experience with Kafka monitoring stacks (e.g., JMX metrics pipelines, Prometheus/Grafana, lag monitoring, alerting design).
Experience with GUI/admin tooling and governance practices (RBAC, auditing approach, safer topic/config workflows).
Experience planning Kafka upgrades and migrations, including evaluation of KRaft readiness and risk.
Familiarity with workloads involving market data / trading systems and latency-sensitive pipelines.
Experience with VMware-based on-prem operations and capacity planning.

Application Process

1. Apply

Submit your CV, LinkedIn, and GitHub via the form. We’ll review your profile.

2. Screening

If your skills align, we’ll reach out for a quick conversation to understand your experience and project preferences.

3. Get Matched

Once selected, we’ll match you with a client project that fits your expertise. A brief onboarding ensures you’re set up with our tools and ready to start.
