Scalable Biotech Cloud Infrastructure with Apache Airflow, Kubernetes, AWS, and Terraform

We used Terraform and Terragrunt to create the infrastructure, Kubernetes (AWS EKS) to orchestrate everything, Argo CD for GitOps, and Apache Airflow deployed on the Kubernetes cluster, with separate DEV and PROD environments for a smooth development process.
Company name: Stealth Startup
Industry: Biotechnology
R&D size: 20
Scale: 10–200 nodes

1. Initial state

This startup is at the forefront of developing phage cocktails and personalized treatments to target and destroy harmful bacteria in chronic diseases. Before our collaboration started, their data pipelines and batch jobs were at an early stage, and infrastructure management was mostly manual and fragmented:

  • Infrastructure wasn't defined as code.
  • The deployment process was custom and not standardized.
  • There was no separation between environments.
  • Pipeline orchestration lacked consistency and automation.
  • Onboarding new contributors and maintaining the pipelines was difficult.

2. Project goals

  • Build the infrastructure for orchestrating the informatics team's data pipelines.
  • Make collaborative development and testing of data pipelines smooth.
  • Automate the flow from development to production using CI/CD and GitOps practices with Argo CD.

3. Decisions

To achieve these goals, we made the following decisions:

  • Use Terraform and Terragrunt to provision all of the infrastructure (sketched after this list).
  • Use Apache Airflow to orchestrate the data pipelines, and deploy it on Kubernetes for scaling flexibility.
  • Use Argo CD to continuously deploy the data pipelines and apps using a GitOps approach.
  • Create a CI/CD process and branching strategy to enable pushing changes through the environments in an organized way.
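
The environment split is easiest to see in the Terragrunt layout. A minimal sketch, assuming a hypothetical live/<env>/<component> folder structure; the repository URL, module names, and input values are illustrative, not the client's actual code:

```hcl
# live/dev/eks/terragrunt.hcl (hypothetical layout; only the inputs differ per environment)
terraform {
  # The double slash marks the module subdirectory inside the modules repository
  source = "git::https://github.com/example-org/infra-modules.git//eks?ref=v1.0.0"
}

include "root" {
  # Shared remote-state and provider configuration from a parent terragrunt.hcl
  path = find_in_parent_folders()
}

inputs = {
  cluster_name    = "data-pipelines-dev"
  cluster_version = "1.29"
  # DEV keeps a small baseline; the PROD terragrunt.hcl raises these values
  min_size = 2
  max_size = 10
}
```

Promoting a change then becomes a pull request against the PROD folder instead of a hand-run script.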

4. Strategy

We created reusable Terraform modules and Helm charts for infrastructure and applications, deployed development and production environments, and ensured everything was fully GitOps-managed. Pipelines were developed and tested with dynamic storage solutions, automated syncing, and retry mechanisms to improve resilience.
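
Because everything lives in Terraform, even the Argo CD Applications can be declared in HCL. A minimal sketch of what the automated syncing and retry mechanisms can look like; the repository URL, paths, and names are placeholders rather than the actual project:

```hcl
# Hypothetical Argo CD Application managed from Terraform (hashicorp/kubernetes provider)
resource "kubernetes_manifest" "data_pipelines_app" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"
    metadata = {
      name      = "data-pipelines"
      namespace = "argocd"
    }
    spec = {
      project = "default"
      source = {
        repoURL        = "https://github.com/example-org/data-pipelines.git" # placeholder
        targetRevision = "main"
        path           = "deploy/overlays/dev"
      }
      destination = {
        server    = "https://kubernetes.default.svc"
        namespace = "airflow"
      }
      syncPolicy = {
        # Argo CD keeps the cluster converged on what Git declares
        automated = {
          prune    = true
          selfHeal = true
        }
        # Failed syncs back off and retry instead of waiting for a human
        retry = {
          limit = 5
          backoff = {
            duration    = "30s"
            factor      = 2
            maxDuration = "5m"
          }
        }
      }
    }
  }
}
```

With selfHeal enabled, manual drift in the cluster is reverted to whatever Git declares, which is what "fully GitOps-managed" means in practice here.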

5. The process

We did a few things in parallel:

  • Created Terraform modules (VPC, EKS, etc.), sketched after this list
  • Created a repository for the data pipelines
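
The modules were written in-house, but the wiring is roughly what the public terraform-aws-modules registry shows. An illustrative sketch using those registry modules; CIDRs, availability zones, and version pins are placeholders:

```hcl
# Illustrative only: public registry modules standing in for the in-house ones
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "data-pipelines"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  # Private subnets need NAT to pull images and reach AWS APIs
  enable_nat_gateway = true
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "data-pipelines"
  cluster_version = "1.29"

  # The cluster lives in the VPC's private subnets
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets
}
```

The in-house modules expose the same kind of inputs, so DEV and PROD differ only in the values Terragrunt feeds them.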

Then we continued with a few more parallel phases:

  • Created the Kubernetes cluster in AWS and bootstrapped it
  • Created a sample data pipeline and tested it on Apache Airflow running on a local Kubernetes cluster (see the sketch after this list)
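
Testing pipelines before touching the shared cluster only needs the official Apache Airflow Helm chart on a local Kubernetes (kind, minikube, or Docker Desktop). A sketch in Terraform, assuming a local kind context named kind-airflow-local; the context name and values are assumptions, and the set block uses the Helm provider v2 style:

```hcl
# Point the Helm provider at a local cluster instead of EKS
provider "helm" {
  kubernetes {
    config_path    = "~/.kube/config"
    config_context = "kind-airflow-local" # assumed local context name
  }
}

# Install the official Apache Airflow chart
resource "helm_release" "airflow_local" {
  name             = "airflow"
  repository       = "https://airflow.apache.org"
  chart            = "airflow"
  namespace        = "airflow"
  create_namespace = true

  # Run each task as its own pod, mirroring the cluster-based setup
  set {
    name  = "executor"
    value = "KubernetesExecutor"
  }
}
```

Because the executor matches production, a DAG exercised locally runs under the same executor model it later uses on EKS.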

After that, we used Argo CD to deploy Apache Airflow on the EKS cluster, created a CI/CD process for the data pipelines, and implemented autoscaling on the EKS cluster using Karpenter (sketched below).
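
Karpenter's scaling behavior is itself declarative, so it flows through the same Terraform/GitOps pipeline as everything else. A sketch of a NodePool using the Karpenter v1 API; the limits and capacity types are illustrative, not the actual scaling policy:

```hcl
# Hypothetical Karpenter NodePool, applied through the same Terraform flow
resource "kubernetes_manifest" "default_node_pool" {
  manifest = {
    apiVersion = "karpenter.sh/v1"
    kind       = "NodePool"
    metadata   = { name = "default" }
    spec = {
      template = {
        spec = {
          # References an EC2NodeClass (AMI, subnets, security groups) defined separately
          nodeClassRef = {
            group = "karpenter.k8s.aws"
            kind  = "EC2NodeClass"
            name  = "default"
          }
          requirements = [{
            key      = "karpenter.sh/capacity-type"
            operator = "In"
            values   = ["spot", "on-demand"]
          }]
        }
      }
      # Hard ceiling so a runaway pipeline can't scale the cluster indefinitely
      limits = { cpu = "1000" }
      disruption = {
        # Reclaim nodes when workloads finish, so bursty DAG runs don't leave idle capacity
        consolidationPolicy = "WhenEmptyOrUnderutilized"
      }
    }
  }
}
```

This is what lets the cluster move across the 10–200 node range: queued Airflow tasks pull nodes into existence, and consolidation releases them once runs finish.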

6. Before & After

Before ❌ → After ✅

  • Manual infrastructure setup → Infrastructure as code
  • No clear environments → Dedicated environments for dev and prod
  • Inconsistent DAG orchestration → Automated DAG deployment and testing with GitOps
  • Debugging pipelines was time-consuming → Improved reliability and developer experience

Explore how we can achieve something similar with you