1. Initial state
Taranis is an agriculture tech company.
They take images of crop fields from satellites and drones, analyze them, and provide farmers with insights regarding the state of their fields. Some examples:
- Alerts on damaging insects
- Alerts on areas lacking certain minerals
Behind the scenes, everything runs on Kubernetes in one of the large Cloud providers (kept confidential at the client's request).
Their state when they met us:
- Everything was created manually using the Cloud's Console UI
- Frequent Jenkins crashes due to high-scale usage and plugin modifications
- Some parts of the system had both QA and Production environments, while other parts had no QA / Development environment
- Scale varying between 150 - 800 Nodes
Managing the infrastructure had become complex. Manual fixes were increasingly problematic: more often than not they broke things in production, and there was no clear audit trail.
2. Tech stack
3. Project goals
- Make it easier and safer to provision infrastructure
- Enable testing Jenkins changes
4. Decisions
To achieve the goals, we made a couple of decisions:
- Use Pulumi to manage the infrastructure - the developers already know TypeScript, so Pulumi makes it accessible for them to create & own infrastructure
- Create DEV & STAGING environments and gradually add components to them
- Use the new environments to test both infrastructure and application changes before modifying production
5. Restrictions
We had one main restriction:
- No recreation of resources
So we decided to import the existing resources into Pulumi and base the code on the existing Cloud architecture.
The idea was to create the new DEV and STAGING environments out of the same Infrastructure-as-Code that manages production.
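As a minimal sketch of how one codebase could both adopt existing production resources and create fresh ones elsewhere, the snippet below derives resource options from per-stack settings. The `ImportableOptions` interface is a local stand-in that mirrors the `import` and `protect` fields of Pulumi's resource options. `StackSettings`, `existingId`, `optionsFor`, and the example ID are hypothetical names used for illustration, not Taranis' actual code.

```typescript
// Local stand-in for the subset of Pulumi resource options used in this sketch.
interface ImportableOptions {
  import?: string;   // ID of an existing cloud resource to adopt instead of creating it
  protect?: boolean; // ask the engine to refuse deleting the resource
}

// Hypothetical per-stack settings: only a stack that adopts existing
// infrastructure (PROD in this case study) would carry an existing ID.
interface StackSettings {
  existingId?: string;
}

// Returns import options when an existing ID is configured, and empty
// options otherwise, so the resource is created from scratch.
function optionsFor(settings: StackSettings): ImportableOptions {
  if (settings.existingId === undefined) {
    return {};
  }
  return { import: settings.existingId, protect: true };
}

// Illustrative values only.
const prod = optionsFor({ existingId: "example-cluster-id" });
const dev = optionsFor({});
console.log(prod, dev);
```

The branch depends only on whether a value is present in the stack's settings, not on the environment's name, so the same code path serves every stack.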
6. Strategy
The above goals, decisions, and restrictions resulted in the following strategy:
- Create a Pulumi Project per the Cloud's organizational unit (Project / Resource Group / Organization).
- We defined an environment as a Pulumi Stack, so that one Stack could span multiple Pulumi projects and still represent the same environment (DEV / PROD / etc.)
- Start with a 'PROD' Pulumi Stack, into which the existing infrastructure is imported, and then use the same code to spin up an entire DEV / STAGING environment from scratch
- Guiding principle: The only difference between environments must be configuration (stack configuration in this case) - no environment-specific logic in the code
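The guiding principle above can be illustrated with a small sketch in which a single function turns a stack's configuration into a cluster specification without ever checking which environment it runs in. `EnvConfig`, `ClusterSpec`, `clusterSpec`, and all values are hypothetical. In an actual Pulumi program the values would typically be read from stack configuration (for example through `pulumi.Config`) rather than hard-coded as they are here.

```typescript
// Hypothetical shape of one stack's configuration.
interface EnvConfig {
  clusterName: string;
  minNodes: number;
  maxNodes: number;
}

// Hypothetical, simplified description of a cluster to provision.
interface ClusterSpec {
  name: string;
  autoscaling: { min: number; max: number };
}

// The same function runs for every stack; only its input differs.
function clusterSpec(cfg: EnvConfig): ClusterSpec {
  if (cfg.minNodes > cfg.maxNodes) {
    throw new Error(`minNodes (${cfg.minNodes}) exceeds maxNodes (${cfg.maxNodes})`);
  }
  return {
    name: cfg.clusterName,
    autoscaling: { min: cfg.minNodes, max: cfg.maxNodes },
  };
}

// Illustrative values only, standing in for two stacks' configuration.
const devConfig: EnvConfig = { clusterName: "workflows-dev", minNodes: 1, maxNodes: 5 };
const prodConfig: EnvConfig = { clusterName: "workflows-prod", minNodes: 3, maxNodes: 50 };

const devSpec = clusterSpec(devConfig);
const prodSpec = clusterSpec(prodConfig);
console.log(devSpec, prodSpec);
```

Keeping the validation inside the shared function means a bad value fails the same way in every environment, which is part of what makes DEV and STAGING useful rehearsals for production.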
7. The process
The process of transforming Taranis' infrastructure was methodical and detailed:
- Created the Pulumi Projects per Cloud project/resource-group
- Started with importing the resources of the largest Cloud project/resource-group
The import process:
- During the import process, we used a temporary stack: we imported the existing cloud resources into it and ran 'pulumi preview' against it to check that each resource was imported as expected. This helped us find resources that couldn't be imported as is (for example, the state of an imported Kubernetes cluster also contains its NodePools / NodeGroups, even though those are separate resources).
- With every import we refactored the code so that the next resource we create comes bundled with the best practices we've already implemented - for example, we ended up creating new Kubernetes clusters for argo-workflows, and created them with autoscaling, GPU/CPU/Memory-intensive Nodes, custom annotations, custom storageClasses, etc. - all out of the box.
- The last step was creating new stacks for dev and staging, and spinning up environments identical to production - we even went as far as creating a dev instance of Jenkins to test upgrading plugins.
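The preview check described in the import process could be automated along the following lines. This is a sketch under stated assumptions: `PlannedStep` is an assumed minimal shape for one planned operation, loosely modelled on the steps reported by 'pulumi preview', and `isDestructive`, `destructiveSteps`, and the example URNs are hypothetical. The idea is simply to flag any step that would delete or replace a resource, since recreation was not allowed.

```typescript
// Assumed minimal shape of one planned step (an assumption, not a full schema).
interface PlannedStep {
  op: string;  // e.g. "same", "import", "create", "update", "replace", "delete"
  urn: string; // identifier of the affected resource
}

// Operations that would destroy or recreate an existing resource.
function isDestructive(op: string): boolean {
  switch (op) {
    case "delete":
    case "replace":
    case "create-replacement":
    case "delete-replaced":
      return true;
    default:
      return false;
  }
}

// Returns the steps that violate the "no recreation of resources" restriction.
function destructiveSteps(steps: PlannedStep[]): PlannedStep[] {
  return steps.filter((step) => isDestructive(step.op));
}

// Illustrative plan: the cluster imports cleanly, the node pool would be replaced.
const plan: PlannedStep[] = [
  { op: "import", urn: "urn:example:cluster" },
  { op: "replace", urn: "urn:example:node-pool" },
];

const blocked = destructiveSteps(plan);
console.log(blocked);
```

Running a check like this against the temporary stack surfaces problem resources before anything touches the production stack.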
8. Results
- The entire infrastructure is managed using Pulumi
- There are new DEV & STAGING environments (which we even used to test Jenkins plugin upgrades)
- The only difference between the environments is the Pulumi Stack configuration
- The developers started contributing code and taking ownership over the infrastructure they need
- Gaps between environments gradually decreased
Worth mentioning:
We did other things with Taranis as well, such as supporting multi-cluster communication between services, improving monitoring, streamlining Kubernetes cluster upgrades, handling Kubernetes deprecations, streamlining the development of gRPC services, etc.
9. Before & After
| Before ❌ | After ✅ |
| --- | --- |
| Manual infrastructure management via the Cloud console UI | Automated infrastructure management with Pulumi |
| Frequent Jenkins crashes causing delays | Stable CI/CD pipelines with dedicated environments |
| Partial development environments for infra and some parts of the system | Introduction of development and staging environments |