Enabling Robotics with Cloud IaC, Connectivity, Automation & Observability

How we helped Skyline Robotics securely manage and monitor their fleet of robots, automate CI/CD processes, and organize their infrastructure and automation
Company name: Skyline Robotics
Industry: Robotics
R&D size: 10-50 Engineers
Scale: 20-100 Nodes

1. Initial state

Skyline Robotics is a deep tech robotics and automation company. Its flagship product, Ozmo, the world’s first high-rise window cleaning robot, is disrupting the $40B window cleaning industry as a safer, more efficient, and more effective alternative to humans.

Their state when they met us:

  • The CI/CD process was slow, inefficient, and error-prone.
  • Setting up a robot and deploying code involved too many manual steps; developers and testers had to travel to the robots in the field to run deployments over SSH.
  • There was only a dev environment.
  • Provisioning and deployment of other satellite services were manual.

2. Project goals

  • Standardize and enhance the CI/CD pipelines with caching and error handling for both the robot core and satellite services.
  • Simplify the deployment process to make it efficient and fault-tolerant.
  • Manage all infrastructure as code (IaC).
  • Provide the robots with connectivity through a reverse tunnel.
Checklist Icon

3. Decisions

  • Implement downstream pipelines, leverage file-level change detection, and introduce caching to streamline and accelerate the CI/CD process (see the pipeline sketch after this list).
  • Use Ansible to manage deployment scripts.
  • Use Terraform & Terragrunt to manage infrastructure as code.
  • Use AWS SSM as the robot VM management tool and connectivity enabler.
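As an illustration of how these decisions fit together, here is a minimal GitLab CI sketch; the repo paths, file patterns, and job names are hypothetical, not Skyline's actual configuration:

```yaml
# .gitlab-ci.yml (sketch; paths and project names are assumptions)
stages: [build, test, deploy]

build-robot-image:
  stage: build
  rules:
    # File-level change detection: rebuild only when sources change
    - changes:
        - src/**/*
        - Dockerfile
  cache:
    key: "$CI_COMMIT_REF_SLUG"
    paths:
      - .cache/               # hypothetical dependency cache directory
  script:
    # Docker layer caching: reuse layers from the last pushed image
    - docker pull "$CI_REGISTRY_IMAGE:latest" || true
    - docker build --cache-from "$CI_REGISTRY_IMAGE:latest" -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"

trigger-tests:
  stage: test
  trigger:
    project: skyline/robot-tests        # hypothetical downstream test repo
    strategy: depend                    # downstream failure fails this pipeline

trigger-deployment:
  stage: deploy
  trigger:
    project: skyline/robot-deployment   # hypothetical downstream deployment repo
    strategy: depend
```

The `strategy: depend` setting makes the upstream pipeline wait on the downstream one, so a failed test or deployment pipeline is visible from the source repo.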

4. Restrictions

  • Robots must initiate all network communication due to NAT and firewall constraints in deployment environments, necessitating a secure, reliable, and scalable outbound connection model.
  • Once deployed, physical access to robots is highly restricted, so all management, diagnostics, and updates must be executable remotely without on-site intervention.
  • Robots may experience periods of limited or no connectivity due to environmental or geographic factors, requiring local resilience and the ability to gracefully resume cloud synchronization once reconnected.
  • Upon re-establishing connectivity, robots must be capable of updating the central system with their latest operational state, metrics, and logs to ensure consistent fleet visibility and monitoring.
  • The system must support autonomous health checks and self-reporting mechanisms so that robots can detect issues, perform corrective actions, and alert the cloud infrastructure without manual triggering.

5. Strategy

  • Reorganized the GitLab repository into three separate repos (application, test, and deployment) to reduce complexity. The CI pipeline triggers downstream pipelines across these repos, with appropriate permissions assigned to developers, testers, and operations team members. It also leverages file change detection, GitLab CI caching, and Docker layer and multi-stage build caching.
  • Integrated AWS SSM installation and registration into the robot initialization hook during manufacturing, allowing secure remote reverse SSH and port-forwarding access to robots via the AWS Console without needing to know their network addresses.
  • Ansible is used to manage deployment steps and verify post-deployment status, with Ansible playbooks run through AWS SSM for continuous deployment (see the sketch after this list).
  • Migrated manually provisioned resources to Terraform and Terragrunt, with pipelines set up to automatically run plan and apply operations on Git changes.
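The following shell sketch shows the three SSM touchpoints described above; the activation variables, instance ID, tag, and S3 path are all hypothetical:

```bash
# 1. During manufacturing: register the robot with AWS SSM as a
#    hybrid-managed node, using an activation created beforehand
#    with `aws ssm create-activation`.
sudo amazon-ssm-agent -register -code "$ACTIVATION_CODE" -id "$ACTIVATION_ID" -region "eu-west-1"

# 2. Remote access: open an outbound SSM session with SSH port
#    forwarding, with no need to know the robot's network address.
aws ssm start-session \
  --target mi-0123456789abcdef0 \
  --document-name AWS-StartPortForwardingSession \
  --parameters '{"portNumber":["22"],"localPortNumber":["2222"]}'

# 3. Continuous deployment: run an Ansible playbook on every robot
#    carrying a given tag.
aws ssm send-command \
  --document-name AWS-ApplyAnsiblePlaybooks \
  --targets "Key=tag:Fleet,Values=ozmo" \
  --parameters '{"SourceType":["S3"],"SourceInfo":["{\"path\":\"https://example-bucket.s3.amazonaws.com/playbooks/\"}"],"InstallDependencies":["True"],"PlaybookFile":["deploy.yml"],"ExtraVariables":["SSM=True"],"Check":["False"],"Verbose":["-v"]}'
```

The same `send-command` invocation can be issued from the deployment pipeline, which is how deployment results end up synced back to pipeline outputs.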

6. The process

  • We recognized that the repository reorganization could introduce breaking changes and disrupt the team's workflow, so we approached it carefully by minimizing the scope of changes and ensuring a fresh backup was taken before any rollout. To maintain alignment, we conducted weekly review sessions with the development team to explain the updates and guide them in adopting best practices.
  • During the AWS SSM integration, we collaborated closely with the manufacturing team to understand their installation process, aligning our integration with their standards to ensure smooth provisioning of the robots. Given that robots had previously been set up manually by various users, it was difficult to track which tools and versions were required. To address this, we used AWS SSM Inventory to collect installed dependencies from existing robots and incorporated this information into our Ansible playbook to avoid missing any components. We also replicated a robot VM for efficient testing and validation of the playbook on clean environments. Furthermore, we coordinated with the operations team to perform deployment "fire drills" on multiple physical robots, validating our approach and celebrating successful rollouts.
  • To ensure staging and production environments could be prepared efficiently and reliably, we advocated for using managed services wherever possible to reduce operational overhead. We applied infrastructure best practices, including organizing Terraform modules clearly, structuring Terragrunt environment configurations (see the layout sketch below), and implementing cost optimization strategies such as scale-in/scale-out schedules, instance type selection for utilization efficiency, and recommending a Savings Plans purchase.
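As an illustration of the module/environment split, here is a hypothetical repository layout (all directory and module names are invented for the sketch):

```
infrastructure/
├── modules/                    # reusable Terraform modules
│   ├── networking/
│   ├── monitoring/
│   └── fleet-backend/
└── live/                       # Terragrunt environment configurations
    ├── terragrunt.hcl          # shared remote state and provider settings
    ├── dev/
    │   └── fleet-backend/terragrunt.hcl
    ├── staging/
    │   └── fleet-backend/terragrunt.hcl
    └── prod/
        └── fleet-backend/terragrunt.hcl
```

Each leaf terragrunt.hcl points at a module under modules/ and supplies environment-specific inputs, so dev, staging, and production stay structurally identical and differ only in configuration.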

7. Results

  • Build steps are now triggered only when changes are detected in the source code, significantly reducing unnecessary work. Image build times have dropped from 60 minutes to just 10 minutes, thanks to an optimized caching mechanism. 
  • Ansible combined with AWS SSM enables seamless deployment to groups of robots using tagging, with deployment results automatically synced to pipeline outputs. This setup allows team members to operate and monitor field-deployed robots remotely via the AWS Console, eliminating the need for on-site presence. 
  • Infrastructure is fully synchronized with GitLab; any changes must be submitted via merge requests and reviewed through Terraform/Terragrunt plan previews before being applied (see the sketch below).
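A minimal sketch of that plan-preview flow in GitLab CI; the job names and rules are assumptions rather than the exact pipeline:

```yaml
stages: [plan, apply]

terragrunt-plan:
  stage: plan
  script:
    - terragrunt run-all plan
  rules:
    # Plan previews run on every merge request for review
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

terragrunt-apply:
  stage: apply
  script:
    - terragrunt run-all apply --terragrunt-non-interactive
  rules:
    # Apply runs only on the default branch, after the MR is merged
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```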

8. Before & After

Before ❌: Build steps took 60 minutes and were error-prone.
After ✅: Image builds take just 10 minutes, thanks to an optimized caching mechanism.

Before ❌: Setting up a robot and deploying code involved many manual steps; developers had to travel to robot locations.
After ✅: Ansible and AWS SSM enable seamless remote deployment to robots.

Before ❌: There was only a development environment.
After ✅: Staging and production environments are managed as code and fully synchronized with GitLab, with changes reviewed before being applied.


Explore how we can achieve something similar with you