This article was originally published on the Mattermost Engineering Blog.

The Challenge of Toil

In Site Reliability Engineering, one of our primary goals is to reduce toil: the manual, repetitive work that tends to scale linearly with service growth. Our journey to automate node rotation for new Amazon Machine Image (AMI) releases exemplifies this mission.

Key Challenges We Faced

  1. Limited Flexibility in kops: The sequential node rotation in Kubernetes Operations (kops) did not give us the flexible, environment-specific handling we needed.
  2. EKS Automation Gaps: AWS EKS clusters had no automated node rotation for new AMI releases.
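
To make the first challenge concrete, here is a minimal sketch of what environment-specific rotation planning could look like, as opposed to rotating every node strictly one at a time. It is an illustration only and not the implementation described in the original article: the `Node` type, the `BATCH_SIZE` table, and the `plan_rotation` function are hypothetical names, and the batch sizes are assumed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A cluster node and the AMI it is currently running (hypothetical model)."""
    name: str
    ami_id: str


# Assumed per-environment batch sizes; these values are illustrative only.
BATCH_SIZE = {"dev": 3, "staging": 2, "prod": 1}


def plan_rotation(nodes, target_ami, environment):
    """Group nodes that are not on target_ami into rotation batches.

    Unknown environments fall back to a batch size of 1, which is
    equivalent to fully sequential rotation.
    """
    outdated = [n for n in nodes if n.ami_id != target_ami]
    size = BATCH_SIZE.get(environment, 1)
    return [outdated[i:i + size] for i in range(0, len(outdated), size)]


if __name__ == "__main__":
    nodes = [
        Node("node-a", "ami-old"),
        Node("node-b", "ami-new"),
        Node("node-c", "ami-old"),
        Node("node-d", "ami-old"),
    ]
    for batch in plan_rotation(nodes, "ami-new", "staging"):
        print([n.name for n in batch])
```

The sketch only plans batches; it does not drain or replace nodes. Keeping the plan separate from the rotation itself means the same logic could, in principle, drive both kops and EKS clusters.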

Our Solution

We developed a comprehensive solution combining existing tools with new implementations:

Tools and Implementation

Workflow Improvements

For kops clusters:

For AWS EKS clusters:

Impact and Benefits

This automation initiative:


This post summarizes our detailed engineering blog post. For complete technical details and implementation specifics, please visit the original article.