Infra Autopilot — Kubernetes Cost Optimizer
An open-source CLI tool that analyzes Kubernetes cluster resource utilization and generates right-sizing recommendations, reducing cloud costs by 20-45%.
The Problem
Kubernetes clusters are notoriously over-provisioned. Engineers set resource requests and limits conservatively at deploy time and never revisit them, so teams end up paying for CPU and memory that sits idle. At my employer, we were spending $12,000/month on a cluster that actual workload metrics showed to be 60% underutilized.
My Approach
I analyzed our Prometheus metrics and found that the gap between requested resources and actual peak usage was consistently 3-5x. Rather than a one-time manual audit, I wanted a repeatable, automated tool that any team could run.
The tool needed to: (1) query Prometheus for historical utilization data, (2) apply statistical analysis (P95 usage, not averages — to catch burst traffic), (3) generate Helm chart patches with new resource specs, and (4) estimate monthly cost savings in dollar terms.
I wrote it in Go so that it ships as a single binary; engineers hate dependency-management hell for dev tools.
What I Built
autopilot scan — Connects to the cluster, queries Prometheus, and generates a JSON report of every deployment with actual vs. requested resources and a confidence score.
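The exact queries that scan issues are not shown in this write-up. As a rough sketch, a P95 CPU query for one workload could be built as below. It assumes the cluster exposes cAdvisor's container_cpu_usage_seconds_total metric and that a workload's pods share a name prefix; the function name p95CPUQuery and the matching rule are hypothetical, not the tool's actual code.

```go
package main

import "fmt"

// p95CPUQuery builds a PromQL expression for the 95th percentile of
// per-container CPU usage (in cores) over a lookback window.
// Assumptions: cAdvisor's container_cpu_usage_seconds_total is scraped, and
// pods of one workload share a name prefix (a hypothetical matching rule).
func p95CPUQuery(namespace, podPrefix, window string) string {
	return fmt.Sprintf(
		`quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{namespace=%q, pod=~%q, container!=""}[5m])[%s:5m])`,
		namespace, podPrefix+".*", window,
	)
}

func main() {
	// Example: P95 CPU over the last 30 days for pods named checkout-api-*.
	fmt.Println(p95CPUQuery("payments", "checkout-api", "30d"))
}
```

The subquery form `[30d:5m]` evaluates the 5-minute rate at 5-minute steps across the window, so the percentile is taken over a usage series rather than over raw counters.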
autopilot recommend — Processes the scan report and outputs ready-to-apply values.yaml patches with right-sized requests/limits, using P95 historical usage + a configurable safety margin.
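The sizing rule described here (P95 plus a configurable safety margin) can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: the names rightSize and valuesPatch, the rounding steps, and the shape of the values.yaml fragment are all hypothetical, and only requests are rendered because the write-up does not say how limits are derived.

```go
package main

import (
	"fmt"
	"math"
)

// rightSize returns observed P95 usage plus a safety margin, rounded up to a
// multiple of step (for example 10 millicores or 16 MiB).
// The rounding step is an assumption made for tidy output.
func rightSize(p95, margin float64, step int) int {
	raw := p95 * (1 + margin)
	return int(math.Ceil(raw/float64(step))) * step
}

// valuesPatch renders a values.yaml fragment with the new requests.
// The key layout is a common Helm convention, assumed here.
func valuesPatch(cpuMilli, memMi int) string {
	return fmt.Sprintf(
		"resources:\n  requests:\n    cpu: %dm\n    memory: %dMi\n",
		cpuMilli, memMi,
	)
}

func main() {
	cpu := rightSize(180, 0.20, 10) // P95 of 180m CPU, 20% margin
	mem := rightSize(300, 0.20, 16) // P95 of 300Mi memory, 20% margin
	fmt.Print(valuesPatch(cpu, mem))
}
```

Rounding up rather than to the nearest step keeps the margin from being silently eroded.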
autopilot estimate — Takes the current cluster bill (pulled from AWS Cost Explorer API) and projects monthly savings from applying the recommendations.
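How the projection is computed is not spelled out here. One simple model, shown as a sketch, assumes the bill scales linearly with total requested capacity, which ignores fixed costs and node bin-packing. The function name projectedMonthlySavings and the capacity figures in main are illustrative inputs, not the cluster's real numbers.

```go
package main

import "fmt"

// projectedMonthlySavings estimates savings from shrinking total requests,
// assuming cost is proportional to requested capacity (a simplification).
func projectedMonthlySavings(monthlyBill, requested, recommended float64) (float64, error) {
	if monthlyBill < 0 || requested <= 0 || recommended < 0 {
		return 0, fmt.Errorf("invalid input: bill=%v requested=%v recommended=%v",
			monthlyBill, requested, recommended)
	}
	if recommended >= requested {
		return 0, nil // nothing to save; growing a request is not a saving
	}
	return monthlyBill * (1 - recommended/requested), nil
}

func main() {
	// Illustrative: a $12,000 bill with requests cut from 100 to 65 cores.
	savings, err := projectedMonthlySavings(12000, 100, 65)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("projected savings: $%.0f/month\n", savings)
}
```

With these inputs the model gives a 35% reduction, which is the same proportion as the $4,200 saving on a $12,000 bill reported under Results.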
autopilot apply — Dry-run by default. With --confirm, applies patches via helm upgrade.
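The dry-run-by-default behavior can be sketched with Go's standard flag package. The helper names (parseConfirm, helmArgs, describe) and the exact Helm invocation are assumptions for illustration; the sketch relies only on helm upgrade's -f and --dry-run flags.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// parseConfirm parses the apply subcommand's arguments and reports whether
// --confirm was given. The default is false, so a bare invocation is a dry run.
func parseConfirm(argv []string) (bool, error) {
	fs := flag.NewFlagSet("apply", flag.ContinueOnError)
	confirm := fs.Bool("confirm", false, "apply patches instead of a dry run")
	if err := fs.Parse(argv); err != nil {
		return false, err
	}
	return *confirm, nil
}

// helmArgs builds arguments for `helm upgrade`, adding --dry-run unless the
// caller confirmed. Release, chart, and patch file names are hypothetical.
func helmArgs(release, chart, patchFile string, confirm bool) []string {
	args := []string{"upgrade", release, chart, "-f", patchFile}
	if !confirm {
		args = append(args, "--dry-run")
	}
	return args
}

// describe renders the command line for display before anything runs.
func describe(args []string) string {
	return fmt.Sprintf("helm %s", strings.Join(args, " "))
}

func main() {
	confirm, err := parseConfirm([]string{}) // no --confirm given
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(describe(helmArgs("web", "./chart", "patch.yaml", confirm)))
}
```

Making the destructive path opt-in means a mistyped or incomplete command can only produce a preview.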
Results
- Reduced our team’s AWS EKS bill by $4,200/month (a 35% reduction from $12,000/month)
- 340+ GitHub stars within 6 months, 80+ forks
- Adopted by 3 other engineering teams internally after posting on internal Slack
- Contributed a Datadog metrics adapter via a community PR
Key Learnings
Statistical sampling matters enormously. The first version sized resources from average utilization, which produced recommendations that were dangerously low for spiky workloads: a burst would OOM-kill pods. Switching to P95 over a 30-day window with a 20% safety margin eliminated all of these unsafe recommendations.
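A small sketch of why the average misleads on spiky workloads. The sample series is made up for illustration, and the nearest-rank percentile used here is one common definition, not necessarily the one the tool uses.

```go
package main

import (
	"fmt"
	"sort"
)

// mean returns the arithmetic mean, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// percentile returns the p-th percentile (1 to 100) using the nearest-rank
// method, or 0 for an empty slice. The input slice is not modified.
func percentile(xs []float64, p int) float64 {
	if len(xs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), xs...)
	sort.Float64s(sorted)
	rank := (p*len(sorted) + 99) / 100 // integer ceiling of p*n/100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical memory samples in MiB: mostly 100, with two bursts to 400.
	samples := make([]float64, 0, 20)
	for i := 0; i < 18; i++ {
		samples = append(samples, 100)
	}
	samples = append(samples, 400, 400)

	const margin = 0.20
	fmt.Printf("average-based size: %.0f MiB\n", mean(samples)*(1+margin))
	fmt.Printf("P95-based size:     %.0f MiB\n", percentile(samples, 95)*(1+margin))
}
```

On this series the average-based size (about 156 MiB) sits far below the 400 MiB bursts, while the P95-based size (about 480 MiB) covers them.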