Infra Autopilot — Kubernetes Cost Optimizer

The Problem

Kubernetes clusters are notoriously over-provisioned. Engineers set resource requests and limits conservatively at deploy time and never revisit them, so teams end up paying for CPU and memory that sits idle. At my employer, we were spending $12,000/month on a cluster where, by actual workload metrics, about 60% of capacity sat unused.

My Approach

I analyzed our Prometheus metrics and found that requested resources consistently exceeded actual peak usage by a factor of 3 to 5. Rather than run a one-time manual audit, I wanted a repeatable, automated tool that any team could run.

The tool needed to: (1) query Prometheus for historical utilization data, (2) apply statistical analysis (P95 usage rather than averages, so that burst traffic is accounted for), (3) generate Helm chart patches with new resource specs, and (4) estimate monthly cost savings in dollar terms.

I wrote it in Go so it ships as a single binary; engineers hate dependency-management hell in their dev tools.

What I Built

autopilot scan — Connects to the cluster, queries Prometheus, and generates a JSON report of every deployment with actual vs. requested resources and a confidence score.

autopilot recommend — Processes the scan report and outputs ready-to-apply values.yaml patches with right-sized requests/limits, using P95 historical usage + a configurable safety margin.
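
To make the shape of such a patch concrete, here is a minimal sketch of rendering one. It assumes the chart reads a top-level resources key (the layout that helm create scaffolds); charts that nest resources elsewhere would need a different template. The types, function name, and numbers are hypothetical and not taken from the tool.

```go
package main

import "fmt"

// Resources is a CPU/memory pair in Kubernetes quantity units.
type Resources struct {
	CPUMillicores int // rendered as e.g. "250m"
	MemoryMiB     int // rendered as e.g. "384Mi"
}

// Recommendation holds right-sized values for one deployment
// (hypothetical type for this sketch).
type Recommendation struct {
	Requests Resources
	Limits   Resources
}

// valuesPatch renders a values.yaml fragment, assuming the chart
// exposes a top-level `resources` key.
func valuesPatch(r Recommendation) string {
	return fmt.Sprintf(
		"resources:\n"+
			"  requests:\n"+
			"    cpu: %dm\n"+
			"    memory: %dMi\n"+
			"  limits:\n"+
			"    cpu: %dm\n"+
			"    memory: %dMi\n",
		r.Requests.CPUMillicores, r.Requests.MemoryMiB,
		r.Limits.CPUMillicores, r.Limits.MemoryMiB,
	)
}

func main() {
	rec := Recommendation{
		Requests: Resources{CPUMillicores: 250, MemoryMiB: 384},
		Limits:   Resources{CPUMillicores: 500, MemoryMiB: 512},
	}
	fmt.Print(valuesPatch(rec))
}
```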

autopilot estimate — Takes the current cluster bill (pulled from AWS Cost Explorer API) and projects monthly savings from applying the recommendations.
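
One simple way to project savings is to assume the bill scales linearly with requested capacity. That is a simplifying assumption made for this sketch, not necessarily how the tool computes it: real bills also depend on node sizes, bin-packing, and non-compute charges. The function name and the core counts below are hypothetical, chosen only so the arithmetic is easy to follow.

```go
package main

import "fmt"

// projectSavings estimates monthly savings in USD, assuming cost is
// proportional to requested capacity (a deliberate simplification).
func projectSavings(monthlyBillUSD, currentRequested, recommendedRequested float64) float64 {
	if currentRequested <= 0 || recommendedRequested >= currentRequested {
		return 0 // nothing to save, or no baseline to compare against
	}
	return monthlyBillUSD * (currentRequested - recommendedRequested) / currentRequested
}

func main() {
	// Illustrative inputs: a $12,000 bill, with requests cut from 100 to 65 cores.
	saved := projectSavings(12000, 100, 65)
	fmt.Printf("projected savings: $%.2f/month\n", saved) // projected savings: $4200.00/month
}
```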

autopilot apply — Dry-run by default. With --confirm, applies patches via helm upgrade.

Results

Reduced our team’s AWS EKS bill by $4,200/month (a 35% reduction)
340+ GitHub stars within 6 months, 80+ forks
Adopted by 3 other engineering teams internally after posting on internal Slack
Contributed a Datadog metrics adapter via a community PR

Key Learnings

Statistical sampling matters enormously. The first version used average utilization, which produced recommendations that were dangerously low for spiky workloads: a single burst could push a pod past its memory limit and get it OOM-killed. Switching to P95 over a 30-day window with a 20% safety margin eliminated all false positives.