Etcd storage
🔍 Quick Overview: Struggling with bloated etcd memory and disk usage? Join me as I walk through diagnosing conflicting metrics, discover a zero-downtime defragmentation tool, and automate a CronJob that slashed my etcd footprint from gigabytes to megabytes, keeping my Kubernetes control plane lean and highly available.
🌱 Early Kubernetes Journey with Kubespray
I first dipped my toes into Kubernetes by choosing Kubespray for cluster provisioning. Its Ansible-based playbooks and curated community roles enabled me to stand up a fully functional cluster with minimal YAML and manual steps.
The Initial Cluster
Control Plane
- Single master node hosting the API Server, Controller Manager, Scheduler, and etcd.
Worker Nodes
- Three nodes running application pods behind a basic Cilium CNI overlay.
This lean setup helped me understand pod scheduling, Services, DNS, and rolling updates without the complexity of HA or custom manifests.
Pro Tip
Leverage Kubespray's inventory and group_vars to customize networking and storage classes from day one.
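As a rough sketch of what those overrides look like, here is a minimal group_vars excerpt. The variable names are taken from Kubespray's sample group_vars and may differ between releases; the values are illustrative, not my exact configuration.

```yaml
# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml (illustrative values)
kube_network_plugin: cilium            # CNI plugin used by this cluster
kube_service_addresses: 10.233.0.0/18  # Service network
kube_pods_subnet: 10.233.64.0/18       # Pod network

# inventory/mycluster/group_vars/k8s_cluster/addons.yml
local_path_provisioner_enabled: true   # simple default StorageClass for early workloads
```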
⚙️ Scaling to High-Availability
As production workloads increased, I upgraded to high availability:
3-Node HA Control Plane
- Added two extra master nodes behind an external load balancer.
- Kubespray's built-in etcd clustering and certificate generation simplified the process.
4 Worker Nodes
- Expanded from three to four workers to balance increased CPU and memory demands.
This transition eliminated single points of failure, allowed master node upgrades with zero downtime, and boosted scheduling capacity.
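For reference, a Kubespray hosts.yaml for this 3-master / 4-worker topology looks roughly like the sketch below. Hostnames and addresses are placeholders, and group names have changed across Kubespray releases (older versions used kube-master / kube-node).

```yaml
# inventory/mycluster/hosts.yaml (illustrative)
# The external API load balancer is configured separately,
# e.g. via apiserver_loadbalancer_domain_name in group_vars/all/all.yml.
all:
  hosts:
    k8s-master-1: {ansible_host: 10.0.0.11, ip: 10.0.0.11}
    k8s-master-2: {ansible_host: 10.0.0.12, ip: 10.0.0.12}
    k8s-master-3: {ansible_host: 10.0.0.13, ip: 10.0.0.13}
    k8s-worker-1: {ansible_host: 10.0.0.21, ip: 10.0.0.21}
    k8s-worker-2: {ansible_host: 10.0.0.22, ip: 10.0.0.22}
    k8s-worker-3: {ansible_host: 10.0.0.23, ip: 10.0.0.23}
    k8s-worker-4: {ansible_host: 10.0.0.24, ip: 10.0.0.24}
  children:
    kube_control_plane:
      hosts:
        k8s-master-1:
        k8s-master-2:
        k8s-master-3:
    etcd:
      hosts:
        k8s-master-1:
        k8s-master-2:
        k8s-master-3:
    kube_node:
      hosts:
        k8s-worker-1:
        k8s-worker-2:
        k8s-worker-3:
        k8s-worker-4:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node:
```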
📊 Evolution of Monitoring
To maintain visibility as my cluster grew, I adopted a full observability stack leveraging the Prometheus Operator and Grafana:
Grafana Alloy
- Automatically discovers and scrapes etcd, API server, kubelet, CoreDNS, and all custom application endpoints.
Grafana Mimir
- Durable, scalable time-series store for high-cardinality metrics with long retention.
With this stack, I tracked etcd-specific metrics like etcd_server_has_leader, compaction stats, and database sizes alongside application KPIs.
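As an illustration of what that etcd-specific alerting can look like with the Prometheus Operator, here is a minimal PrometheusRule sketch for the leader metric. The job label and the monitoring namespace are assumptions about my scrape setup, not anything fixed by etcd itself.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-availability
  namespace: monitoring              # assumed monitoring namespace
spec:
  groups:
    - name: etcd.availability
      rules:
        - alert: EtcdMemberHasNoLeader
          # etcd_server_has_leader is 1 while the member sees a leader, 0 otherwise
          expr: etcd_server_has_leader{job="etcd"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "etcd member {{ $labels.instance }} has no leader"
```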
🔍 The etcd Memory & Size Anomaly
Despite seemingly healthy dashboards, I observed conflicting metrics:
| Metric | Before | Notes |
| --- | --- | --- |
| RSS memory per etcd pod | 1.65 GiB | High baseline even with < 50 MiB of data |
| db_total_size_in_bytes | 1.45 GiB | Matches the RSS inflation |
| db_total_size_in_use_in_bytes | 45 MiB | Only ~3% of total space actually used |
| Dashboard header capacity / usage | 560 MiB / 73% | Seemed capped and outdated |
Why This Matters
- Index Growth: etcd memory-maps the entire Bolt DB and keeps a full in-memory key index, so pages holding no live data still bloat RAM.
- Leader Latency: large DB sizes slow down compaction and leader elections.
- Quotas & Alarms: risk of unexpectedly hitting the backend quota (etcd_server_quota_backend_bytes) and triggering a NOSPACE alarm.
Reconciling these metrics was crucial to ensure control-plane stability.
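One way to reconcile them is to derive two series that match the quantities discussed above; a hedged sketch follows (the expressions use the full metric names etcd exposes, which the table shortens, and the rule names are my own placeholders):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-capacity
  namespace: monitoring              # assumed monitoring namespace
spec:
  groups:
    - name: etcd.capacity
      rules:
        # Fraction of the backend quota occupied on disk (free pages included)
        - record: etcd:db_quota_usage:ratio
          expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes
        # Bytes a defragmentation could hand back to the filesystem
        - record: etcd:db_reclaimable:bytes
          expr: etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes
```

These two series mirror the dbQuotaUsage and dbSize - dbSizeInUse expressions used in the defrag rule later on.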
🛠️ Solution Strategy
Community Research
- Scoured GitHub issues, Kubernetes Slack, and tech blogs.
- Consensus: defragmenting the Bolt DB is necessary to reclaim free pages.
Trying etcdctl defrag
```bash
etcdctl defrag \
  --endpoints=<masters> \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```
- Downside: defragmentation locks the DB, blocking reads and writes on the member while it runs and risking availability.
Discovering ahrtr/etcd-defrag
- Containerized job that defragments each member sequentially.
- Because only one member is locked at a time, the cluster stays available, and rule-based triggers make it safe to automate.
🚀 Implementing Defragmentation with a CronJob
To automate defrag only when necessary, I deployed this CronJob to run at 09:14 on weekdays:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-defrag
spec:
  schedule: "14 9 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true        # Access etcd on localhost
          restartPolicy: OnFailure
          securityContext:
            runAsUser: 0
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
            - operator: Exists
              effect: NoExecute
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: node-role.kubernetes.io/control-plane
                        operator: Exists
          containers:
            - name: etcd-defrag
              image: ghcr.io/ahrtr/etcd-defrag:v0.26.0
              args:
                - --endpoints=https://127.0.0.1:2379
                - --cacert=/ca.pem
                - --cert=/member-k8s-master-2.pem
                - --key=/member-k8s-master-2-key.pem
                - --cluster
                - --defrag-rule
                - "dbQuotaUsage > 0.8 || dbSize - dbSizeInUse > 200*1024*1024"
              volumeMounts:
                - name: ca-crt
                  mountPath: /ca.pem
                  readOnly: true
                - name: client-crt
                  mountPath: /member-k8s-master-2.pem
                  readOnly: true
                - name: client-key
                  mountPath: /member-k8s-master-2-key.pem
                  readOnly: true
          volumes:
            - name: ca-crt
              hostPath:
                path: /etc/ssl/etcd/ssl/ca.pem
                type: File
            - name: client-crt
              hostPath:
                path: /etc/ssl/etcd/ssl/member-k8s-master-2.pem
                type: File
            - name: client-key
              hostPath:
                path: /etc/ssl/etcd/ssl/member-k8s-master-2-key.pem
                type: File
```
Defrag Rule Explained
- dbQuotaUsage > 0.8: over 80% of the backend quota is utilized.
- dbSize - dbSizeInUse > 200*1024*1024: more than 200 MiB of reclaimable space.
🎉 Conclusion & Takeaways
The automated defragmentation ran smoothly, yielding immediate improvements:
- Memory dropped from ~1.65 GiB → 250 MiB
- DB Total Size shrank from ~1.45 GiB → 80 MiB
- DB Used remained ~40 MiB
- Dashboard now shows 2 GiB capacity & 4% usage
Maintain Defragmentation
- Schedule regular defrag jobs to prevent silent bloat.
- Monitor db_total_size_in_bytes and db_total_size_in_use_in_bytes with alerts (see the sketch below).
- Keep etcd lean to ensure fast leader elections and API responsiveness.
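A hedged sketch of such an alert: it fires when a large amount of reclaimable space lingers, which would mean the scheduled defrag is not doing its job. The threshold mirrors the defrag rule above; the rule name and namespace are my own placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-bloat
  namespace: monitoring              # assumed monitoring namespace
spec:
  groups:
    - name: etcd.bloat
      rules:
        - alert: EtcdBackendNotDefragmented
          # Fires if more than 200 MiB stays reclaimable for a full day,
          # i.e. the scheduled defrag job is not keeping up.
          expr: |
            (etcd_mvcc_db_total_size_in_bytes
              - etcd_mvcc_db_total_size_in_use_in_bytes) > 200 * 1024 * 1024
          for: 24h
          labels:
            severity: warning
          annotations:
            summary: "etcd member {{ $labels.instance }} has >200 MiB of reclaimable space"
```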