Etcd storage

🔍 Quick Overview: Struggling with bloated etcd memory and disk usage? Join me as I diagnose conflicting metrics, discover a zero-downtime defragmentation tool, and automate a CronJob that slashed my etcd footprint from gigabytes to megabytes, keeping my Kubernetes control plane lean and highly available.

🌱 Early Kubernetes Journey with Kubespray

I first dipped my toes into Kubernetes by choosing Kubespray for cluster provisioning. Its Ansible-based playbooks and curated community roles enabled me to stand up a fully functional cluster with minimal YAML and manual steps.

The Initial Cluster

Control Plane

  • Single master node hosting the API Server, Controller Manager, Scheduler, and etcd.

Worker Nodes

  • Three nodes running application pods behind a basic Cilium CNI overlay.

This lean setup helped me understand pod scheduling, Services, DNS, and rolling updates without the complexity of HA or custom manifests.

Pro Tip

Leverage Kubespray's inventory and group_vars to customize networking and storage classes from day one.
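
For instance, cluster-wide networking choices live in the k8s_cluster group_vars. The snippet below is a minimal sketch with illustrative values; the exact variables available depend on your Kubespray release:

# inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml -- illustrative values
kube_network_plugin: cilium            # CNI overlay used by the worker nodes
kube_service_addresses: 10.233.0.0/18  # Service network CIDR
kube_pods_subnet: 10.233.64.0/18       # Pod network CIDR
container_manager: containerd          # container runtime on all nodes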


⚙️ Scaling to High-Availability

As production workloads increased, I upgraded to high availability:

3-Node HA Control Plane

  • Added two extra master nodes behind an external load balancer.
  • Kubespray's built-in etcd clustering and certificate generation simplified the process.

4 Worker Nodes

  • Expanded from three to four workers to balance increased CPU and memory demands.

This transition eliminated single points of failure, allowed master node upgrades with zero downtime, and boosted scheduling capacity.
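
The inventory for this layout looked roughly like the sketch below. Hostnames and addresses are illustrative; the group names follow Kubespray's current conventions (kube_control_plane, kube_node, etcd, k8s_cluster):

# inventory/mycluster/hosts.yaml -- illustrative hosts and addresses
all:
  hosts:
    k8s-master-1: { ansible_host: 10.0.10.11 }
    k8s-master-2: { ansible_host: 10.0.10.12 }
    k8s-master-3: { ansible_host: 10.0.10.13 }
    k8s-worker-1: { ansible_host: 10.0.10.21 }
    k8s-worker-2: { ansible_host: 10.0.10.22 }
    k8s-worker-3: { ansible_host: 10.0.10.23 }
    k8s-worker-4: { ansible_host: 10.0.10.24 }
  children:
    kube_control_plane:
      hosts:
        k8s-master-1:
        k8s-master-2:
        k8s-master-3:
    etcd:
      hosts:
        k8s-master-1:
        k8s-master-2:
        k8s-master-3:
    kube_node:
      hosts:
        k8s-worker-1:
        k8s-worker-2:
        k8s-worker-3:
        k8s-worker-4:
    k8s_cluster:
      children:
        kube_control_plane:
        kube_node: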


📊 Evolution of Monitoring

To maintain visibility as my cluster grew, I adopted a full observability stack leveraging the Prometheus Operator and Grafana:

Grafana Alloy

  • Automatic discovery and scraping of etcd, the API server, kubelet, CoreDNS, and all custom application endpoints.

Grafana Mimir

  • Durable, scalable time-series store for high-cardinality metrics with long retention.

With this stack, I tracked etcd-specific metrics like etcd_server_has_leader, compaction stats, and database sizes alongside application KPIs.


🔍 The etcd Memory & Size Anomaly

Despite seemingly healthy dashboards, I observed conflicting metrics:

Metric                              | Before        | Notes
RSS memory per etcd pod             | 1.65 GiB      | High baseline even with < 50 MiB of data
db_total_size_in_bytes              | 1.45 GiB      | Matches the RSS inflation
db_total_size_in_use_in_bytes       | 45 MiB        | Only ~3% of the total space actually used
Dashboard header (capacity / usage) | 560 MiB / 73% | Seemed capped and outdated

Why This Matters

Memory Footprint: etcd memory-maps the entire Bolt DB file, so free but unreclaimed pages still inflate the RSS.

Latency & Recovery: Large DB files slow down compaction, defragmentation, and member recovery after restarts or leader changes.

Quotas & Alarms: Risk of hitting backend quotas (etcd_server_quota_backend_bytes) unexpectedly.

Reconciling these metrics was crucial to ensure control-plane stability.
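
For reference, the quota that etcd_server_quota_backend_bytes reports is set on the etcd members themselves, and in a Kubespray cluster it can be pinned via group_vars. This is a hedged sketch, assuming variable names along the lines of etcd_quota_backend_bytes and etcd_compaction_retention; check the etcd role defaults in your Kubespray release:

# inventory/mycluster/group_vars/all/etcd.yml -- assumed variable names, illustrative values
etcd_quota_backend_bytes: "2147483648"  # backend quota in bytes (~2 GiB, the usual default)
etcd_compaction_retention: "8"          # hours of revision history kept before auto-compaction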


🛠️ Solution Strategy

Community Research

  • Scoured GitHub issues, Kubernetes Slack, and tech blogs.
  • Consensus: defragmenting the Bolt DB is necessary to reclaim free pages.

Trying etcdctl defrag

etcdctl defrag \
  --endpoints=<masters> \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

  • Downside: Defragmentation locks the Bolt DB, blocking reads and writes on the member while it runs and risking availability.

Discovering ahrtr/etcd-defrag

  • Containerized job that defragments cluster members one at a time, keeping the cluster available throughout.
  • Rule-based triggers, so automated runs only defragment when it is actually worthwhile.

🚀 Implementing Defragmentation with a CronJob

To automate defrag only when necessary, I deployed this CronJob to run at 09:14 on weekdays:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-defrag
spec:
  schedule: "14 9 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true # Access etcd on localhost
          restartPolicy: OnFailure
          securityContext:
            runAsUser: 0
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
            - operator: Exists
              effect: NoExecute
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: node-role.kubernetes.io/control-plane
                        operator: Exists
          containers:
            - name: etcd-defrag
              image: ghcr.io/ahrtr/etcd-defrag:v0.26.0
              args:
              - --endpoints=https://127.0.0.1:2379
              - --cacert=/ca.pem
              - --cert=/member-k8s-master-2.pem
              - --key=/member-k8s-master-2-key.pem
              - --cluster
              - --defrag-rule
              - "dbQuotaUsage > 0.8 || dbSize - dbSizeInUse > 200*1024*1024"
              volumeMounts:
              - name: ca-crt
                mountPath: /ca.pem
                readOnly: true
              - name: client-crt
                mountPath: /member-k8s-master-2.pem
                readOnly: true
              - name: client-key
                mountPath: /member-k8s-master-2-key.pem
                readOnly: true
          volumes:
          - name: ca-crt
            hostPath:
              path: /etc/ssl/etcd/ssl/ca.pem
              type: File
          - name: client-crt
            hostPath:
              path: /etc/ssl/etcd/ssl/member-k8s-master-2.pem
              type: File
          - name: client-key
            hostPath:
              path: /etc/ssl/etcd/ssl/member-k8s-master-2-key.pem
              type: File

Defrag Rule Explained

  • dbQuotaUsage > 0.8: Over 80% of the backend quota is in use
  • dbSize - dbSizeInUse > 200*1024*1024: More than 200 MiB of reclaimable space

With my numbers (dbSize ≈ 1.45 GiB, dbSizeInUse ≈ 45 MiB, a 2 GiB quota), the second condition alone (~1.4 GiB reclaimable) was enough to trigger a run, since dbQuotaUsage sat at only about 0.73.

🎉 Conclusion & Takeaways

The automated defragmentation ran smoothly, yielding immediate improvements:

  • Memory dropped from ~1.65 GiB → 250 MiB
  • DB Total Size shrank from ~1.45 GiB → 80 MiB
  • DB Used remained ~40 MiB
  • Dashboard now shows 2 GiB capacity & 4% usage

Maintain Defragmentation

  • Schedule regular defrag jobs to prevent silent bloat.
  • Monitor db_total_size_in_bytes and db_total_size_in_use_in_bytes with alerts (a sample rule is sketched below).
  • Keep etcd lean to ensure fast leader elections and API responsiveness.
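
Since the Prometheus Operator is already part of the stack, a rule along these lines can watch for re-accumulating bloat. This is a minimal sketch rather than the exact rule I run: it assumes the etcd_mvcc_* / etcd_server_* metric names exposed by recent etcd releases, an assumed monitoring namespace, and illustrative thresholds.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-bloat
  namespace: monitoring  # assumed namespace
spec:
  groups:
    - name: etcd-bloat
      rules:
        - alert: EtcdExcessiveReclaimableSpace
          # More than 200 MiB of the Bolt DB file is free but not yet reclaimed.
          expr: (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes) > 200 * 1024 * 1024
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "etcd member {{ $labels.instance }} has more than 200 MiB of reclaimable space"
        - alert: EtcdBackendQuotaAlmostFull
          # DB file is approaching the backend quota, i.e. NOSPACE-alarm territory.
          expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "etcd member {{ $labels.instance }} is above 80% of its backend quota"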