18 min read
By James Park

Kubernetes Battle Scars: What They Don't Tell You

War stories from managing 200+ microservices in production. From OOMKilled pods to certificate renewal nightmares.

Kubernetes promises to solve all your infrastructure problems. In reality, it introduces new ones you never imagined. After three years managing 200+ microservices in production, here are the lessons we learned the hard way.

Resource limits will bite you. We thought setting memory limits was straightforward until we discovered the difference between working-set and RSS memory: the kernel counts more than RSS against a container's limit, so dashboards built on RSS showed headroom that didn't exist, and pods were OOMKilled anyway. Our fix: monitor actual memory usage patterns over weeks, set limits at the 99th percentile, and use quality-of-service classes deliberately.
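As a sketch of the last point, a pod whose requests equal its limits lands in the Guaranteed QoS class and is evicted last under node memory pressure (names and numbers here are illustrative, not from our actual manifests):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server            # hypothetical workload
spec:
  containers:
  - name: api
    image: example/api:1.0
    resources:
      requests:
        memory: "512Mi"       # derived from observed p99 usage, not a guess
        cpu: "500m"
      limits:
        memory: "512Mi"       # requests == limits => Guaranteed QoS class
        cpu: "500m"
```

If requests are set but lower than limits, the pod is only Burstable and becomes an earlier eviction candidate.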

Networking is a black box until it breaks. We spent two days debugging intermittent connection failures before discovering that our CNI plugin had a race condition during pod restarts. Network policies seemed simple in testing but caused mysterious blocking in production. Documentation is a good start, but reading the plugin's actual implementation code is what finally solved it.
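A concrete example of the "mysterious blocking" class of failure: a default-deny egress policy silently breaks DNS unless you allow it explicitly, so every connection by hostname fails even though the policy looks harmless in a test namespace. A hedged sketch (policy name is hypothetical; the kube-dns label assumes a standard cluster DNS deployment):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress-with-dns   # hypothetical name
  namespace: prod
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:                      # carve-out for cluster DNS; omit this and
    - namespaceSelector: {}  # all lookups time out under default-deny
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```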

Certificate management is a nightmare. We have hit Let's Encrypt rate limits, watched expired certificates bring down ingress controllers, and seen cert-manager fail silently. We now have monitoring specifically for certificate expiration, automated renewal testing in staging, and backup certificate provisioning workflows.
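Expiration monitoring can be done off cert-manager's own metrics. A sketch, assuming you scrape cert-manager with Prometheus and run the Prometheus Operator (the rule name and threshold are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: certificate-expiry            # hypothetical name
spec:
  groups:
  - name: certificates
    rules:
    - alert: CertificateExpiringSoon
      # fires when a watched certificate has under 14 days left
      expr: |
        certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.name }} expires in under 14 days"
```

Alerting on the metric rather than on renewal logs is what catches the silent-failure case: if cert-manager stops renewing, the timestamp simply stops moving.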

Storage is where stateful applications go to die. PersistentVolume provisioning is straightforward until you need to resize volumes, migrate between storage classes, or handle node failures. We learned to avoid stateful workloads in Kubernetes when possible, and when necessary, run them on dedicated node pools with extremely careful volume management.
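The two settings that mattered most for us can be sketched as a StorageClass that permits online resizing plus a stateful workload pinned to a dedicated, tainted node pool (all names, the AWS provisioner, and the taint key are illustrative assumptions):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: resizable-ssd
provisioner: kubernetes.io/aws-ebs   # assumes AWS; swap for your cloud
allowVolumeExpansion: true           # without this, PVC resize requests are rejected
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres                     # hypothetical stateful workload
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels: { app: postgres }
  template:
    metadata:
      labels: { app: postgres }
    spec:
      nodeSelector:
        pool: stateful               # only schedule onto the dedicated pool
      tolerations:
      - key: dedicated               # pool nodes carry a matching taint,
        value: stateful              # keeping stateless churn off them
        effect: NoSchedule
      containers:
      - name: postgres
        image: postgres:15
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: resizable-ssd
      resources:
        requests:
          storage: 100Gi
```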

Health checks seem simple but are critical to get right. Liveness probes that are too aggressive cause cascading failures. Readiness probes that succeed before applications are truly ready send traffic to unprepared pods. We instrument applications with dedicated health endpoints that check actual dependencies.
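The probe shape we converged on looks roughly like this (paths, port, and timings are illustrative): a lenient liveness probe that only restarts truly wedged processes, and a readiness probe against an endpoint that checks real dependencies before the pod takes traffic:

```yaml
livenessProbe:
  httpGet:
    path: /healthz/live       # hypothetical endpoint: "process is alive", no dependency checks
    port: 8080
  initialDelaySeconds: 30     # give slow starters time before the first check
  periodSeconds: 10
  failureThreshold: 6         # ~1 minute of failures before a restart
readinessProbe:
  httpGet:
    path: /healthz/ready      # hypothetical endpoint: verifies DB, cache, downstream deps
    port: 8080
  periodSeconds: 5
  failureThreshold: 2         # drop out of the Service endpoints quickly when unready
```

Keeping dependency checks out of the liveness probe is the key asymmetry: a flapping database should make pods unready, not trigger a restart storm.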

The biggest lesson? Kubernetes is a platform for building platforms. The complexity is real, but the operational benefits at scale justify the investment—if you have the engineering resources to do it properly.