
Adventures in tuning unicorn for Kubernetes

There isn’t much detailed info in the wild about folks running Ruby on Rails on Kubernetes (k8s); the Venn diagram for the respective communities doesn’t have a ton of overlap. There are some guides that cover the basics, enough to get a lab up and running, but not much at all about running a production environment. The situation is getting better slowly.

The #kubernetes-ruby Slack channel has all of ~220 people in it versus the thousands found in channels for other languages. If you scroll through the history, you’ll find that most of the questions and responses cover Day 0/1 issues – “How do we make it run?”-type problems.

So, I thought it would be worthwhile to share a bit of my own experience trying to optimize a RoR deployment in Kubernetes, if only to save another Ops-person from stumbling through the same mistakes and dead-ends that I encountered.

Some background

The initial move of our RoR app from Heroku to Kubernetes was relatively straightforward. Much of the effort went into stress-testing the auto-scaling and resource config to find behavior that felt OK-enough to start.

Part of that tweaking required making our worker/container config less dense to be cluster-friendly and to provide smooth scaling, versus the somewhat bursty scale-up/down we had seen when running a high process count on Heroku dynos.

Generally, small, numerous containers running across many nodes is what you want in a clustered setting. This optimizes for resiliency and efficient container packing. It can also make scaling up and down very smooth.

We settled on an initial config of 2 unicorn (Rails app server) processes per container with 768MB of RAM and 1000 CPU millicores – maxing out at a few hundred k8s pods. This felt wrong, since it abstracts traditional unicorn optimization practice up one level (normally, more workers per unicorn socket means better routing efficiency), but it seemed to perform OK, and denser configs (confusingly) appeared to perform worse within the cluster. It also jibed with the limited documentation we could find from other folks making similar migrations.
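
For illustration, the relevant slice of the pod spec looked something like the sketch below. This is an approximation rather than our actual manifest; the env var is a hypothetical stand-in for however you feed the worker count into config/unicorn.rb, and the request/limit split is assumed.

resources:
  requests:
    memory: "768Mi"
    cpu: "1000m"          # 1 core per pod
  limits:
    memory: "768Mi"
    cpu: "1000m"
env:
- name: WEB_CONCURRENCY   # hypothetical; read by config/unicorn.rb to set worker_processes
  value: "2"              # 2 unicorn workers per container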

The initial goal was satisfied – get it moved with limited downtime and acceptable performance, where “acceptable” was basically “what it was doing on Heroku”. In fact, it seemed to be running a bit better than it had on Heroku.

Slower than it should be

Fast forward a year and tens of thousands more requests per minute. Our team decided we wanted to introduce service level indicators/objectives (SLIs/SLOs) to inform product and infrastructure work. We chose a target for latency and started tracking it. We were doing OK with regard to the target, but not where we felt we could be (I personally wanted a bit more buffer room), so we started digging into the causes of slowness within the stack.

It immediately became apparent that we were blind to network performance across some layers of the stack. We were tracking app/database level latency and could derive some of the latency values for other tiers via logs, but the process was cumbersome and too slow for real-time iteration and config tweaking.

A co-worker noticed we were missing an X-Request-Start header in our APM telemetry. We added it to our reverse-proxy (nginx) config and discovered a higher-than-expected amount of request queuing between nginx and unicorn.
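
For reference, the fix is a one-line nginx directive that stamps each request with the time nginx forwarded it, so the APM agent in the Rails app can compute queue time. A minimal sketch, assuming the nginx config fragment gets shipped to the proxy via a ConfigMap (the ConfigMap and file names are made up):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-request-start          # hypothetical name
data:
  request-start.conf: |
    # stamp requests on the way through nginx; the APM compares this
    # timestamp with when the app starts processing to derive queue time
    proxy_set_header X-Request-Start "t=${msec}";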

That kicked off a round of experiments with nginx, unicorn, and container configs. Some of these provided minor benefit. Boosting the number of app pods reduced the request queuing but also wasted a lot of resources and eventually plateaued. Increasing worker density was minimally successful. We went up to 3 workers w/ 1GB of RAM and saw better performance, but going past that yielded diminishing returns, even when increasing the pod request/limits in parallel.

Network captures weren’t very helpful. Neither were Prometheus metric scrapes (at least not to the degree that I was able to make sense of the data). As soon as requests entered the k8s proxy network, we were blind to intra-cluster comms until they hit the pod on the other side of the connection. Monitoring the unicorn socket didn’t show any obvious problems, but all the symptoms pointed to a bottleneck between nginx and unicorn, which is what you would see if connections were stacking up on the unicorn socket. We couldn’t verify that was actually the issue, though.

After investing quite a bit of time going down sysctl and other rabbit holes, we decided to set the problem aside again to focus on other work. The effort had yielded some performance improvements, and everything was performing “OK” versus our SLO.

Give it the juice

One of the goals of the SLI/SLO paradigm is to take measurements as close to the customer as is practical. In the spirit of that, we moved our latency measurement from the nginx layer to the CDN. We had avoided the CDN measurement previously because the time_taken value in AWS CloudFront logs is problematic and sensitive to slow client connections. However, AWS recently added an origin_latency value to their logs, making this tracking more practical and consistent.

Once we made the switch, we were sad. Measured from the new vantage point, we were performing much worse than expected; the closer we measured to the client, the worse it looked. This kicked off a new round of investigation.

Much of the unexpected latency was due to geography and TLS handshakes. We detailed some of the other causes and potential mitigations, listing unicorn config improvements as one of them. I set expectations low for those improvements, given how much time we had already invested there and how mixed the results had been.

But we gave it another go.

This time around, we introduced Linkerd, a k8s service mesh, into the equation. This gave us better visibility into intra-cluster network metrics. We were able to test/iterate in real time in our sandbox environment.
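
Getting the mesh’s proxy in front of the app pods is mostly a matter of annotations. A minimal sketch of enabling injection for a namespace (the namespace name is illustrative):

apiVersion: v1
kind: Namespace
metadata:
  name: rails-sandbox                # hypothetical namespace
  annotations:
    linkerd.io/inject: enabled       # Linkerd injects its proxy sidecar into pods created here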

We performed some experiments swapping out unicorn for puma (we hadn’t done this previously due to concerns about thread-safety, but it was safe enough to test in the sandbox context). Puma showed an immediate improvement over unicorn at low traffic and quickly removed any doubt that there was a big bottleneck tied directly to unicorn.
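
At the deployment level, the swap itself is small (the app also needs the puma gem and a puma config, of course). A sketch of the container spec, with hypothetical commands and paths rather than our actual chart values:

containers:
- name: web                          # hypothetical container name
  # unicorn variant
  command: ["bundle", "exec", "unicorn", "-c", "config/unicorn.rb"]
  # puma variant we tested in the sandbox (commented out here)
  # command: ["bundle", "exec", "puma", "-C", "config/puma.rb"]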

We carved out some time to spin up our stress-test infra and dug in with experiments at higher traffic levels. Puma performed well but also ran into diminishing returns pretty quickly when adding workers/threads/limits. While troubleshooting, we noticed that if we set a CPU limit of 2000 millicores, the pods would never use more than 1000. Something was preventing multi-core usage.

That something turned out to be a missing argument that I’ve so far found in only one place in the official k8s docs, and that was absent from every example deployment config I’ve come across to date.

apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"

“The args section of the configuration file provides arguments for the container when it starts. The -cpus "2" argument tells the Container to attempt to use 2 CPUs.”

Turns out, it doesn’t matter how many millicores you assign to a request/limit via k8s config if you don’t enable a corresponding number of cores via a container argument. The reason this argument is so lightly documented in the context of k8s is that it has little to nothing to do with k8s: -cpus is related to the Linux kernel’s cpuset mechanism and allows Docker (and other containerization tools) to configure the cgroup a container is running in with limit overrides or restrictions. I’d never had to use it before, so I knew nothing about it.

(╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻

(╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻

(╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻

So many tables flipped… not enough tables.

With a higher worker count, higher CPU/RAM limits, AND a container CPU assignment override, unicorn actually performed better than puma (this can be the case when your app is not I/O-constrained). Almost all of our request queuing went away.

We eventually settled on 8 workers per pod with a 2000/4000 millicore CPU request/limit and a 2048/3584 MB RAM request/limit as a nice compromise between density and resiliency, and saw an average 50ms improvement in our p95 response time. (It’s possible we’ll tweak this further as we monitor performance over time.)
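
In manifest terms, the final shape was roughly the following. Again, a sketch: memory is expressed here in Mi, and the env var is a hypothetical stand-in for however the worker count reaches config/unicorn.rb.

resources:
  requests:
    cpu: "2000m"          # 2 cores guaranteed per pod
    memory: "2048Mi"
  limits:
    cpu: "4000m"          # allowed to burst to 4 cores
    memory: "3584Mi"
env:
- name: WEB_CONCURRENCY   # hypothetical; sets unicorn worker_processes
  value: "8"              # 8 unicorn workers per pod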

The issue had been a unicorn socket routing bottleneck the entire time, just as we had suspected earlier on. The missing piece was the CPU override argument.

Note: the default value for -cpus is ‘unlimited’. In our case, something (like a host param) was overriding that.

What have we learned

There are a few things worth taking away here, to my mind.

  1. Yet again, k8s is not a panacea. It will not solve all your problems. It will not make everything magically faster. In fact, in some cases, it may make performance worse.
  2. Before you run out and start installing it willy-nilly, Linkerd (or service meshes in general) will not solve all your problems either. It was helpful in this context for enabling some troubleshooting, but it actually caused problems during stress testing when we saturated the linkerd proxy sidecars, which in turn caused requests to fail entirely. I ended up pulling the sidecar injection during testing rather than fiddling with additional resource allocation to make it work correctly.
  3. For all the abstraction of the underlying infrastructure that k8s provides, at the end of the day, it’s still an app running on an OS. Knowledge and configuration of the underlying stack remains critical to your success. You will continually encounter surprises where your assumptions about what is and is not handled by k8s automagic are wrong.
  4. Layering of interdependent configurations can be a nightmare to troubleshoot and can make identifying (and building confidence in) root causes feel almost impossible. Every layer you add increases complexity and difficulty exponentially. Expertise in individual technologies (nginx, unicorn, Linux, k8s, etc) helps, but isn’t enough. Understanding how different configurations interact with one another across different layers in different contexts presents significant challenges.

Photo by Joen Patrick Caagbay on Unsplash