Monitor available IPs with Lambda and Cloudwatch

I ran into a situation where I needed to keep track of available IPs related to an AWS EKS cluster and couldn’t find any off the shelf tooling in AWS or otherwise to do so.

Tangential gripe: The reason I needed the monitor is because EKS doesn’t support adding subnets to a cluster without re-creating it, and the initial subnets that were used were a little too small due to reasons. I wanted to have a sense of how much runway I had pending AWS fixing the gap or me implementing a workaround.

So, I cobbled together a Lambda function to pull the info and pipe it into Cloudwatch.

Gist here

I’m using tags to scope the subnets I want to track, rather than piping in everything – since Cloudwatch custom metrics cost money. But you could use whatever filters you wanted.
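The gist isn’t reproduced here, but the shape of the function is simple enough to sketch. Everything below (tag key, namespace, metric name) is illustrative, not the actual gist:

```python
def tag_filters(tags):
    """Convert a {key: value} dict into boto3 describe_subnets Filters."""
    return [{"Name": f"tag:{k}", "Values": [v]} for k, v in tags.items()]


def handler(event, context):
    import boto3  # available by default in the AWS Lambda Python runtime

    ec2 = boto3.client("ec2")
    cloudwatch = boto3.client("cloudwatch")

    # Only pull subnets carrying our tracking tag -- custom metrics cost money.
    subnets = ec2.describe_subnets(
        Filters=tag_filters({"ip-monitor": "true"})
    )["Subnets"]

    # One custom metric per subnet, dimensioned by SubnetId.
    cloudwatch.put_metric_data(
        Namespace="Custom/VPC",
        MetricData=[
            {
                "MetricName": "AvailableIpAddressCount",
                "Dimensions": [{"Name": "SubnetId", "Value": s["SubnetId"]}],
                "Value": s["AvailableIpAddressCount"],
                "Unit": "Count",
            }
            for s in subnets
        ],
    )
```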

After getting the data into Cloudwatch, I was quickly reminded that you can’t alarm off of multiple metrics, so I used a math expression (MIN) to group them instead. This works for up to 10 metrics (this post should really be titled “The numerous, random limitations of AWS”), which luckily worked for me in this instance.
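For the curious, a MIN-across-metrics alarm can be defined via the Metrics parameter of CloudWatch’s put_metric_alarm. A hedged sketch (all names and values here are illustrative):

```python
def min_available_ips_alarm(subnet_ids, threshold):
    """Build a put_metric_alarm payload that alarms when the MIN across
    per-subnet metrics drops below a threshold. CloudWatch metric math
    caps you at 10 metrics per expression."""
    metrics = [
        {
            "Id": f"m{i}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "Custom/VPC",
                    "MetricName": "AvailableIpAddressCount",
                    "Dimensions": [{"Name": "SubnetId", "Value": sid}],
                },
                "Period": 300,
                "Stat": "Minimum",
            },
            "ReturnData": False,  # inputs to the expression only
        }
        for i, sid in enumerate(subnet_ids)
    ]
    metrics.append({
        "Id": "lowest",
        "Expression": "MIN(METRICS())",  # the value the alarm evaluates
        "Label": "LowestAvailableIps",
        "ReturnData": True,
    })
    return {
        "AlarmName": "low-available-ips",
        "Metrics": metrics,
        "ComparisonOperator": "LessThanThreshold",
        "Threshold": threshold,
        "EvaluationPeriods": 1,
    }

# Usage: boto3.client("cloudwatch").put_metric_alarm(**min_available_ips_alarm([...], 50))
```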

Then I set up an alarm for the threshold I wanted and tested it – it worked. Fun times.

Photo by Taylor Kopel on Unsplash


Do NOT use wildcard alternate domains in AWS CloudFront

CloudFront configs allow for alternate domain names if you’d like to use a custom domain for your CDN distribution. You likely want that.

The alternate domain list can include wildcard subdomains, like *.example.com – see the docs. This is handy for dev and experiment environments so you don’t have to be constantly updating the config.

Imagine you have a scenario where you have a Route53 (DNS) & CloudFront config that looks like (domains genericized):

- app.example.com - ALIAS - CloudFrontDistroA
  (where app.example.com is defined as an alias on CloudFrontDistroA)
- dev.example.com - CNAME - app.example.com
- *.example.com - ALIAS - CloudFrontDistroB
  (where *.example.com is defined as an alias on CloudFrontDistroB)

What would you expect would happen when traffic went to dev.example.com?

If you’re not an insane person, you’d likely expect it to be directed to CloudFrontDistroA, because that’s how DNS CNAMEs work.

You’ve probably guessed by now that this isn’t what actually happens.

Per AWS support, when traffic is routed to a CloudFront point of presence (POP), it will ignore the DNS config and route to the most specific rule across the CloudFront distributions in your account.
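To make that concrete, here’s a toy model of the matching behavior – emphatically not CloudFront’s actual implementation, just an illustration of “most specific alternate-domain match wins, DNS be damned” using hypothetical distributions:

```python
import fnmatch

# Alternate-domain lists for two hypothetical distributions in one account.
DISTRIBUTIONS = {
    "CloudFrontDistroA": ["app.example.com"],
    "CloudFrontDistroB": ["*.example.com"],
}

def route(host):
    """Pick the distribution whose alternate domain most specifically
    matches the Host header, ignoring DNS entirely."""
    best = None
    for distro, aliases in DISTRIBUTIONS.items():
        for alias in aliases:
            if fnmatch.fnmatch(host, alias):
                # Longer pattern = more specific (toy heuristic).
                if best is None or len(alias) > len(best[1]):
                    best = (distro, alias)
    return best[0] if best else None
```

Even though DNS CNAMEs dev.example.com to app.example.com, the Host header is still dev.example.com, so the wildcard on DistroB wins.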

So, *.example.com will not only catch ANY hostname that doesn’t have an explicit mapping across all of your distributions’ alternate domain lists – it will also route ALL of that traffic to its own distribution, regardless of your DNS records. That makes a sort of sense if you’re thinking about a wildcard’s behavior in isolation. The first part at least; the second part is rubbish.

What this means in practice though is that one distribution’s configuration may affect another distribution in unexpected ways.

To AWS’ credit, this is documented… barely and for the inverse scenario.

“If you have overlapping alternate domain names in two distributions, CloudFront sends the request to the distribution with the more specific name match, regardless of the distribution that the DNS record points to. For example, www.example.com is more specific than *.example.com.”

I could see a use case for wanting an optional catch-all. I cannot fathom why anyone would desire this as the hard-coded default behavior. It completely fails the Principle of Least Surprise.

So my friends, do not use wildcard subdomains in your CloudFront Alternate Domain config. Learn from my mistakes. I was a fool for believing we could have nice things.

People Tech

Frequently asked tech career questions

I’ve reached the point in my career and have enough gray hair that people sometimes ask me for career advice. Usually they are in tech, trying to make a transition into a different role, or are trying to get into tech. I do my best to answer these questions because it is obvious that they have experienced some sort of trauma and/or have run out of better options if they are at the point where they can look into the melancholy and barely obscured madness behind my eyes and think “This is someone who can help me.”

To save others from having to brave the same journey, here is a small FAQ that reflects the common themes of those discussions.

How do I get started in X?

You just do the thing and keep doing it until someone is willing to pay you money to do it.

You become a writer by writing.

You become an engineer by engineering.

Unless you need a license (*I* won’t judge how you learned to remove appendixes, but others might.), you just need to start doing the thing.

This is not the answer people want or are looking for. But it really is that simple. There are no shortcuts or lifehacks. Simple != easy and that sometimes makes our brains sad. I’m sorry.

What tech do I need to learn to do X?

There are admittedly different fundamentals required for different types of jobs, but if you’re asking about specific technologies the answer is that it doesn’t really matter.

Python or Ruby? Doesn’t matter.

AWS or Azure? If you can figure out one, you can figure out the other.

If you have a specific role in mind, do some research on the thematic pillars of that role and pick 1-2 pieces of tech in each pillar to focus on. Do not go down the rabbit hole of infinite research trying to decide on the perfect lineup of tech to learn. That’s not a thing. You’ll just delay doing anything useful.

There is one “trick” I recommend here. People have mixed opinions about certifications, but if you can use studying for a target certification as a learning path, go for it. It may focus you a bit.

When is it OK to list X on my resume?

Have you made a thing or two that worked using X and can speak about it in the appropriate context? OK then, go for it unless the job was explicitly defined by X, like “Unity Developer” or “AutoCAD Tech”. The ancillary stuff is fine to list if it is relevant to the role and you have actually used it.

Is X just a word you saw somewhere? Do not put it on your resume. I generally don’t delight in making people squirm in interviews, but I will create a “learning moment” that may be uncomfortable if I catch someone keyword stuffing.

How do I answer interview questions about things I haven’t done without lying?

I have never lied in a job interview. I have, however, re-framed my experience in the context of the job I was applying for.

When it comes to the actual interview where you’re asked “Have you done X using Y?” and the answer is “No”, try going with “I’ve done A using B which is very similar to Y” – shifting the focus to how you solved a similar problem rather than the specific one they asked about. Granted, you can’t pull this trick for every question without coming off as evasive, but it’s a legit tool.

It’s also OK to not volunteer unprompted caveats like “I’ve done X… but never in production.” They don’t need to know that. You aren’t marrying this person. Just say “Yes” and move on.

Nailing an interview is more about being able to sell yourself than proving what you know, with the exception of coding interviews, which are a crap shoot (I prefer take-home exercises). The “selling” part feels icky to a lot of us, but it shouldn’t. All you are saying is “I’m worth hiring and would or could be good at this.” not “I am the platonic ideal of a human.” Selling your value isn’t lying, it’s being honest about your worth, which may require you to work on liking yourself.

If you don’t have *any* experience and have made it to the interview stage, the person interviewing you either knows that or is an idiot. You should be good in either case, although I would recommend not working for idiots. It is in your best interest to be honest (generally, but especially in this case) about your lack of experience (again, don’t volunteer answers to unasked questions) because it will be very easy to sniff out a lie – assuming you’re not with the idiot.

I feel like I’m under qualified for all the jobs I look at.

To be fair, this is a statement and not a question, but here’s the rub:

Don’t assume that everyone applying for or working in a job knows more or is more experienced than you.

“Think of how stupid the average person is, and realize half of them are stupider than that.” – George Carlin

Some of those stupid people have the job you want. They may have even gotten the job because they were oblivious to their own weaknesses and came off confident as fuck.

Regarding “job requirements”: ignore them past how they frame the nature of the role. In most cases they are an aspirational wish list that someone copy/pasted from someone else’s aspirational wish list. That folks often ask for more years of experience in a specific tech than the tech has existed is testament to the lack of thought that goes into most postings.

If you see a job description and immediately think “I’m qualified for this.” you are probably over-qualified or very senior and have an appropriate level of confidence.

When I write job descriptions I try to use the requirements to filter for “I need you to be somewhat familiar with this branch of tech.” vs. “I need someone with exactly 5 years of experience writing Terraform.”

On paper, I’ve never fully met the requirements for the roles I’ve been hired for. I don’t even have a college degree and sound like Hank the Cowdog when I talk. Don’t let the “requirements” stop you from applying for something.

I mean, they let me in. You’re probably way nicer than I am.

It says “senior”. I’m not a senior.

Again, not a question.

Here’s a secret. Job titles are bullshit. Ignore them.

Their functional purpose is almost entirely for creating pay bands. Just work hard, be kind, remove bottlenecks, and solve pain points.

Ignoring your title and the scope that it implies can carry your career a lot farther than trying to adhere to some rigid definition. Stretch your wings and get involved in whatever interesting work you come across. If you approach work that wouldn’t typically fit your title with humility and thoughtfulness, you will generally be welcome.

Summary: The stakes are low

If you’re reading this, it’s unlikely that you’re in a situation where getting a specific job is literally the difference between life or death. To paraphrase the mostly problematic founder of GoDaddy – No one is going to eat you if you fail.

Apply for roles that interest you. Pitch yourself with confidence. Learn something from every interview, whether you get the job or not.

And when you do get the job, reach a hand back down the ladder to help someone else up.

Photo by Andre Mouton on Unsplash


Which Kubernetes Container Probe Should I Use?

As you lean into your Kubernetes (k8s) cluster’s self-healing capabilities, one of the first concepts you’ll encounter is that of readiness, liveness, and startup probes. These can be a little confusing because there is a bit of overlap between them. There’s also a lot of risk in getting your configuration wrong, so it’s an area you’ll want to put significant thought towards.

What are probes in this context?

k8s probes are effectively abstractions of health checks. Each type has a different function, but they all provide your cluster with feedback on how it should handle a specific container when it is in various unhealthy/ready states.

Before you slap a bunch of probes on your deployment configuration, it’s important to understand the behavior of each probe type and whether you actually need it or not. Probes can affect one another, so layering them on may have unexpected effects.

You’ll already encounter strange problems in the simplest of dynamic environments; don’t add to your worries by copy/pasting in a bunch of probe configurations thinking you’re making everything better. Note: Finger pointed at self.

The probes all use identical configuration options (but should not be configured identically!), which does make things easier. They can be based on TCP, HTTP, or EXEC responses, making them very flexible and likely preventing you from having to write a lot of custom health check nonsense.

readinessProbe:          # same option shape for liveness/startup probes
  httpGet:
    path: /healthz       # illustrative endpoint
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20

Startup Probes

Most configuration examples you come across won’t contain startup probes, because 1) they are newish (k8s 1.16), and 2) you likely don’t need one unless your application is old and janky or particularly slow to start up – which may be because it is old and janky.

Startup probes signal your cluster that the container in question has fully started.

Startup probes keep the other probe types from starting and potentially killing your slow-to-start app container. It’s important to note that they only run at container start and will either kick over responsibility to another probe (if successful) or kill your container (if they time out).

They also allow more granular control, since they specifically target the pod/container startup, allowing you to configure the other probes to handle different container states.

That being said, readiness probes handle startup detection well and can generally be used in lieu of a startup probe.
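If you do need one, a sketch might look like this (endpoint and timings are made up; the key idea is that failureThreshold × periodSeconds is the total time a slow app gets to come up before it’s killed):

```yaml
startupProbe:
  httpGet:
    path: /healthz      # illustrative endpoint
    port: 8080
  failureThreshold: 30  # 30 failures allowed...
  periodSeconds: 10     # ...10s apart = up to 5 minutes to start
```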

Readiness Probes

You can usually get away with running only readiness probes. This probe type detects when a container is ready to receive traffic, which may include some condition past “Hi, I’ve started.” depending on your app.

Although the name might imply that the probe just checks for the initial ready state and is done, these probes actually run continuously based on the period you set. If they ever meet the configured failureThreshold, your cluster will stop directing traffic to the container until it is healthy again and meets the successThreshold.

You’ll want to be thoughtful with the timing you use for these as mistiming can lead to pods/containers cycling in and out of ready state.

Also, very important, the health check that your readiness probe targets should not rely on any external services to report health (i.e. a database connection). If your cluster or the external service encounters a network blip that it should recover from gracefully (and quickly), the readiness probe might put things in a state where your entire service is unable to serve traffic (or other bad things) because containers reported as temporarily unhealthy. This post has solid guidance on what to think through when configuring your readiness probes.
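Putting that guidance together, a hedged sketch of a readiness probe (endpoint and timings are illustrative):

```yaml
readinessProbe:
  httpGet:
    path: /ready        # should check in-process state only --
    port: 8080          # no database or external-service calls
  periodSeconds: 10
  failureThreshold: 3   # ~30s of failures before traffic is pulled
  successThreshold: 1   # one success puts the pod back in rotation
```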

Liveness Probes

Liveness probes detect stalled containers (deadlocks) and trigger a restart of the container. Ideally, instead of configuring a liveness probe, you should just fix whatever is causing deadlocks, cuz deadlocks are bad.

Liveness and readiness probes can compete with one another so be careful in your implementation if you have to use one. Again, don’t use these unless you absolutely have to. Because your containers can get stuck in a restart loop, liveness probes have high potential to cause a cascading failure across your service (and possibly cluster).

Letting a container die and exit 1 is a much better solution. I won’t go as far as to say that you should never use a liveness probe, but you really need to be confident in the need and configuration due to the risk of what might happen if your liveness probe doesn’t behave as expected (or behaves as expected and drives some other unexpected behavior downstream).


Start with readiness probes and only use the other probe types if explicitly needed and you know what you’re doing.

Photo by Raul Petri on Unsplash

Tech TIL

Adventures in tuning unicorn for Kubernetes

There isn’t much detailed info in the wild about folks running Ruby on Rails on Kubernetes (k8s); the Venn diagram for the respective communities doesn’t have a ton of overlap. There are some guides that cover the basics, enough to get a lab up and running, but not much at all about running a production environment. The situation is getting better slowly.

The #kubernetes-ruby Slack channel has all of ~220 people in it versus the thousands found in channels for other languages. If you scroll through the history, you’ll find that most of the questions and responses cover Day 0/1 issues – “How do we make it run?”-type problems.

So, I thought it would be worthwhile to share a bit of my own experience trying to optimize a RoR deployment in Kubernetes, if only to save another Ops-person from stumbling through the same mistakes and dead-ends that I encountered.

Some background

The initial move of our RoR app from Heroku to Kubernetes was relatively straightforward. Much of the effort went into stress-testing the auto-scaling and resource config to find behavior that felt OK-enough to start.

Part of that tweaking required making our worker/container config less dense to be cluster-friendly and provide smooth scaling versus the somewhat bursty scale-up/down we had seen with running a high process count on Heroku dynos.

Generally, small, numerous containers running across many nodes is what you want in a clustered setting. This optimizes for resiliency and efficient container packing. It can also make scaling up and down very smooth.

We settled on an initial config of 2 unicorn (Rails app server) processes per container with 768MB of RAM and 1000 CPU millicores – maxing out at a few hundred k8s pods. This felt bad because it abstracted traditional unicorn optimization practices up one level (more workers per unicorn socket = better routing efficiency), but it seemed to perform OK, and denser configs (confusingly) appeared to have worse performance within the cluster. It also jibed with the limited documentation we could find for other folks making similar migrations.

The initial goal was satisfied – get it moved with limited downtime and acceptable performance, where “acceptable” was basically “what it was doing on Heroku”. In fact, it seemed to be running a bit better than it had on Heroku.

Slower than it should be

Fast forward a year and tens of thousands more requests-per-minute. Our team decided we wanted to introduce service level indicators/objectives to inform product and infrastructure work. We chose a target for latency and started tracking it. We were doing OK in regard to the target, but not where we felt like we could be (I, personally, wanted a bit more buffer room.), so we started digging in to the causes of slowness within the stack.

It immediately became apparent that we were blind to network performance across some layers of the stack. We were tracking app/database level latency and could derive some of the latency values for other tiers via logs, but the process was cumbersome and too slow for real-time iteration and config tweaking.

A co-worker noticed we were missing an X-Request-Start header in our APM telemetry. We added the config in our reverse-proxy (nginx) and discovered a higher-than-expected amount of request queuing between nginx and unicorn.
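For reference, adding that header in nginx is a one-liner (this is the format New Relic documents; your APM may expect something slightly different):

```nginx
# In the location block that proxies to the app server:
proxy_set_header X-Request-Start "t=${msec}";
```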

That kicked off a round of experiments with nginx, unicorn, and container configs. Some of these provided minor benefit. Boosting the number of app pods reduced the request queuing but also wasted a lot of resources and eventually plateaued. Increasing worker density was minimally successful. We went up to 3 workers w/ 1GB of RAM and saw better performance, but going past that yielded diminishing returns, even when increasing the pod request/limits in parallel.

Network captures weren’t very helpful. Neither were Prometheus metric scrapes (at least not to the degree that I was able to make sense of the data). As soon as requests entered the k8s proxy network, we were blind to intra-cluster comms until they hit the pod on the other side of the connection. Monitoring the unicorn socket didn’t show any obvious problems, but all the symptoms signaled that there was a bottleneck between nginx and unicorn – what you would see if connections were stacking up on the unicorn socket. We couldn’t verify that was the actual issue though.

After investing quite a bit of time going down sysctl and other rabbit holes, we decided to set the problem aside again to focus on other work. The experiments had yielded some performance improvements, and everything was performing “OK” versus our SLO.

Give it the juice

One of the goals of the SLI/SLO paradigm is to take measurements as close to the customer as is practical. In the spirit of that, we moved our latency measurement from the nginx layer to the CDN. We had avoided the CDN measurement previously because the time_taken value in AWS CloudFront logs is problematic and sensitive to slow client connections. However, AWS recently added an origin_latency value to their logs, making this tracking more practical and consistent.

Once we made the switch, we were sad. Per the updated measurement point, we were performing much worse than expected the closer we got to the client. This kicked off a new round of investigation.

Much of the unexpected latency was due to geography and TLS-handshakes. We detailed out some of the other causes and potential mitigations, listing unicorn config improvements as one of them. I set the expectation for those improvements low given how much time we had already invested there, and how mixed the results were.

But we gave it another go.

This time around, we introduced linkerd, a k8s service mesh, into the equation. This gave us better visibility into intra-cluster network metrics. We were able to test/iterate in real time in our sandbox environment.

We performed some experiments swapping out unicorn with puma (We hadn’t done this previously due to concerns about thread-safety, but it was safe enough to test in the sandbox context.). Puma showed an immediate improvement versus unicorn at low traffic and quickly removed any doubt that there was a big bottleneck tied directly to unicorn.

We carved out some time to spin up our stress test infra and dug in with experiments at higher traffic levels. Puma performed well but also ran into diminishing returns pretty quickly when adding workers/threads/limits. While troubleshooting we noticed that if we set a CPU limit of 2000 millicores, the pods would never use more than 1000. Something was preventing multi-core usage.

That something turned out to be a missing argument that I’ve so far only found in one place in the official k8s docs and was missing from every example of a deployment config that I’ve come across to date.

apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
    args:
    - -cpus
    - "2"

“The args section of the configuration file provides arguments for the container when it starts. The -cpus "2" argument tells the Container to attempt to use 2 CPUs.”

Turns out, it doesn’t matter how many millicores you assign to a request/limit via k8s config if you don’t enable a corresponding number of cores via container argument. The reason this argument is so lightly documented in the context of k8s is that it has little to nothing to do with k8s. -cpus is related to Linux kernel cpuset and allows Docker (and other containerization tools) to configure the cgroup a container is running in with limit overrides or restrictions. I’ve never had to use it before, so I knew nothing about it.

(╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻

(╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻

(╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻ (╯°□°)╯︵ ┻━┻

So many tables flipped… not enough tables.

With higher worker count, CPU/RAM limits, AND a container CPU assignment override, unicorn actually performed better than puma (This can be the case in situations where your app is not I/O constrained). Almost all of our request queuing went away.

We eventually settled on 8 workers per pod with 2000/4000 CPU request/limit and 2048/3584 MB RAM request/limit as a nice compromise in density vs. resiliency, and saw an average 50ms improvement in our p95 response time. (It’s possible we’ll tweak this further as we monitor performance over time.)
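In deployment-config terms, that lands as something like this (my paraphrase of the values above, not our literal manifest):

```yaml
# plus, in unicorn.rb: worker_processes 8
resources:
  requests:
    cpu: "2000m"
    memory: 2048Mi
  limits:
    cpu: "4000m"
    memory: 3584Mi
```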

The issue had been a unicorn socket routing bottleneck the entire time, just as had been suspected earlier on. The missing piece was the CPU override argument.

Note: the default value for -cpus is ‘unlimited’. In our case, something like a host param was overriding that.

What have we learned

There are a few things worth taking away here in my mind.

  1. Yet again, k8s is not a panacea. It will not solve all your problems. It will not make everything magically faster. In fact, in some cases, it may make performance worse.
  2. Before you run out and start installing it willy-nilly, Linkerd (or service meshes in general) will not solve all your problems either. It was helpful in this context to enable some troubleshooting, but actually caused problems during stress testing when we saturated the linkerd proxy sidecars, which in turn caused requests to fail entirely. I ended up pulling the sidecar injection during testing rather than fiddling with additional resource allocation to make it work correctly.
  3. For all the abstraction of the underlying infrastructure that k8s provides, at the end of the day, it’s still an app running on an OS. Knowledge and configuration of the underlying stack remains critical to your success. You will continually encounter surprises where your assumptions about what is and is not handled by k8s automagic are wrong.
  4. Layering of interdependent configurations can be a nightmare to troubleshoot and can make identifying (and building confidence in) root causes feel almost impossible. Every layer you add increases complexity and difficulty exponentially. Expertise in individual technologies (nginx, unicorn, Linux, k8s, etc) helps, but isn’t enough. Understanding how different configurations interact with one another across different layers in different contexts presents significant challenges.

Photo by Joen Patrick Caagbay on Unsplash

People Tech

You are not your code

You… are not your code. You are not your configs, your documentation, your blog posts, your conference talks, or your job.

You. Are. Not. Your. Work.

I say this as someone who spent many years of my life deriving personal value from the work I produced. It started in angst-y teenage writing – “No one understands me. These words are my true self.” – one of the dumbest of youthful ideas, possibly only surpassed by “I need to hurt to create.”

I took it into my career. “I am only good if I do a good job.” And I spent years feeling like crap about myself because my work was never as good as I felt it should be. I took others’ criticism “well”, but only because I absorbed it and later amplified it through my own lens. If something was wrong with my work, it was a personal flaw and I needed to fix it. Note taken.

This is bullshit. The sooner you internalize this and integrate it into your everyday, the better. Your job is a role you play, a hat you put on, a thing you do. The things you produce – code, writing, whatever, are at best a watered down reflection of your thoughts at the time – an artifact. (Worth noting: you are also not your thoughts. They are just things you temporarily have, but that’s a much bigger can of worms.)

Your value as a person is independent of all these things. You can produce garbage work and be an amazing person. You can produce amazing work and be a less than amazing person.

We grow up hearing “actions speak louder than words” and wrap that idea around our work reflecting who we are. I don’t think that’s the intent of the message, which has more to do with relationships and personal interaction than work.

As we develop in our craft and inherently pour more of what we view as our “selves” into our work, it becomes harder to keep those ideas separate. We learn to handle criticism from peers and managers, through code review, general feedback, or editing, by developing “thicker skin”, which in most cases just means walling ourselves off. This is the wrong metaphor and ineffective anyway. Thick skin won’t protect you from yourself. Very few of us are taught how to handle that criticism.

Turns out, rather than thick skin, you might be better off with a spam filter, something that can differentiate between commentary on your work and commentary on you. Something that can detect the intent, spirit, and context of feedback, internal or otherwise, and classify it accordingly. “These messages get delivered to the inbox to be considered for future analysis. These other, not-useful messages get delivered to the tire fire next to the empty field where my fucks grow.”

We (tech people) are generally terrible at these distinctions. Part of it is due to the visibility of our work among peers, part of it is rampant workaholism, part of it is the isolation inherent to “deep work”, part of it is that many of us belong to a generation of emotionally damaged latch-key kids. There are a multitude of reasons and they manifest themselves in a strange form of conflict aversion where we have massive reactions to criticisms of our work and don’t push back at all to actual personal attacks.

I haven’t mastered this. I struggle every day but I keep trying because I’ve felt the truth of it and the alternative is unsustainable (Like, literally, I would die.). When you’re able to have a clearer distinction between yourself and your work, and in turn, commentary on yourself or your work, things stop feeling so bad.

You don’t need a daemon to make you feel like a bad person for writing bad (whatever that actually means) code or not performing up to some standard (yours or others’) at work. Shut that shit down.

Be proud of your wins. Learn from the misses and then set them aside. If doing “good” work is actually important to you, you’ll be much better enabled if you can redirect all that emotional energy you’ve been spending criticizing yourself or dealing with others’ criticism by funneling it into being kind to yourself and seeking value elsewhere.

Photo by Daniil Kuželev on Unsplash


You probably shouldn’t be using Kubernetes for your new startup

Kubernetes (k8s) is awesome. I am a fan. I like it so much I have a k8s-themed license plate on my car, but I cringe every time I see a blog post or tweet pitching it as a solution for a nascent company.

Like microservices (which, strong opinion incoming… you also probably shouldn’t be using, especially for something new. Seriously, stop.), k8s solves a specific set of problems (mostly coordination/abstraction of some infrastructure components/deployments and a lot of stuff related to general scaling/self-healing) and comes with significant, usually overlooked cost.

From a sysadmin perspective, k8s is borderline magic. Compared to all the bespoke automation one might have had to build in the past, k8s provides a general purpose infrastructure-as-code path that *just works*. A metaphor of Lego-like bricks that snap together is apt… for the most part.

K8s abstracts away a huge amount of complexity that normally consumes the lives of sysadmins, but the complexity is still there. It’s just gracefully hidden 95% of the time, and the way it bubbles up is binary. Problems in k8s are either incredibly easy to solve or incredibly difficult. There’s not much in-between. You’re either building with Lego or troubleshooting polymers at the molecular level.

Deploying a completely new service, configured from ingress to database in an easy-to-read YAML file? – Super simple.

Understanding the interplay of infra-service network optimizations and failure modes? – Even with tools like service meshes and advanced monitoring/introspection, it’s really difficult.

Cluster security, networking controls, third-party plugins? Now you’re in deep, specific-domain-knowledge land.

Hosted k8s (EKS, AKS, GKE, etc.) does not solve these problems for you. **Caveat: I know there are some fully-managed k8s providers popping up, but the best of those are basically Platform-as-a-Service (PaaS).** It solves a lot of other problems related to the care and feeding of the k8s control plane, but you’re still left with the complexity that’s inherent to running services within a cluster. Even if you’re a ninth-level Linux witch, there are problems that arise when running clustered infrastructure at scale that are simply *hard*, in a similar (but admittedly less-complex) way that neuroscience is hard.

There is a point at which the challenge of this hidden complexity begins to be outweighed by the benefits of k8s, but it’s pretty far down the road – we’re talking many-millions-of-requests-per-day-with-several-tiers/services-and-possibly-geographies territory. Or you’re in a very specific niche that requires complex auto-scaling machine learning fleets, or something similar.

This is not intended as fear mongering. Again, I use k8s every day and think it is awesome, but you need to go into it with eyes wide open, and only after you’ve leaned hard into the constraints of PaaS or more traditional, boring tech that you fully grok. I started using k8s with this perspective (at least I think I did), and there were still surprises along the way. It’s not a panacea, and it’s highly unlikely that using k8s is going to save your company. Like most technologies, it will cause as many problems as it solves; you just need a solid understanding of, and rationale for, which set of problems you want and are capable of dealing with.

If you’re building a new company or product, troubleshooting k8s is likely not one of the problem sets you should be dealing with. Use Heroku or Elastic Beanstalk or whatever else takes care of the undifferentiated heavy lifting for you. You can circle back to k8s when things really start cooking and you’ve got the people and resources to keep things on track.

None of this is to say you shouldn’t learn k8s or play around with minikube in development. Just keep in mind the huge difference between mastering k8s on your local machine and operationalizing it in production.

You could replace “k8s” with pretty much any technology and I think this advice would still apply. If you’re building something new, focus on the things that really move the needle and don’t try to solve architectural problems that you don’t have.

Photo by Frank Eiffert on Unsplash


TIL: How to live-rotate PostgreSQL credentials

OK, I didn’t actually learn this today, but it wasn’t that long ago.

Postgres creds rotation is straightforward with the exception of the PG maintainers deciding in recent years that words don’t mean anything while designing their identity model. “Users” and “Groups” used to exist in PG, but were replaced in version 8.1 with the “Role” construct.

Here’s a map to translate PG identifiers to a model that will make sense for anyone who is familiar with literally any other identity system.

Postgres → Literally anything else
Role (with LOGIN) → User
Role (with NOLOGIN) → Group
IN ROLE / role membership → Group membership

Now that we’ve established this nonsense, here’s a way of handling live creds rotation.

CREATE ROLE user_group; -- Create a role to act as the group; give it the appropriate grants.

CREATE ROLE user_blue WITH ENCRYPTED PASSWORD 'REPLACE ME' LOGIN IN ROLE user_group; -- The currently active credential.

CREATE ROLE user_green WITH ENCRYPTED PASSWORD 'REPLACE ME AS WELL' NOLOGIN IN ROLE user_group; -- This one isn't being used yet, so disable login.

That gets you prepped. When you’re ready to flip things:

ALTER USER user_green WITH PASSWORD 'new_password' LOGIN;

Update the creds wherever else they need updating, restart processes, confirm everything is using the new credentials, etc. Then:

ALTER USER user_blue WITH PASSWORD 'new_password_2' NOLOGIN;

Easy, peasy.
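If you rotate often enough to script it, the flip above reduces to two ordered statements. A small sketch – the helper name, role names, and password are all placeholders, and the statements simply mirror the SQL in this post:

```python
def rotation_statements(active: str, standby: str, new_password: str) -> list[str]:
    """Return the ordered SQL for a blue/green credential flip:
    enable the standby role with a fresh password, then (after apps
    have been switched over) disable logins on the old role."""
    return [
        # Step 1: activate the standby role with the new credential.
        f"ALTER ROLE {standby} WITH PASSWORD '{new_password}' LOGIN;",
        # ...update app configs, restart processes, verify...
        # Step 2: disable the old role once nothing is using it.
        f"ALTER ROLE {active} WITH NOLOGIN;",
    ]

for stmt in rotation_statements("user_blue", "user_green", "s3cret"):
    print(stmt)
```

The next rotation just swaps which role is `active` and which is `standby`.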


Kubernetes EC2 autoscaling for fun and profit

I’m drawn to the puzzles of distributed systems and abstracted platforms – the problems that only crop up when lots of moving pieces work in tandem (or not!).

I recently encountered one of these issues a few weeks after a platform migration to AWS EKS.

The symptoms

The initial problem manifested itself as an application async worker issue.

  1. Async process queues began stacking up and triggering alerts.
  2. Investigation of the worker process revealed that:
    • Workers reported healthy
    • Workers seemed to be processing the maximum number of threads per worker
    • Workers were using minimal compute resources
    • Some of the queue was getting processed
  3. Re-deploying the async worker Kubernetes (k8s) pods resolved the immediate problem and the queues started draining again.

Our core app reported a small number of failed database requests at the same time that queues started stacking. This pointed us at the network and our DB connection pooler, pgbouncer, both of which looked fine. However, a couple of pgbouncer k8s pods had migrated to different k8s nodes a few minutes before we saw the queue issue.

This got us looking at node autoscaling. The node the migrated pgbouncer pods were running on had been autoscaled down, forcing their restart on another node. This is expected behavior. It was, however, unexpected that our async workers’ connections to pgbouncer wouldn’t time out and attempt a re-connect.

The async worker threads were holding open connections that would never complete or fail, stalling them out and preventing them from processing new queue items.
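One generic mitigation for this dead-peer failure mode (not the route we took – our actual attempts are listed below) is aggressive TCP keepalives, so the kernel eventually kills connections to a host that vanished without a FIN or RST. A sketch using Python’s stdlib `socket` module, with arbitrary thresholds; the `TCP_KEEP*` options are Linux-specific:

```python
import socket

def keepalive_socket(idle: int = 30, interval: int = 10, probes: int = 3) -> socket.socket:
    """Create a TCP socket with keepalives tuned so a silently dead peer
    (e.g. a pod whose node was terminated out from under it) is detected
    instead of the connection hanging forever."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):   # seconds idle before the first probe
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):  # seconds between probes
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):    # failed probes before the kernel drops the connection
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return s

sock = keepalive_socket()
sock.close()
```

With settings like these, a stalled worker connection would error out within a minute or so instead of holding its thread hostage indefinitely.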

Attempts to fix

Wanting to lean into k8s’ transience and statelessness we approached the problem from a few angles:

  1. Liveness probes w/ a DB connection health check – We already had these configured for pgbouncer and most other components, but not for our async workers. Thinking through this option, we suspected there would be issues with false negatives, so we decided to put it on the back burner.
  2. Async worker DB connection timeouts – This had global app ramifications, as it required reconfiguration of the pg gem or Rails ActiveRecord, both of which felt like bad options and turned out to actually be pretty gnarly when tested.
  3. Configure a k8s container lifecycle hook for pgbouncer – This was already in place but didn’t appear to be working consistently.
  4. Set up a dedicated node pool just for pgbouncer and disable autoscaling – This seems to be what most people running pgbouncer in k8s are doing, but it felt bad philosophically, so we set it aside as a last resort.

Most effort focused on the lifecycle hook option. Making sure pgbouncer received a SIGINT instead of a SIGTERM let it close out running connections safely and reject new connections. This looked like it was going to solve the problem.
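For reference, a preStop hook of that shape looks roughly like the fragment below. This is a sketch, not our exact config: it assumes pgbouncer runs as PID 1 in its container, and the image name and timing values are placeholders. The grace period must exceed the hook’s sleep, or k8s will SIGKILL the pod mid-drain.

```yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: pgbouncer
      image: pgbouncer:latest   # placeholder image
      lifecycle:
        preStop:
          exec:
            # SIGINT tells pgbouncer to stop accepting new connections and
            # drain existing ones; the sleep gives in-flight sessions time to finish.
            command: ["/bin/sh", "-c", "kill -INT 1 && sleep 30"]
```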

It did not.

Sadness… then hope

This issue plagued us for a few weeks while we worked on higher priority items and performed research / testing. The problem was consistently tied to autoscaling down and happened at roughly the same time, but didn’t occur every day or every time a pgbouncer-hosting node scaled down.

Having run out of attractive options, we built out a dedicated node pool for pgbouncer and began testing it in QA. Again, this felt bad – adding a static component to a dynamic architecture.

Prior to deploying to production, we had another queue backup.

We looked at the pgbouncer logs during the last autoscaling event and noticed that neither SIGINT nor SIGTERM was being sent via the lifecycle preStop hook for the container. Then we looked at how the node was being autoscaled and compared it to an event where SIGINT was issued (and the preStop hook did trigger).

When the k8s cluster autoscaler was responsible for scaling a node down, SIGINT was sent and pgbouncer shut down gracefully. When AWS autoscaling performed a rebalance (making sure an equal number of instances is running in each availability zone), neither SIGINT nor SIGTERM was sent to the pgbouncer pod, and it died gracelessly.

This explained why the issue had been inconsistent – it only happened after the k8s cluster autoscaler scaled down and then the AWS autoscaler performed a rebalance across availability zones and just happened to pick a node with pgbouncer on it.

Turns out, this is a known, if lightly documented, issue. Spelunking in the Kubernetes Autoscaler docs revealed:

Cluster autoscaler does not support Auto Scaling Groups which span multiple Availability Zones; instead you should use an Auto Scaling Group for each Availability Zone and enable the --balance-similar-node-groups feature. If you do use a single Auto Scaling Group that spans multiple Availability Zones you will find that AWS unexpectedly terminates nodes without them being drained because of the rebalancing feature.

Which was our exact scenario. Derp.

The workarounds people are using:

  1. The thing in the docs – creating separate AWS autoscaling groups in each AZ and letting the k8s cluster autoscaler handle balancing with the --balance-similar-node-groups flag. This is kind of ugly and introduces more scaling and load-balancer complexity on the AWS side.
  2. Creating a node drain Lambda tied to an AWS autoscaling lifecycle hook that pauses the termination and issues drain commands on the node that’s being scaled down to let pods migrate gracefully. There is almost zero documentation for this solution. An example was added to the AWS-Samples repo about a week prior to us discovering the root cause.
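The second workaround boils down to a Lambda shaped roughly like the sketch below. This is an illustration under stated assumptions, not a tested implementation: the event fields match the documented EC2 Auto Scaling lifecycle event shape, but the node-name resolution is hand-waved (real handlers map the instance ID to a k8s node name first) and `kubectl` is assumed to be bundled with the function.

```python
import subprocess

def parse_lifecycle_event(event: dict) -> dict:
    """Extract what we need from an EC2 Auto Scaling
    'Instance-terminate Lifecycle Action' event."""
    detail = event["detail"]
    return {
        "instance_id": detail["EC2InstanceId"],
        "hook_name": detail["LifecycleHookName"],
        "asg_name": detail["AutoScalingGroupName"],
        "token": detail["LifecycleActionToken"],
    }

def handler(event, context):
    info = parse_lifecycle_event(event)
    # Hand-waved: assumes the instance ID can stand in for the node name.
    node_name = info["instance_id"]
    # Drain the node so pods (like pgbouncer) migrate gracefully.
    subprocess.run(
        ["kubectl", "drain", node_name, "--ignore-daemonsets"],
        check=True,
    )
    import boto3  # imported lazily; only available inside the Lambda runtime
    # Tell the ASG it may now proceed with termination.
    boto3.client("autoscaling").complete_lifecycle_action(
        LifecycleHookName=info["hook_name"],
        AutoScalingGroupName=info["asg_name"],
        LifecycleActionToken=info["token"],
        LifecycleActionResult="CONTINUE",
        InstanceId=info["instance_id"],
    )
```

The lifecycle hook pauses the instance in a `Terminating:Wait` state until `complete_lifecycle_action` is called, which is what buys the drain its time window.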

“How did I get here?”

It’s pretty simple, at least in hindsight.

There is much about Kubernetes that feels automagical, especially when you’re relatively new to it. Some of that is by design; it’s meant to abstract away a lot of the pain/details of the underlying infrastructure. In this case, I trusted too much in the automagicalness of the cluster autoscaler and assumed it was more tightly integrated with AWS than it currently is.

This is not a gripe, more a testament to how impressive the Kubernetes ecosystem is overall. It generally is automagical, so it’s easy to make assumptions. This is just an edge case the automagic hasn’t reached yet.


What I love about SRE

My childhood was soaked in science. As I learned the alphabet and how to tie my shoes, my dad spent his days taking water samples and caring for the fish that made up the research cohort for the aquaculture study he ran. We lived at a research site on the lake and I toddled along through three hot summers, staring into the eyes of whiskered catfish and witnessing the hard, mundane work of science interwoven through our daily lives.

I did a search recently and the only monument to that time I can find is an eight-page document that basically says “meh”.

One morning, years later, I sat in my dad’s lab injecting nematodes into hundreds of tiny, clear Dixie cups full of dirt samples, some of which would later be paired with marigold extracts. It wasn’t the most exciting “Take your son to work” Day, but once I developed a cadence there was a calming quality to it and time melted away.

It was more interesting to me as an adult, when I learned this type of research, as boring and un-sexy as it is, impacts whether millions of people get enough to eat.

Our living room and porch were filled with hybridization experiment rejects – peppers, squash, and random erosion-control plants in an assortment that would in no way be considered normal by anyone who actually raised house plants. They were misfits that didn’t have the right taste, shape, structure, or hardiness to make it to the next round, and the smell of their potting soil and green of their leaves transformed our house into a primordial jungle. For all of my dad’s commitment to the logic of science, a bit of animism also threaded through his work. He’d have felt bad tilling these plants back into the dirt or tossing them into an incinerator.

These “failures” were each data points and lessons. Some of those lessons were “don’t touch this and then rub your eyes”.

All of these objects and experiences embedded a system of discovery in me (Some might call it a method 😜.): make a guess -> try to prove your guess wrong and measure the results -> analyze and iterate. This method is a tool that helps reveal the fabric of reality. It’s the best thing humans have ever come up with.

Growing up surrounded by the practices of science taught me to find interest and beauty in the outwardly mundane, that there was opportunity, even in the most boring-seeming places, to discover something that no one else in the whole world knows – at least for a brief moment.

This kind of childhood inspired me to be curious and persistent. Other aspects of growing up weren’t great but this part of growing up was as close to perfect as I can imagine and I am grateful for it.

My career has meandered its way not into the biological or physical sciences, but into something we’re currently calling site reliability engineering – a strange amalgam of systems administration, performance management, software development, quality control, and the crafting of antifragility, a practice I don’t really know what to call other than “applied statistics”. In the narrowest view, SRE can be limited to a fancier name for release management, but in most organizations there is runway to make it much more.

As with any maturing discipline, people find different areas of focus, but the aspects of SRE that appeal most to me are those that mirror what I saw growing up, areas where the scientific method can be leveraged to chip away at problems that have, up until very recent history, only been attacked with intuition and business process consultations.

SRE doesn’t hold a monopoly on this approach. Anyone can start challenging assumptions with “What do we think is true and how would we know we’re wrong?” questions, but there are some unique, SRE-specific opportunities for experimentation at scale and within the distributed systems that SREs manage. And because of its inherent technical nature and practitioners’ comfort with data, SRE (along with data engineering and finance) provides a good beach head for science to wiggle its way into the rest of a business.

Science manifests itself in SRE in expressions as simple as “How do we measure and increase reliability? When and where do we encounter diminishing returns?” That’s a good place to start, but not where anyone should stop. Continuing the line of questioning of “What matters to us and how do we keep ourselves honest?” provides a lot of opportunities to provide value and make interesting discoveries.

Questions you ask could lead you to dig into cloud provider bills, or analyze access patterns to blob storage, or purposefully inject failure into systems to find their weaknesses. Managing servers is part of the job in a similar way to my dad having to feed the fish he was studying. It’s a base requirement, but it’s not the point.

The really interesting opportunities in SRE present themselves when you open yourself up to a broader definition of your role and of what questions you should be asking, and to whom. Thinking beyond the bottom tiers of the Maslow’s hierarchy of needs for systems operations allows you to effect change and make useful discoveries. This is where I thrive and where I find the things I love about SRE work – having real things to measure, make decisions about, and improve through a methodology that requires you to be honest about the world you live and work within.