Categories
Tech

Monitor available IPs with Lambda and CloudWatch

I ran into a situation where I needed to keep track of available IPs related to an AWS EKS cluster and couldn’t find any off-the-shelf tooling in AWS or otherwise to do so.

Tangential gripe: The reason I needed the monitor is that EKS doesn’t support adding subnets to a cluster without re-creating it and the initial subnets that were used were a little too small due to reasons. I wanted to have a sense of how much runway I had pending AWS fixing the gap or me implementing a workaround.

So, I cobbled together a Lambda function to pull the info and pipe it into CloudWatch.

Gist here

I’m using tags to scope the subnets I want to track, rather than piping in everything – since CloudWatch custom metrics cost money. But you could use whatever filters you wanted.
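The gist isn't reproduced here, so this is only a minimal sketch of the approach, not the original function. It assumes Python with boto3, a hypothetical `monitor-ips=true` tag, and a hypothetical `Custom/Subnets` namespace. The AWS clients are passed in so the logic can be exercised without touching AWS.

```python
def build_metric_data(subnets):
    """Turn EC2 subnet descriptions into CloudWatch metric datums."""
    return [
        {
            "MetricName": "AvailableIpAddressCount",
            "Dimensions": [{"Name": "SubnetId", "Value": subnet["SubnetId"]}],
            "Value": subnet["AvailableIpAddressCount"],
            "Unit": "Count",
        }
        for subnet in subnets
    ]


def publish_available_ips(ec2, cloudwatch, tag_key="monitor-ips",
                          tag_value="true", namespace="Custom/Subnets"):
    """Read free-IP counts for tagged subnets and push them to CloudWatch."""
    # Only the first page of results is read; an account with a lot of
    # matching subnets would want a paginator here.
    response = ec2.describe_subnets(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]}]
    )
    metric_data = build_metric_data(response["Subnets"])
    if metric_data:
        cloudwatch.put_metric_data(Namespace=namespace, MetricData=metric_data)
    return metric_data


def lambda_handler(event, context):
    # boto3 is imported lazily so the helpers above stay importable without it.
    import boto3

    published = publish_available_ips(
        boto3.client("ec2"), boto3.client("cloudwatch")
    )
    return {"subnets_reported": len(published)}
```

Run it on a schedule and each tagged subnet shows up as its own metric, keyed by the `SubnetId` dimension.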

After getting the data into CloudWatch, I was quickly reminded that you can’t alarm off of multiple metrics, so I used a math expression (MIN) to group them instead. This works for up to 10 metrics (this post should really be titled “The numerous, random limitations of AWS”), which luckily worked for me in this instance.

Then I set up an alarm for the threshold I wanted and tested it – it worked. Fun times.
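For the curious, here is a rough sketch of what the MIN grouping plus alarm could look like as arguments to boto3's `put_metric_alarm`. The namespace, metric name, alarm name, and threshold are all assumptions carried over for illustration, not my actual settings.

```python
def build_min_alarm(alarm_name, namespace, subnet_ids, threshold, period=300):
    """Build put_metric_alarm kwargs that alarm on the lowest subnet value."""
    if not 1 <= len(subnet_ids) <= 10:
        raise ValueError("this sketch sticks to the 10-metric limit from the post")
    metrics = [
        {
            "Id": f"m{index}",
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": "AvailableIpAddressCount",
                    "Dimensions": [{"Name": "SubnetId", "Value": subnet_id}],
                },
                "Period": period,
                "Stat": "Minimum",
            },
            "ReturnData": False,  # inputs only; the alarm watches the expression
        }
        for index, subnet_id in enumerate(subnet_ids)
    ]
    # MIN(METRICS()) collapses every input series into one "worst subnet" series.
    metrics.append({
        "Id": "lowest",
        "Expression": "MIN(METRICS())",
        "Label": "Lowest available IP count",
        "ReturnData": True,
    })
    return {
        "AlarmName": alarm_name,
        "ComparisonOperator": "LessThanThreshold",
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "Metrics": metrics,
    }
```

Passing the result along as `cloudwatch.put_metric_alarm(**build_min_alarm(...))` would create the alarm; only the expression has `ReturnData` set, so that is the single series being evaluated.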

Photo by Taylor Kopel on Unsplash

Categories
Tech

Do NOT use wildcard alternate domains in AWS CloudFront

CloudFront configs allow for alternate domain names if you’d like to use a custom domain for your CDN distribution. You likely want that.

The alternate domain list can include wildcard subdomains, like *.example.com – see the docs. This is handy for dev and experiment environments so you don’t have to be constantly updating the config.

Imagine you have a scenario where you have a Route53 (DNS) & CloudFront config that looks like:

example.com - ALIAS - CloudFrontDistroA 
(where example.com is defined as an alias)

www.example.com - CNAME - example.com

random.example.com - ALIAS - CloudFrontDistroB 
(where *.example.com is defined as an alias)

What would you expect would happen when traffic went to www.example.com ?

If you’re not an insane person, you’d likely expect it to be directed to example.com, because that’s how DNS CNAMEs work.

You’ve probably guessed by now that this isn’t what actually happens.

Per AWS support, when traffic is routed to a CloudFront point of presence (POP), it will ignore the DNS config and route to the most specific rule across the CloudFront distributions in your account.

So, *.example.com will not only catch ANY example.com CNAME that doesn’t have an explicit mapping across all of your distributions’ alternate domain lists. It will also route ANY example.com CNAME there that doesn’t have an explicit mapping, regardless of where its DNS record points. That makes a sort of sense if you’re thinking about a wildcard’s behavior in isolation. The first part at least, the second part is rubbish.

What this means in practice though is that one distribution’s configuration may affect another distribution in unexpected ways.

To AWS’ credit, this is documented… barely and for the inverse scenario.

If you have overlapping alternate domain names in two distributions, CloudFront sends the request to the distribution with the more specific name match, regardless of the distribution that the DNS record points to. For example, marketing.domain.com is more specific than *.domain.com.
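To make the behavior concrete, here is a toy model of "most specific alias wins" in Python. It is only my reading of the documented rule, not CloudFront's actual matching code, and `pick_distribution` is a made-up helper.

```python
def pick_distribution(host, distributions):
    """Return the name of the distribution whose alias best matches host.

    distributions maps a distribution name to its alternate domain names.
    An exact alias beats a wildcard; among wildcards the longest suffix wins.
    DNS records are deliberately not consulted at all.
    """
    best_name, best_score = None, -1
    for name, aliases in distributions.items():
        for alias in aliases:
            if alias == host:
                score = len(alias) + 1000  # exact match always outranks wildcards
            elif alias.startswith("*.") and host.endswith(alias[1:]):
                score = len(alias)
            else:
                continue
            if score > best_score:
                best_name, best_score = name, score
    return best_name


distros = {
    "CloudFrontDistroA": ["example.com"],
    "CloudFrontDistroB": ["*.example.com"],
}
print(pick_distribution("www.example.com", distros))
```

With the config from earlier, www.example.com lands on CloudFrontDistroB even though its CNAME points at example.com. Adding www.example.com to DistroA's alternate domain list is what flips it back.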

I could see a use case for wanting an optional catch-all. I cannot fathom why anyone would desire this as the hard-coded default behavior. It completely fails the Principle of Least Surprise.

So my friends, do not use wildcard subdomains in your CloudFront Alternate Domain config. Learn from my mistakes. I was a fool for believing we could have nice things.

Categories
People Tech

Frequently asked tech career questions

I’ve reached the point in my career and have enough gray hair that people sometimes ask me for career advice. Usually they are in tech, trying to make a transition into a different role, or are trying to get into tech. I do my best to answer these questions because it is obvious that they have experienced some sort of trauma and/or have run out of better options if they are at the point where they can look into the melancholy and barely obscured madness behind my eyes and think “This is someone who can help me.”

To save others from having to brave the same journey, here is a small FAQ that reflects the common themes of those discussions.

How do I get started in X?

You just do the thing and keep doing it until someone is willing to pay you money to do it.

You become a writer by writing.

You become an engineer by engineering.

Unless you need a license (*I* won’t judge how you learned to remove appendixes, but others might.), you just need to start doing the thing.

This is not the answer people want or are looking for. But it really is that simple. There are no shortcuts or lifehacks. Simple != easy and that sometimes makes our brains sad. I’m sorry.

What tech do I need to learn to do X?

There are admittedly different fundamentals required for different types of jobs, but if you’re asking about specific technologies the answer is that it doesn’t really matter.

Python or Ruby? Doesn’t matter.

AWS or Azure? If you can figure out one, you can figure out the other.

If you have a specific role in mind, do some research on the thematic pillars of that role and pick 1-2 pieces of tech in each pillar to focus on. Do not go down the rabbit hole of infinite research trying to decide on the perfect lineup of tech to learn. That’s not a thing. You’ll just delay doing anything useful.

There is one “trick” I recommend here. People have mixed opinions about certifications, but if you can use studying for a target certification as a learning path, go for it. It may focus you a bit.

When is it OK to list X on my resume?

Have you made a thing or two that worked using X and can speak about it in the appropriate context? OK then, go for it unless the job was explicitly defined by X, like “Unity Developer” or “AutoCAD Tech”. The ancillary stuff is fine to list if it is relevant to the role and you have actually used it.

Is X just a word you saw somewhere? Do not put it on your resume. I generally don’t delight in making people squirm in interviews, but I will create a “learning moment” that may be uncomfortable if I catch someone keyword stuffing.

How do I answer interview questions about things I haven’t done without lying?

I have never lied in a job interview. I have, however, re-framed my experience in the context of the job I was applying for.

When it comes to the actual interview where you’re asked “Have you done X using Y?” and the answer is “No”, try going with “I’ve done A using B which is very similar to Y” – shifting the focus to how you solved a similar problem rather than the specific one they asked about. Granted, you can’t pull this trick for every question without coming off as evasive, but it’s a legit tool.

It’s also OK to not volunteer unprompted caveats like “I’ve done X… but never in production.” They don’t need to know that. You aren’t marrying this person. Just say “Yes” and move on.

Nailing an interview is more about being able to sell yourself than proving what you know, with the exception of coding interviews, which are a crapshoot (I prefer take-home exercises). The “selling” part feels icky to a lot of us, but it shouldn’t. All you are saying is “I’m worth hiring and would or could be good at this.” not “I am the platonic ideal of a human.” Selling your value isn’t lying, it’s being honest about your worth, which may require you working on liking yourself.

If you don’t have *any* experience and have made it to the interview stage, the person interviewing you either knows that or is an idiot. You should be good in either case, although I would recommend not working for idiots. It is in your best interest to be honest (generally, but especially in this case) about your lack of experience (again, don’t volunteer answers to unasked questions) because it will be very easy to sniff out a lie – assuming you’re not with the idiot.

I feel like I’m underqualified for all the jobs I look at.

To be fair, this is a statement and not a question, but here’s the rub:

Don’t assume that everyone applying for or working in a job knows more or is more experienced than you.

“Think of how stupid the average person is, and realize half of them are stupider than that.” – George Carlin

Some of those stupid people have the job you want. They may have even got the job because of how oblivious they were to their own weaknesses and came off confident as fuck.

Regarding “job requirements”: ignore them past how they frame the nature of the role. In most cases they are an aspirational wish list that someone copy/pasted from someone else’s aspirational wish list. That folks often ask for more years of experience in a specific tech than the tech has existed is testament to the lack of thought that goes into most postings.

If you see a job description and immediately think “I’m qualified for this.” you are probably over-qualified or very senior and have an appropriate level of confidence.

When I write job descriptions I try to use the requirements to filter for “I need you to be somewhat familiar with this branch of tech.” vs. “I need someone with exactly 5 years of experience writing Terraform.”

On paper, I’ve never fully met the requirements for the roles I’ve been hired for. I don’t even have a college degree and sound like Hank the Cowdog when I talk. Don’t let the “requirements” stop you from applying for something.

I mean, they let me in. You’re probably way nicer than I am.

It says “senior”. I’m not a senior.

Again, not a question.

Here’s a secret. Job titles are bullshit. Ignore them.

Their functional purpose is almost entirely for creating pay bands. Just work hard, be kind, remove bottlenecks, and solve pain points.

Ignoring your title and the scope that it implies can carry your career a lot farther than trying to adhere to some rigid definition. Stretch your wings and get involved in whatever interesting work you come across. If you approach work that wouldn’t typically fit your title with humility and thoughtfulness, you will generally be welcome.

Summary: The stakes are low

If you’re reading this, it’s unlikely that you’re in a situation where getting a specific job is literally the difference between life or death. To paraphrase the mostly problematic founder of GoDaddy – No one is going to eat you if you fail.

Apply for roles that interest you. Pitch yourself with confidence. Learn something from every interview, whether you get the job or not.

And when you do get the job, reach a hand back down the ladder to help someone else up.

Photo by Andre Mouton on Unsplash

Categories
Tech

Which Kubernetes Container Probe Should I Use?

As you lean into your Kubernetes (k8s) cluster’s self-healing capabilities, one of the first concepts you’ll encounter is that of readiness, liveness, and startup probes. These can be a little confusing because there is a bit of overlap between them. There’s also a lot of risk in getting your configuration wrong, so it’s an area you’ll want to put significant thought towards.

What are probes in this context?

k8s probes are effectively abstractions of health checks. Each type has a different function, but they all provide your cluster with feedback on how it should handle a specific container when it is in various unhealthy/ready states.

Before you slap a bunch of probes on your deployment configuration, it’s important to understand the behavior of each probe type and whether you actually need it or not. Probes can affect one another, so layering them on may have unexpected effects.

You’ll already encounter strange problems in the simplest of dynamic environments. Don’t add to your worries by copy/pasting in a bunch of probe configurations, thinking you’re making everything better. Note: Finger pointed at self.

The probes all use identical configuration options (but should not be configured identically!), which does make things easier. They can be based on TCP, HTTP, or EXEC responses, making them very flexible and likely preventing you from having to write a lot of custom health check nonsense.

livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20

Startup Probes

Most configuration examples you come across won’t contain startup probes, because 1) they are newish (k8s 1.16), and 2) you likely don’t need one unless your application is old and janky or particularly slow to start up – which may be because it is old and janky.

Startup probes signal your cluster that the application in the container in question has fully started. Whether it is ready to receive traffic is the readiness probe’s call.

Startup probes keep the other probe types from starting and potentially killing your slow-to-start app container. It’s important to note that they only run at container start and will either kick over responsibility to another probe (if successful) or kill your container (if they time out).
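How long is "time out" here? Going by the Kubernetes docs, a startup probe gives the app roughly failureThreshold × periodSeconds to come up, plus any initialDelaySeconds. A small Python sketch of that arithmetic (the helper name is mine; the defaults are the documented k8s defaults):

```python
K8S_PROBE_DEFAULTS = {
    "initialDelaySeconds": 0,
    "periodSeconds": 10,
    "failureThreshold": 3,
}


def startup_budget_seconds(probe):
    """Approximate seconds a container has to pass its startup probe."""
    settings = {**K8S_PROBE_DEFAULTS, **probe}
    return (
        settings["initialDelaySeconds"]
        + settings["failureThreshold"] * settings["periodSeconds"]
    )


# 30 attempts, 10 seconds apart: about five minutes of startup runway.
print(startup_budget_seconds({"failureThreshold": 30, "periodSeconds": 10}))
```

If your slowest realistic start takes longer than that number, the startup probe will kill the container before it ever gets going.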

They also allow more granular control, since they specifically target the pod/container startup, allowing you to configure the other probes to handle different container states.

That being said, readiness probes handle startup detection well and can generally be used in lieu of a startup probe.

Readiness Probes

You can usually get away with running only readiness probes. This probe type detects when a container is ready to receive traffic, which may include some condition past “Hi, I’ve started.” depending on your app.

Although the name might imply that the probe just checks for the initial ready state and is done, these probes actually run continuously based on the period you set. If they ever meet the configured failureThreshold, your cluster will stop directing traffic to the container until it is healthy again and meets the successThreshold.
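Here is a simplified Python model of how failureThreshold and successThreshold interact over a run of probe results. It is a sketch of the counting logic only, not how the kubelet is actually implemented, and `readiness_timeline` is a made-up name.

```python
def readiness_timeline(results, failure_threshold=3, success_threshold=1):
    """Replay probe results (True = pass) and return the ready state after each."""
    ready = False  # a container counts as not ready until a probe says otherwise
    passes = fails = 0
    timeline = []
    for passed in results:
        if passed:
            passes, fails = passes + 1, 0
            if passes >= success_threshold:
                ready = True
        else:
            fails, passes = fails + 1, 0
            if fails >= failure_threshold:
                ready = False
        timeline.append(ready)
    return timeline


probe_results = [True, False, False, True, False, False, False, True]
print(readiness_timeline(probe_results))
```

In this run the two isolated failures are absorbed, and the container only drops out of rotation on the third consecutive failure before recovering on the next pass. Tighten the thresholds and the same sequence starts flapping, which is the cycling problem to watch for.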

You’ll want to be thoughtful with the timing you use for these as mistiming can lead to pods/containers cycling in and out of ready state.

Also, very important, the health check that your readiness probe targets should not rely on any external services to report health (e.g. a database connection). If your cluster or the external service encounters a network blip that it should recover from gracefully (and quickly), the readiness probe might put things in a state where your entire service is unable to serve traffic (or other bad things) because containers were reported as temporarily unhealthy. This post has solid guidance on what to think through when configuring your readiness probes.

Liveness Probes

Liveness probes detect stalled containers (deadlocks) and trigger a restart of the container. Ideally, instead of configuring a liveness probe, you should just fix whatever is causing deadlocks, cuz deadlocks are bad.

Liveness and readiness probes can compete with one another so be careful in your implementation if you have to use one. Again, don’t use these unless you absolutely have to. Because your containers can get stuck in a restart loop, liveness probes have high potential to cause a cascading failure across your service (and possibly cluster).

Letting a container die and exit 1 is a much better solution. I won’t go as far as to say that you should never use a liveness probe, but you really need to be confident in the need and configuration due to the risk of what might happen if your liveness probe doesn’t behave as expected (or behaves as expected and drives some other unexpected behavior downstream).

tl;dr

Start with readiness probes and only use the other probe types if explicitly needed and you know what you’re doing.

Photo by Raul Petri on Unsplash

Categories
People Tech

You are not your code

You… are not your code. You are not your configs, your documentation, your blog posts, your conference talks, or your job.

You. Are. Not. Your. Work.

I say this as someone who spent many years of my life deriving personal value from the work I produced. It started in angst-y teenage writing – “No one understands me. These words are my true self.” – one of the dumbest of youthful ideas, possibly only surpassed by “I need to hurt to create.”

I took it into my career. “I am only good if I do a good job.” And I spent years feeling like crap about myself because my work was never as good as I felt it should be. I took others’ criticism “well”, but only because I absorbed it and later amplified it through my own lens. If something was wrong with my work, it was a personal flaw and I needed to fix it. Note taken.

This is bullshit. The sooner you internalize this and integrate it into your everyday, the better. Your job is a role you play, a hat you put on, a thing you do. The things you produce – code, writing, whatever, are at best a watered down reflection of your thoughts at the time – an artifact. (Worth noting: you are also not your thoughts. They are just things you temporarily have, but that’s a much bigger can of worms.)

Your value as a person is independent of all these things. You can produce garbage work and be an amazing person. You can produce amazing work and be a less than amazing person.

We grow up hearing “actions speak louder than words” and wrap that idea around our work reflecting who we are. I don’t think that’s the intent of the message, which has more to do with relationships and personal interaction than work.

As we develop in our craft and inherently pour more of what we view as our “selves” into our work, it becomes harder to keep those ideas separate. We learn to handle criticism from peers and managers, through code review, general feedback, or editing, by developing “thicker skin”, which in most cases just means walling ourselves off. This is the wrong metaphor and ineffective anyway. Thick skin won’t protect you from yourself. Very few of us are taught how to handle that criticism.

Turns out, rather than thick skin, you might be better off with a spam filter, something that can differentiate between commentary on your work and commentary on you. Something that can detect the intent, spirit, and context of feedback, internal or otherwise, and classify it accordingly. “These messages get delivered to the inbox to be considered for future analysis. These other, not-useful messages get delivered to the tire fire next to the empty field where my fucks grow.”

We (tech people) are generally terrible at these distinctions. Part of it is due to the visibility of our work among peers, part of it is rampant workaholism, part of it is the isolation inherent to “deep work”, part of it is that many of us belong to a generation of emotionally damaged latch-key kids. There are a multitude of reasons and they manifest themselves in a strange form of conflict aversion where we have massive reactions to criticisms of our work and don’t push back at all to actual personal attacks.

I haven’t mastered this. I struggle every day but I keep trying because I’ve felt the truth of it and the alternative is unsustainable (Like, literally, I would die.). When you’re able to have a clearer distinction between yourself and your work, and in turn, commentary on yourself or your work, things stop feeling so bad.

You don’t need a daemon to make you feel like a bad person for writing bad (whatever that actually means) code or not performing up to some standard (yours or others’) at work. Shut that shit down.

Be proud of your wins. Learn from the misses and then set them aside. If doing “good” work is actually important to you, you’ll be much better enabled if you can redirect all that emotional energy you’ve been spending criticizing yourself or dealing with others’ criticism by funneling it into being kind to yourself and seeking value elsewhere.

Photo by Daniil Kuželev on Unsplash

Categories
Tech

You probably shouldn’t be using Kubernetes for your new startup

Kubernetes (k8s) is awesome. I am a fan. I like it so much I have a k8s-themed license plate on my car, but I cringe every time I see a blog post or tweet pitching it as a solution for a nascent company.

Like microservices (which, strong opinion incoming… you also probably shouldn’t be using, especially for something new. Seriously, stop.), k8s solves a specific set of problems (mostly coordination/abstraction of some infrastructure components/deployments and a lot of stuff related to general scaling/self-healing) and comes with significant, usually overlooked cost.

From a sysadmin perspective, k8s is borderline magic. Compared to all the bespoke automation one might have had to build in the past, k8s provides a general purpose infrastructure-as-code path that *just works*. A metaphor of Lego-like bricks that snap together is apt… for the most part.

K8s abstracts away a huge amount of complexity that normally consumes the lives of sysadmins, but the complexity is still there. It’s just gracefully hidden 95% of the time and the way it bubbles up is binary. Problems in k8s are either incredibly easy to solve or incredibly difficult. There’s not much in-between. You’re either building with Lego or troubleshooting polymers at the molecular level.

Deploying a completely new service, configured from ingress to database in an easy-to-read YAML file? – Super simple.

Understanding the interplay of infra-service network optimizations and failure modes? – Even with tools like service meshes and advanced monitoring/introspection, it’s really difficult.

Cluster security, networking controls, third-party plugins? Now you’re in deep, specific-domain-knowledge land.

Hosted-k8s (EKS, AKS, GKE, etc.) does not solve these problems for you. (Caveat: I know there are some fully-managed k8s providers popping up, but the best of those are basically Platform-as-a-Service (PaaS).) It solves a lot of other problems related to the care and feeding of the k8s control plane, but you’re still left with complexity that’s inherent to running services within a cluster. Even if you’re a ninth-level Linux witch, there are problems that arise when running clustered infrastructure at scale that are simply *hard* in a similar (but admittedly less-complex) way that neuroscience is hard.

There is a point at which the challenge of this hidden complexity begins to be outweighed by the benefits of k8s, but it’s pretty far down the road – we’re talking many-millions-of-requests-per-day-with-several-tiers/services-and-possibly-geographies territory. Or you’re in a very specific niche that requires complex auto-scaling machine learning fleets, or something similar.

This is not intended as fear mongering. Again, I use k8s every day and think it is awesome, but you need to go into it with eyes wide open and only after you’ve leaned hard into the constraints of PaaS or more traditional, boring tech that you fully grok. I started using k8s with this perspective (at least I think I did), and there were still surprises along the way. It’s not a panacea. It’s highly unlikely that using k8s is going to save your company. Like most technologies, it will cause as many problems as it solves. You just need to have a solid understanding and rationale around which set of problems you want and are capable of dealing with.

If you’re building a new company or product, troubleshooting k8s is likely not in one of the problem sets you should be dealing with. Use Heroku or Elastic Beanstalk or whatever else that takes care of the undifferentiated heavy lifting for you. You can circle back to k8s when things really start cooking and you’ve got the people and resources to keep things on track.

None of this is to say you shouldn’t learn k8s or play around with minikube in development. Just keep in mind the huge difference between mastering k8s on your local machine and operationalizing it in production.

You could replace “k8s” with pretty much any technology and I think this advice would still apply. If you’re building something new, focus on the things that really move the needle and don’t try to solve architectural problems that you don’t have.

Photo by Frank Eiffert on Unsplash

Categories
Tech TIL

TIL: How to live-rotate PostgreSQL credentials

OK, I didn’t actually learn this today, but it wasn’t that long ago.

Postgres creds rotation is straightforward with the exception of the PG maintainers deciding in recent years that words don’t mean anything while designing their identity model. “Users” and “Groups” used to exist in PG, but were replaced in version 8.1 with the “Role” construct.

Here’s a map to translate PG identities to a model that will make sense for anyone who is familiar with literally any other identity system.

Postgres | Literally anything else
Role     | User
Role     | Group
Role     | Role

Now that we’ve established this nonsense, here’s a way of handling live creds rotation.

CREATE ROLE user_group; -- create a role, give it appropriate grants.

CREATE ROLE user_blue WITH LOGIN ENCRYPTED PASSWORD 'REPLACE ME' IN ROLE user_group; -- CREATE ROLE defaults to NOLOGIN, so the active role needs LOGIN spelled out.

CREATE ROLE user_green WITH ENCRYPTED PASSWORD 'REPLACE ME AS WELL' IN ROLE user_group nologin; -- This one isn't being used yet, so disable the login.

That gets you prepped. When you’re ready to flip things.

ALTER USER user_green WITH PASSWORD 'new_password' login;

Update the creds wherever else they need updating, restart processes, confirm everything is using the new credentials, etc. Then

ALTER USER user_blue WITH PASSWORD 'new_password_2' nologin;
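If you script the flip, something like this Python sketch could generate the two statements. The function names are made up, and the quoting is the bare minimum (doubling single quotes); in real tooling I'd lean on the database driver's own quoting rather than trust this.

```python
def sql_literal(value):
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def rotation_statements(active, standby, standby_password, retired_password):
    """Return the two statements for one blue/green flip, in execution order."""
    for name in (active, standby):
        # Role names are interpolated directly, so keep them boring.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unexpected role name: {name!r}")
    enable = (
        f"ALTER USER {standby} WITH PASSWORD "
        f"{sql_literal(standby_password)} LOGIN;"
    )
    disable = (
        f"ALTER USER {active} WITH PASSWORD "
        f"{sql_literal(retired_password)} NOLOGIN;"
    )
    return [enable, disable]


for statement in rotation_statements(
    "user_blue", "user_green", "new_password", "new_password_2"
):
    print(statement)
```

Run the first statement, roll the new creds out and confirm everything has switched, and only then run the second. On the next rotation the two role names simply swap places.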

Easy, peasy.

Categories
Tech

Kubernetes EC2 autoscaling for fun and profit

I’m drawn to the puzzles of distributed systems and abstracted platforms – the problems that only crop up when lots of moving pieces work in tandem (or not!).

I recently encountered one of these issues a few weeks after a platform migration to AWS EKS.

The symptoms

The initial problem manifested itself as an application async worker issue.

  1. Async process queues began stacking up and triggering alerts.
  2. Investigation of the worker process revealed that:
    • Workers reported healthy
    • Workers seemed to be processing the maximum number of threads per worker
    • Workers were using minimal compute resources
    • Some of the queue was getting processed
  3. Re-deploying the async worker Kubernetes (k8s) pods resolved the immediate problem and the queues started draining again.

Our core app reported a small number of failed database requests at the same time that queues started stacking. This pointed us at the network and our DB connection pooler, pgbouncer, both of which looked fine. However, a couple of pgbouncer k8s pods had migrated to different k8s nodes a few minutes before we saw the queue issue.

This got us looking at node autoscaling. The node the migrated pgbouncer pods were running on had been autoscaled down, forcing their restart on another node. This is expected behavior. It was, however, unexpected that our async workers’ connections to pgbouncer wouldn’t time out and attempt a re-connect.

The async worker threads were holding open connections that would never complete or fail, stalling them out and preventing them from processing new queue items.

Attempts to fix

Wanting to lean into k8s’ transience and statelessness we approached the problem from a few angles:

  1. Liveness probes w/ a DB connection health check – We already had these configured for pgbouncer and most other components but not for our async workers. Thinking through this option, we suspected there would be issues with false negatives so decided to put it on the back burner.
  2. Async worker DB connection timeouts – This had global app ramifications as it required reconfiguration of the pg gem or Rails ActiveRecord, both of which felt like bad options and turned out to actually be pretty gnarly when tested.
  3. Configure a k8s container lifecycle hook for pgbouncer. This was already in place but didn’t appear to be working consistently.
  4. Set up a dedicated node pool just for pgbouncer and disable autoscaling. This seems to be what most people running pgbouncer in k8s are doing, but it felt bad philosophically, so we set it aside as a last resort.

Most effort focused on the lifecycle hook option. Making sure pgbouncer received a SIGINT instead of a SIGTERM let it close out running connections safely and reject new connections. This looked like it was going to solve the problem.

It did not.

Sadness… then hope

This issue plagued us for a few weeks while we worked on higher priority items and performed research / testing. The problem was consistently tied to autoscaling down and happened at roughly the same time, but didn’t occur every day or every time a pgbouncer-hosting node scaled down.

Having run out of attractive options, we built out a dedicated node pool for pgbouncer and began testing it in QA. Again, this felt bad – adding a static component to a dynamic architecture.

Prior to deploying to production, we had another queue backup.

We looked at the pgbouncer logs during the last autoscaling event and noticed that neither SIGINT nor SIGTERM was getting sent via the lifecycle preStop hook for the container. Then we looked at how the node was getting autoscaled and compared it to an event where SIGINT was issued (and the preStop hook did trigger).

When the k8s cluster autoscaler was responsible for autoscaling a node down, SIGINT was sent and pgbouncer shut down gracefully. When AWS autoscaling performed a rebalance (making sure an equal number of instances is running in each availability zone), neither SIGINT nor SIGTERM was sent to the pgbouncer pod and it died gracelessly.

This explained why the issue had been inconsistent – it only happened after the k8s cluster autoscaler scaled down and then the AWS autoscaler performed a rebalance across availability zones and just happened to pick a node with pgbouncer on it.

Turns out, this is a known, if lightly documented issue. Spelunking in the Kubernetes Autoscaler docs revealed:

Cluster autoscaler does not support Auto Scaling Groups which span multiple Availability Zones; instead you should use an Auto Scaling Group for each Availability Zone and enable the --balance-similar-node-groups feature. If you do use a single Auto Scaling Group that spans multiple Availability Zones you will find that AWS unexpectedly terminates nodes without them being drained because of the rebalancing feature.

Which was our exact scenario. Derp.

The workarounds people are using:

  1. The thing in the docs – creating separate AWS autoscaling groups in each AZ and letting the k8s cluster autoscaler handle balancing with the --balance-similar-node-groups flag. This is kind of ugly and introduces more scaling and load-balancer complexity on the AWS side.
  2. Creating a node drain Lambda tied to an AWS autoscaling lifecycle hook that pauses the termination and issues drain commands on the node that’s being scaled down to let pods migrate gracefully. There is almost zero documentation for this solution. An example was added to the AWS-Samples repo about a week prior to us discovering the root cause.
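For a feel of the second workaround, here is a rough Python sketch of the Lambda's core logic. It is not the AWS-Samples code. It assumes the event is the EventBridge "EC2 Instance-terminate Lifecycle Action" shape, and `drain_node` is a hypothetical callable that would map the instance to its k8s node, cordon it, and evict its pods.

```python
def handle_termination(event, autoscaling, drain_node):
    """Drain the terminating instance's node, then release the lifecycle hook."""
    detail = event["detail"]
    instance_id = detail["EC2InstanceId"]
    try:
        # Hypothetical helper: cordon the node and evict pods so they
        # reschedule elsewhere before the instance goes away.
        drain_node(instance_id)
    finally:
        # Release the hook even if draining failed; otherwise the instance
        # just sits in Terminating:Wait until the hook times out.
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=detail["LifecycleHookName"],
            AutoScalingGroupName=detail["AutoScalingGroupName"],
            LifecycleActionToken=detail["LifecycleActionToken"],
            LifecycleActionResult="CONTINUE",
            InstanceId=instance_id,
        )
    return instance_id
```

In a real Lambda, `autoscaling` would be `boto3.client("autoscaling")` and `drain_node` would talk to the cluster API. The lifecycle hook is what buys the time: termination pauses until the hook is completed or times out, which is the window pgbouncer needs to shut down gracefully.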

“How did I get here?”

It’s pretty simple, at least in hindsight.

There is much about Kubernetes that feels automagical, especially when you’re relatively new to it. Some of that is by design, it’s meant to abstract away a lot of the pain/details of the underlying infrastructure. In this case, I trusted too much in the automagicalness of the cluster autoscaler and assumed it was more tightly integrated into AWS than it currently is.

This is not a gripe, more a testament to how impressive the Kubernetes ecosystem is overall. It generally is automagical, so it’s easy to make assumptions. This is just an edge case the automagic hasn’t reached yet.

Categories
Learning People Tech

What I love about SRE

My childhood was soaked in science. As I learned the alphabet and how to tie my shoes, my dad spent his days taking water samples and caring for the fish that made up the research cohort for the aquaculture study he ran. We lived at a research site on the lake and I toddled along through three hot summers, staring into the eyes of whiskered catfish and witnessing the hard, mundane work of science interwoven through our daily lives.

I did a search recently and the only monument to that time I can find is an eight-page document that basically says “meh”.

One morning, years later, I sat in my dad’s lab injecting nematodes into hundreds of tiny, clear Dixie cups full of dirt samples, some of which would later be paired with marigold extracts. It wasn’t the most exciting “Take your son to work” Day, but once I developed a cadence there was a calming quality to it and time melted away.

It was more interesting to me as an adult, when I learned this type of research, as boring and un-sexy as it is, impacts whether millions of people get enough to eat.

Our living room and porch were filled with hybridization experiment rejects – peppers, squash, and random erosion-control plants in an assortment that would in no way be considered normal by anyone who actually raised house plants. They were misfits that didn’t have the right taste, shape, structure, or hardiness to make it to the next round and the smell of their potting soil and green of their leaves transformed our house into a primordial jungle. For all of my dad’s commitment to the logic of science, a bit of animism also threaded through his work. He’d have felt bad if he had tilled these plants back into the dirt or tossed them into an incinerator.

These “failures” were each data points and lessons. Some of those lessons were “don’t touch this and then rub your eyes”.

All of these objects and experiences embedded a system of discovery in me (Some might call it a method 😜.): make a guess -> try to prove your guess wrong and measure the results -> analyze and iterate. This method is a tool that helps reveal the fabric of reality. It’s the best thing humans have ever come up with.

Growing up surrounded by the practices of science taught me to find interest and beauty in the outwardly mundane, that there was opportunity, even in the most boring-seeming places, to discover something that no one else in the whole world knows – at least for a brief moment.

This kind of childhood inspired me to be curious and persistent. Other aspects of growing up weren’t great but this part of growing up was as close to perfect as I can imagine and I am grateful for it.

My career has meandered its way not into the biological or physical sciences, but into something we’re currently calling site reliability engineering – a strange amalgam of systems administration, performance management, software development, quality control, and the crafting of antifragility, a practice I don’t really know what to call other than “applied statistics”. In the narrowest view, SRE can be limited to a fancier name for release management, but in most organizations there is runway to make it much more.

As with any maturing discipline, people find different areas of focus, but the aspects of SRE that appeal most to me are those that mirror what I saw growing up, areas where the scientific method can be leveraged to chip away at problems that have, up until very recent history, only been attacked with intuition and business process consultations.

SRE doesn’t hold a monopoly on this approach. Anyone can start challenging assumptions with “What do we think is true and how would we know we’re wrong?” questions, but there are some unique, SRE-specific opportunities for experimentation at scale and within the distributed systems that SREs manage. And because of its inherent technical nature and practitioners’ comfort with data, SRE (along with data engineering and finance) provides a good beachhead for science to wiggle its way into the rest of a business.

Science manifests itself in SRE in expressions as simple as “How do we measure and increase reliability? When and where do we encounter diminishing returns?” That’s a good place to start, but not where anyone should stop. Continuing the line of questioning of “What matters to us and how do we keep ourselves honest?” creates a lot of opportunities to provide value and make interesting discoveries.

Questions you ask could lead you to dig into cloud provider bills, or analyze access patterns to blob storage, or purposefully inject failure into systems to find their weaknesses. Managing servers is part of the job in a similar way to my dad having to feed the fish he was studying. It’s a base requirement, but it’s not the point.

The really interesting opportunities in SRE present themselves when you open yourself up to a broader definition for your role, what questions you should be asking and to whom. Thinking more broadly than what you need to do to address the bottom tiers of Maslow’s hierarchy of needs for systems operations allows you to effect change and make useful discoveries. This is where I thrive and find the things I love about doing SRE work – having real things to measure, make decisions about, and improve through a methodology that requires you to be honest about the world you live and work within.

Categories
Learning

Books from Q4 2018

Continuing from Q1, Q2, & Q3.

48. “The Worst Journey in the World” – Apsley Cherry-Garrard

The title of this book is pretty accurate. It covers the Terra Nova Expedition to the South Pole during 1910-1913 – a failed race (Roald Amundsen won) that resulted not only in reaching the pole late, but in the death of the main expedition teams. Overall, it’s a very uneven book and could have used a more forceful editor. The story and mission of the team are compelling, but the interesting bits are bogged down by redundant descriptions of countless identical mountains, valleys, and ice floes.

49. “Blood of Elves” – Andrzej Sapkowski

Schlocky low fantasy at its finest. This wasn’t nearly as terrible as I was expecting and I’ll likely finish the rest of the Witcher series. It’s obvious that there’s a bit lost in translation from Polish to English, but once you get through the first few chapters you don’t notice it as much. There’s considerably less misogyny than in the Witcher games, but it is only book one.

50. “Hollywood Dead” – Richard Kadrey

The Sandman Slim novels are a guilty pleasure of mine. Kadrey is weirdly inconsistent book-to-book and this was one of the lesser entries in the series. The end does a bit of tacked-on setup for a sequel but I think it would have been OK to end the series here. It doesn’t feel like there’s much ore left in the mine.

51. “The Time of Contempt” – Andrzej Sapkowski

More Witcher. The translation work gets better, and probably the original authorship as well.

52. “Baptism of Fire” – Andrzej Sapkowski

These titles are terrible. Maybe they sound better in Polish. Sapkowski hammers on some fantasy cliches pretty hard. The pace and world building are entertaining, but there’s a lot of deus ex machina as the series goes on and stuff like “how magic works” becomes increasingly nonsensical.

53. “Tower of Swallows” – Andrzej Sapkowski

Seriously, what is with these titles? A non-linear narrative starts creeping in for this book. It’s slightly confusing at first, but I eventually figured out that it was likely being used as a way to glue short stories together to make a novel. As that kind of tool, the narrative works pretty well.

54. “The Lady of the Lake” – Andrzej Sapkowski

Sapkowski does a decent job of tying together all the loose ends and bringing things to a close. The pace is uneven and there’s a bit of a “the 20x endings of The Return of the King (the film)” effect. “Is it over?” Nope. “How about now?” Still nope. That does make the actual ending a bit abrupt. If you’re gluing short stories together, I guess there’s probably not a clear way to end your book other than to stop gluing and that’s pretty much what happens.

55. “The Self-Driven Child” – William Stixrud and Ned Johnson

The core of this is pretty much “Control is an illusion. Let your child make mistakes and be there to support and advise them.” I can’t really say that it was good or bad, just that it was written for a much more anxious person than I, for I am a blade of grass in the breeze.

56. “Season of Storms” – Andrzej Sapkowski

This book is nearly unreadable. I made it three chapters in and couldn’t go any further. It starts with what could only be described as a scene from Fantasy Law & Order, with characters, including a sorcerer and a witcher, exchanging legalese in Latin. I feel dumber for having read the few pages I did.

57. “Kubernetes Up and Running” – Brendan Burns, Kelsey Hightower, Joe Beda

This is one of the better “Up and Running” O’Reilly books. A lot of it is out of date, even though it’s only a year old, due to how fast Kubernetes is evolving. Fortunately, the core concepts are the same and most of the out-of-date knowledge is easily updatable. The only thing I’d done with Kubernetes prior to reading this book was a convention workshop two years ago (although now that I’m thinking about it, that may have been Docker Swarm), so the hands-on stuff was really helpful to get me up to speed enough to start building things.

58. “Designing Data-Intensive Applications” – Martin Kleppmann

I don’t have enough computer science knowledge to get the full benefit of this book. It goes deep into things like B-trees and the history of replication algorithms that I lack the depth to grok. There were a few chapters that I understood reasonably well and were interesting to me, but several where I mostly thought “I know what these words mean individually…”. It has an illustrated map at the beginning of each chapter, which was fun.

59. “Symphony for the City of the Dead” – M.T. Anderson 

That the Soviets measured how well things were going in Leningrad during its siege by how many people were being arrested for cannibalism says a lot about both the contents of this book and the Russian experience of WWII, which really isn’t covered much in US history lessons. The book itself is odd but engaging. It’s somewhat set up as a biography of Shostakovich, but he mostly fades into the background of the hell being visited upon Russia during his life. Maybe that’s partially the point though.

60. “An Indigenous Peoples’ History of the United States” – Roxanne Dunbar-Ortiz

I don’t know what to say about this book other than everyone should read it.

61. “American Indian Stories and Old Indian Legends” – Zitkala-Sa

I’m glad this exists as a historical record, but it doesn’t really succeed as a book. It’s a mish-mash collection of Zitkala-Sa’s serialized autobiography and some Sioux folklore.

62. “October” – China Miéville

I wanted this to be better than it was. I picked it up as it is the confluence of two interests – Russia and China Miéville, whose weirdness I thought would do justice to the weirdness of the Russian Revolution, but he mostly plays it safe with a not-that-interesting, month-by-month narrative leading up to the October Revolution.

63. “Factfulness” – Hans Rosling

I bought a box-worth of this book to give as Christmas presents. It’s soooooo good and is exactly the book the world needs right now, filled with hope and wisdom. It is the perfect mix of STEM comms with both clearly translated data and infectious passion. This was by far my favorite book of any I read this year.

64. “Scarcity” – Sendhil Mullainathan and Eldar Shafir

It was likely a mistake to read this after “Factfulness”. It’s possible that it is an OK book, but it pales in comparison and follows the standard slog of “here’s a mildly interesting thesis followed by 300-400 pages of repetitive narrative to back it up.” 

65. “The City of Brass” – S. A. Chakraborty

Middle-Eastern-themed fantasy. Yaaaas! None of the historical context makes any sense given the anachronisms referenced but the novelty of the setting and mythology relative to what normally makes its way into Western readers’ hands makes up for it. I’m looking forward to the sequel.

66. “Annihilation” – Jeff VanderMeer

VanderMeer’s Southern Reach trilogy reminds me a lot of Stanislaw Lem’s “Solaris”. The film treatment of “Annihilation” is tons better than “Solaris” though (I can’t speak to the Russian film version as I have not seen it). As with “Solaris”, I would caution against reading this before bed as you will have strange, potentially disturbing dreams.

67.  “Authority” – Jeff VanderMeer

See above.

68.  “Midnight’s Children” – Salman Rushdie

I tried, but this got super-boring after the first 100 pages or so. I get why it was/is an important book in the context of India giving the middle finger to colonialism and such, but Rushdie’s writing comes off very smug in the “Look at me, I’m a literature major”-type of way. I feel the same way about Gabriel García Márquez though so maybe I’m the problem.

69.  “Things Fall Apart” – Chinua Achebe

This was a good reminder that I need to read more African literature. I came away challenged by it and need to read more about how it was received within Nigeria as the author is pretty matter-of-fact about “Here are some things that sucked prior to colonialism and here are some sucky things white people brought. Both sets of things suck.”

70.  “Naked Lunch” – William S. Burroughs

No. Just no. Paul Bowles remains the only Beat writer I can tolerate. 

71.  “Boom Town” – Sam Anderson

I wish more sports writers would pivot to non-sports writing. They come at their subjects with so much energy. Anderson’s subject in this case is Oklahoma City, which he tackles in a way I don’t think anyone ever has, addressing both the city’s strangeness and banality. I am somewhat biased, being a resident of OKC, but enjoyed the book a lot and think non-residents would as well. Other residents have complained about how much space is given to Wayne Coyne, given that he mostly seems to be a sad, coked-out, creepy old man these days, but that all seems to fit the nature of the place to me.

72.  “Rework” – Jason Fried and David Heinemeier Hansson

I enjoy the sentiments of Fried and DHH’s writing more so than the writing. That’s possibly an issue with the format, given that their books tend to be blog posts edited into book form. There’s probably a jokey metaphor about the Ruby programming language hidden in that description, but I’m not savvy enough to reach for it.

73.  “Fear and Loathing on the Campaign Trail ’72” – Hunter S. Thompson

Matt Taibbi’s intro to this book sums up my thoughts on Thompson pretty well. If you read Thompson thinking that his drug-fueled escapades are the point, you are, in fact, missing the point. Thompson’s vices served to help quiet the voices in his head and help him cope with a world he wished was different. Taibbi does hand-wave past Thompson’s racism in the book a bit too easily – there’s… a lot of it.

Also, the recent wave of people trying to rehabilitate Nixon can all go die in a fire.

74.  “Acceptance” – Jeff VanderMeer

See above for the comments on the Southern Reach trilogy in general. As for the closing book, a few threads get away from VanderMeer, but he does a good job of wrapping things up. I hope he keeps writing weird stuff. 

75.  “All the Real Indians Died Off” – Roxanne Dunbar-Ortiz and Dina Gilio-Whitaker

This is a really accessible intro to modern issues and misconceptions around North America’s indigenous peoples. I finished it while visiting my hometown for Christmas, which happens to be the capital of the Choctaw Nation. There are a couple of interesting chapters on the standards that indigenous people are held to by others, especially related to authenticity and being “too white”.