An appeal for simple infrastructure architectures

It’s two in the morning. Your phone has vibrated itself off the nightstand — dozens of notifications, none of them particularly helpful.

You stumble into the living room, stubbing your toe on the couch. You curse and hobble over to where you left your laptop.

It wakes from sleep mode. You log in, pull up monitoring. The app is down. Everything is down. The world is on fire and you have no idea where to start throwing water.

Past you screwed present you. Past you thought it would be neat to use containers in production. Past you thought it would be awesome to build a 1000-piece delivery pipeline. Past you bolted every tool you could find onto the world’s dumbest sysadmin robot.

Past you forgot to keep it simple, stupid.


I keep seeing charts of “DevOps” tools that look like the Linnaean tree of life.

While having dozens of options for each step of a config management or CI/CD pipeline is awesome, it doesn’t come without issues. Those issues aren’t the fault of the tools themselves, but of how they are used.

As infrastructure has become code and tool sets have matured, infrastructure (particularly delivery-focused infra) has been kneecapped by the same problems that plague software, dependency hell chief among them. Ops engineers tend to forget that every piece added is not one dependency, but a family of dependencies. It’s turtles all the way down, and soon you will find yourself troubleshooting into infinity.

Often the argument for adding another tool is that the current tool isn’t “great” at something. This is usually true, but is it good enough? Nine times out of ten, using one “not great” tool is going to be less of a headache than adding another tool that you’ll have to monitor, troubleshoot, and manage.

Some attempt to build “best of breed” solutions, which is misguided to begin with. There is no “best of breed” for CI/CD or reliability engineering. “Best” is what works for your particular apps and what you can count on to not lose its mind in the middle of the night.

Search Google and you will find architectures that leverage 500 different tools to get code from commit to live production and keep the app running.

You’ll see Puppet stacked on Chef, stacked on Docker, stacked on Kubernetes (I have seriously seen this in the wild), via CloudFormation templates generated by Troposphere, being fed by Jenkins, Artifactory, and Subversion, stacked on Rundeck, stacked on ServiceNow, plugged into countless other things.

This is insane. Please don’t mirror these architectures. They are the fever dreams of engineers who build automation for the sake of automation. This is how people end up designing in thousands of hidden failure points while trying to stamp out single points of failure.

Worst of all, the resulting infrastructure and deployments are unreliable, which defeats the main goal of building automation in the first place. There are too many moving pieces, too many opportunities for something to go wrong. It’s also usually too complex for any one person to really understand.

Infrastructure-as-code also inherited a tendency towards over-abstraction. It’s one thing to use IaaS or PaaS; it’s another to build black-box abstractions that you have to support on your own. There are trade-offs, of course, but they’re rarely considered when there’s too strong a focus on “neat” or “novel”. Containers (which have some excellent use cases) are a good example: they solve real problems, but they also bring along an orchestrator, an image registry, and an overlay network that you now have to understand and support yourself.

To be reliable, your tool set and configurations don’t have to be simple, but they do have to be as simple as possible.

The Space Shuttles weren’t simple; they required multiple layers of redundancy and carried inherent complexity. But I guarantee you NASA engineers weren’t adding extra sprockets just because they read a blog about them one time.

An as-simple-as-possible solution might look like… just Jenkins. It might be Ansible, Jenkins, and CodeDeploy. It might be 10 well-justified tools, but it certainly isn’t 50.
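For a concrete picture, here’s a minimal sketch of what “just Jenkins” (plus Ansible) could look like: one declarative pipeline that builds, tests, and deploys. The make targets, the inventory path, and the deploy.yml playbook are hypothetical stand-ins for whatever your app actually uses.

    // Jenkinsfile: the entire delivery pipeline, visible in one place
    pipeline {
        agent any
        stages {
            stage('Build') {
                steps { sh 'make build' }   // hypothetical build target
            }
            stage('Test') {
                steps { sh 'make test' }    // fail here and nothing ships
            }
            stage('Deploy') {
                // hypothetical Ansible playbook and inventory; swap in
                // CodeDeploy or a shell script if that fits your app better
                steps { sh 'ansible-playbook -i inventory/prod deploy.yml' }
            }
        }
    }

Three stages, two tools, and every failure point lives in that one file. When something breaks at two in the morning, there are only so many places to look.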

Any pride from an architecture you’ve designed should come from how little you use, not how much. Building simple is hard, way harder than leveraging all the shortcuts that layering tools on top of each other provides. But, unless you want to build monsters that wake you up in the middle of the night, simplicity is required.

The Dreyfus Model of Ops Engineering

You spend the first part of your career implementing simple designs, because that’s all you know how to do. It’s what you learned on a blog. It’s how the senior engineer taught you.

You get frustrated by how long it takes you to do stuff that others around you are flying through. You feel like you’re drawing in crayon.

Then, as you learn, you get faster and bolt on complexity. Checkbox here, change from the default there. You’re getting the hang of this.

You start to feel confident, bordering on cocky. Your diagrams have more lines and boxes. Soon, you’re teaching others and talking about the bleeding edge of the field.

Look at you, you genius, you’re building technology for the ages, even though the old guy in the corner of the room keeps saying “This seems a little complex…” Screw him. He doesn’t understand your brilliance.

Then, maybe, your world slowly gets bigger. Maybe you aren’t so brilliant after all. You start thinking more about consequences and downstream effects, margins and trade-offs. You see your Rube Goldberg machines floating in a sea of chaos.

Your approach changes again.

Maybe one less box would be OK. Maybe we don’t need that line. Your designs begin to look more like the ones you started your career with. You solve for 80% and push back on things that shouldn’t be fixed with technology instead of saying “yes” just so you can come up with something impressive.

You spend more time removing things from your Visio drawings than adding them.

You nod your head and smile when someone draws their precious maze on a white board. You sit in the corner of the room and occasionally ask “Do you think this might be a little too complex?”
