Dirty secrets of DevOps

I’ve read dozens of DevOps success stories, tales of bold IT leaders transforming their business and steering big corporate ships into the future. It’s hard to avoid all these stories about “DevOps transformation” and how some guy named Jim pulled the IT department out of the stone age. The trades, blogs, and conference presentations are filled with them.

No one talks about the failures, though, and very few even write about their struggles implementing DevOps. It’s all sunshine and rainbows, which sucks, because that isn’t real.

Success teaches us little other than survivorship bias and how to feel bad about what we haven’t achieved. Failure and hardship are where we learn. That’s where the good, meaty stuff is.

So here are a few dirty secrets of DevOps.

Most companies that say they are “doing DevOps”, aren’t.

Because of all the success stories (real or imagined) that have wiggled their way into the minds of CXOs, “we should be doing DevOps” became an empty corporate directive that inspired thousands of executives to start calling their IT infrastructure groups “DevOps” instead of “Infrastructure” or “Systems” or “Operations”.

Unfortunately, this seems to often play out as “We’ve renamed the group, so we’re going to be letting most of the team go, because we’re DevOpsing all the things now and being lean and mean. Also, the developers are still a separate group and they’ll be throwing more stuff over the wall to you.”

So you end up with a few overworked, traditional Ops folks trying to keep the wheels on the bus with zero changes to the way work is managed or how the IT group functions day-to-day. Their manager is shouting down the hall about automation while the poor Ops team is trying to pivot a SAN-installing, server-racking skill set into something that looks like a cave-man version of coding.

The only metric that improves in this scenario is “personnel cost”, and only temporarily because burnout and churn spike, driving up staffing costs a few quarters down the road. But it looks good long enough for someone to say “See, we did it!” and feel validated.

Even if you get the “IT folks” on board, getting plugged into the business so DevOps practices can benefit other groups and the overall bottom line comes with its own challenges.

Fixing this issue requires a lot of skill managing upward and sideways. Oftentimes it’s not worth trying to change, and moving on is a better option. Your mileage may vary.

Implementing DevOps is Really Fucking Hard™

DevOps is all about people and process, getting everyone working together to do fewer dumb things, and smart things faster.

Historically, getting people to work together and not be jerks to one another has been a bit of a challenge. Humans achieve awesome things when we collaborate (like spaceships and lasers), but we usually suck at working together. Because of that, I’m always impressed when I come across an excellent people-whisperer, someone who can motivate different groups to work towards a common goal without burning down each other’s village.

Problem is, there’s like five of those people on the face of the planet and chances are, they don’t work for your company. You might have lucked out and have one or two folks who are kind of OK at people-wrangling and peace-keeping, but most businesses (especially the bigger ones) seem about three seconds and a passive-aggressive sticky note in the office kitchen away from an all-out blood bath.

Assuming you can get people working together, you’re now faced with the challenge of implementing process. You probably have one person on your team who loves process. Everyone else hates process and that person.

You’re never finished

There’s no such thing as “we achieved DevOps”. It’s a practice, like healthy living or Buddhism. There has to be a champion (or champions) on your team who pushes every day to make things better.

When someone talks about DevOps success, what they’re really talking about is achieving flow, that there is a functional work process in place that is continually measured and improved upon. It’s an ouroboros value pipeline.

That’s not something you can arrive at and stop tending to. Without constant care and feeding, the processes you worked so hard to implement will start to die off.

No champion, no DevOps.

All that being said…

None of this means that DevOps isn’t worth doing, just that you need to be realistic about the challenges you’re going to face. I’ve leaned on hyperbole pretty hard to swing the pendulum away from sunshine and rainbows, because reality is somewhere in the middle.

Getting Dev and Ops (and Security), groups who have traditionally walked down the hall waving their middle fingers at each other, to 1.) work together and 2.) implement and adhere to process, will likely be one of the most difficult things you’ve attempted in your career. You have to put a lot of work into making the right things the easy things, reducing friction wherever you can. Setting mandates or badgering doesn’t work, you have to sell the value.

Getting top-down buy-in (and understanding!) of true DevOps/Agile practices is hard. It requires reorganizing groups and a sustained sales pitch to all involved. The need for this trails off a little once the business and IT staff start seeing value, but expect it to be a long, sustained effort. I’m always a bit dubious when I hear something like “we transformed IT in three months” – either that group really has their shit together, someone is lying, or we’re not using the same dictionary.

For practitioners and evangelists, these are the things we need to start talking about more. There’s a slick consultant vibe woven throughout discussions about DevOps that glosses over the practical and prescriptive. Too many of these conversations focus on high-level what-to-dos and not enough on concrete how-tos and context, especially when it comes to people-centric issues.

Admit your struggles, help others

My worst grades in K-12 and university were all for math classes. Algebra, trig, calc… they all made me feel stupid to a point of hopelessness. I got the basics, but as soon as things started getting abstract, I lost my grip.

I tried and tried and tried, but none of it clicked into place.

I’ve never been one to give up easily on solving problems, but I gave up on math and that steered me away from scientific fields and programming, even though I had interests in both. I got excited about the prospect of working in physics or engineering, but a little voice would always echo out from the corner of my mind telling me:

“You can’t, because you don’t have the math for it, because you’re stupid.”

I never stopped reading about science though. If I couldn’t be a physicist, I would enjoy what I could from the periphery. Beyond articles and topical books, I also read biographies of scientists, one of which ended up changing my life.

Richard Feynman

I was well into my 20s when I picked up a copy of “Surely You’re Joking, Mr. Feynman!”, which is actually more of a collection of transcribed stories than a biography. Feynman was an interesting personality, so different from any other scientist I’d read about. I snapped up every book I could find about him.

He worked on the Manhattan Project and built the foundation for a wide swath of modern quantum physics. He also played bongos, experimented with LSD, and sat around strip clubs drawing sketches of dancers.

Past all that, his approach to math was completely new to me. He intuited and guesstimated, refining once he was in range of the answer. It was a weirdly physical approach to math that lined up with the way my grandfather (a country carpenter with a 5th grade education) worked out building a house.

I focused my reading and found out it’s the way a lot of mathematicians and physicists approach math; I’d just never put it together before. The New Math I grew up on, and that was hammered into my head, simply didn’t fit the way my brain worked.

Feynman’s method of estimation and refinement did.

So happy I could cry

Suddenly, a new world felt open to me. I threw myself into Khan Academy’s math courses, starting with the 1st grade level and blazing through each into college-level calculus.

“It makes sense! It makes sense! It makes sense!!!!!!!”

With my new toolset, I could make sense of all of it and swelled with a new confidence. I started doing math problems for fun, filling notebooks with tiny scribbled numbers.

I also dove into programming, pivoting my career as a sysadmin towards automation and orchestration. It was frustrating at first, not because it was difficult, but because learning to program didn’t require nearly the level of math I thought I needed to get started. Fear of math had been an unnecessary roadblock.

Sharing your struggles is important

No one likes to admit they had a hard time learning something, that they struggled and felt dumb or less-than. If you care about helping others though, sharing those experiences is important.

For much of my life, I thought understanding math was binary: you either got it or you didn’t, because that’s what I saw in everyone around me. It’s what I heard from math teachers, some of whom got frustrated by my lack of understanding.

That all changed when I learned people I admired faced similar challenges and developed workarounds that worked for the way they thought. I’m no mathemagician, but gaining hope from knowing that I wasn’t alone allowed me to build a competency I would have otherwise gone my entire life without having.

Right now, there’s someone out there who is struggling to understand some concept that you also struggled with. They’re feeling hopeless and dumb. They don’t just need a numbered “how-to” or explainer. They need a preamble that says “Hey, I had a hard time with this, and this is how I learned it,” not only because your method might work for them too, but because exposing your vulnerability opens the doors of possibility for them.

If you’ve never written anything or put yourself out in the open in a similar way because you think “There are already 1,000 blog posts explaining this” or “Who cares about my perspective?”, know this: your perspective matters. Your struggles matter. They may not matter to everyone, but they’ll impact someone.

So write and share and be honest. Sitting in a high tower, pretending that everything you know came easily, helps no one. There’s someone out there who will never understand data structures or synchronicity or whatever it is you know (particularly if you struggled with the idea) unless you explain it to them the way you understand it.

Empathy in Dev and Ops

You read through the code. You read it again to make sure you understand what it’s doing. Your left eye starts twitching. You read the code a third time.

“WTF was wrong with the person who wrote this?”

I hate how often I react this way. It’s a quick default that’s hard to reset — immediate annoyance as if the developer or engineer responsible for writing whatever I’ve come across was scattering landmines.

It’s easy to shit on the people who came before you. They’re usually not around to defend themselves or provide context. It’s a common (but terrible) way to prop oneself up and display an illusion of competence. “Look how much I know, and they didn’t!”

It’s much harder to calm down, empathize, and think things through. An initial reaction of “WTF?!” is entirely valid (You’re gonna feel what you’re gonna feel.), but getting stuck on the frustration and not going further is unfair to your predecessors and causes you to miss out on learning.

How? vs. Why?

“How” a problem was solved/band-aided/kicked-down-the-road is usually the root of those frustrations, but the next step is thinking through the “Why” of the solution, which is often the source of useful information.

The code might be stupid, but there’s usually a reason. Maybe:

  • Something stupid upstream brought its stupid with it. Alternatively, something stupid downstream needed more stupid.
  • The dev/engineer was told to do it that way.
  • The dev/engineer was getting pulled in 1000 different directions and needed to make a fast band-aid.
  • The dev/engineer was doing the best they knew how.
  • It’s actually not stupid. You just think you know more than you do.

That doesn’t rule out laziness or malice, but they’re much rarer and shouldn’t be the default assumption. When we run across goofy looking code and configs, we need to respect the constraints and context the person who wrote them faced.

Note: That the person might actually be an idiot is a real, intractable constraint. How would you have fixed that? Yelled “Be smarter!” at them?

Thinking through and learning the “whys” that caused the stupid will help you understand the context of the problem you’re currently facing. You’ll learn about not only technical pitfalls, but cultural ones as well. A lot of the stupid that shows up in code has nothing to do with the technical competence of the person who wrote it and everything to do with their manager or the company at large.

Grace and placing blame

How many times has the past version of you done something stupid that harmed future you? How many times have you looked at something you made a year ago and thought “What was I thinking?”

Like any skill, if you’re not embarrassed by some of the code and configs you’ve written in the past, you’re 1.) an egotistical monster, and 2.) not getting better. Knowing that, allow some grace for yourself and the people who came before you.

When DevOps practitioners talk about establishing culture, one of the big elements they discuss is creating trust by not focusing on placing blame. That idea doesn’t just apply to your immediate co-workers. It applies to those in the past as well. Trashing prior devs/engineers can have the same effect as trashing the people you currently work with; it just adds a caveat to the way the team thinks about trust.

“Well, they’re not trying to throw me under the bus right now, but…”

I can’t say I’ve mastered this skill yet. Sometimes, in moments of frustration, I flat out suck at it. But I’m trying, and that’s kind of the crux of all this. Everyone is trying, no one has arrived, and the more we empathize with the unknown constraints of those who came before us, the better off we’ll be.

An appeal for simple infrastructure architectures

It’s two in the morning. Your phone has vibrated itself off the nightstand — dozens of notifications, none of them particularly helpful.

You stumble into the living room, stubbing your toe on the couch. You curse and hobble over to where you left your laptop.

It wakes from sleep mode. You log in and pull up monitoring. The app is down. Everything is down. The world is on fire and you have no idea where to start throwing water.

Past you screwed present you. Past you thought it would be neat to use containers in production. Past you thought it would be awesome to build a 1000-piece delivery pipeline. Past you bolted every tool you could find onto the world’s dumbest sysadmin robot.

Past you forgot to keep it simple, stupid.

I keep seeing charts of “DevOps” tools that look like the Linnaean tree of life.

While having dozens of options for each step of a config management or a CI/CD pipeline is awesome, it doesn’t come without issues. These aren’t the fault of the tools themselves, but how they are used.

As infrastructure (particularly delivery-focused infra) has become code and tool sets have matured, it has been kneecapped by the same problems that plague software, dependency hell being one of them. Ops engineers tend to forget that every piece added is not one dependency, but a family of dependencies. It’s turtles all the way down, and soon you will find yourself troubleshooting into infinity.

Often the justification for adding more tools is that the current tool isn’t “great” at something. This is usually true, but is it good enough? Nine times out of ten, using one “not great” tool is going to be less of a headache than adding another tool that you’ll have to monitor, troubleshoot, and manage.

Some attempt to build “best of breed” solutions, which is misguided to begin with. There is no “best of breed” for CI/CD or reliability engineering. “Best” is what works for your particular apps and what you can count on to not lose its mind in the middle of the night.

Search Google and you will find architectures that leverage 500 different tools to get code from commit to live production and keep the app running.

You’ll see Puppet stacked on Chef, stacked on Docker, stacked on Kubernetes (I have seriously seen this in the wild.), via CloudFormation templates generated by Troposphere, being fed by Jenkins, Artifactory, and Subversion, stacked on Rundeck, stacked on ServiceNow, plugged into countless other things.

This is insane. Please don’t mirror these architectures. They are the fever dreams of engineers who build automation for the sake of automation. This is how people end up designing in thousands of hidden failure points while trying to stamp out single points of failure.

Most of all, this kind of infrastructure and deployment is not reliable, which defeats the main goal of building automation in the first place. There are too many moving pieces, too many opportunities for something to go wrong. It’s also usually too complex for any one person to really understand.

Infrastructure-as-code also inherited a tendency towards over-abstraction. It’s one thing to use IaaS or PaaS, another to build black-box abstractions that you have to support on your own. There are trade-offs, of course, but they’re rarely considered when there’s too strong a focus on “neat” or “novel”. Containers (which have some excellent use cases) are a good example of a technology often adopted because it’s novel rather than because it’s needed.

To be reliable, your tool set and configurations have to be, not necessarily simple, but as simple as possible.

The Space Shuttles weren’t simple, they required multiple layers of redundancy and had inherent complexity, but I guarantee you NASA engineers weren’t adding extra sprockets just because they read a blog about them one time.

An as-simple-as-possible solution might look like… just Jenkins. It might be Ansible, Jenkins, and CodeDeploy. It might be 10 well-justified tools, but it certainly isn’t 50.

Any pride from an architecture you’ve designed should come from how little you use, not how much. Building simple is hard, way harder than leveraging all the shortcuts that layering tools on top of each other provides. But, unless you want to build monsters that wake you up in the middle of the night, simplicity is required.

The Dreyfus Model of Ops Engineering

You spend the first part of your career implementing simple designs, because that’s all you know how to do. It’s what you learned on a blog. It’s how the senior engineer taught you.

You get frustrated by how long it takes you to do stuff that others around you are flying through. You feel like you’re drawing in crayon.

Then, as you learn, you get faster and bolt on complexity. Checkbox here, change from the default there. You’re getting the hang of this.

You start to feel confident, bordering on cocky. Your diagrams have more lines and boxes. Soon, you’re teaching others and talking about the bleeding edge of the field.

Look at you, you genius, you’re building technology for the ages, even though the old guy in the corner of the room keeps saying “This seems a little complex…” Screw him. He doesn’t understand your brilliance.

Then, maybe, your world slowly gets bigger. Maybe you aren’t so brilliant after all. You start thinking more about consequences and downstream effects, margins and trade-offs. You see your Rube Goldberg machines floating in a sea of chaos.

Your approach changes again.

Maybe one less box would be OK. Maybe we don’t need that line. Your designs begin to look more like the ones you started your career with. You solve for 80% and push back on things that shouldn’t be fixed with technology instead of saying “yes” just so you can come up with something impressive.

You spend more time removing things from your Visio drawings than adding them.

You nod your head and smile when someone draws their precious maze on a white board. You sit in the corner of the room and occasionally ask “Do you think this might be a little too complex?”

Required reading

3 Questions to ask before automating

If you’re a DevOps engineer who is constantly fighting fires and trying to keep your head down, it’s easy to get stuck in a rut of “You ask, I build”. Heck, it’s easy to think that way even when everything is running smoothly.

But that’s not what the job should be, or at least what the job can be. Not asking the right questions leads to more firefighting down the road and smooth operations devolving into mediocrity.

If you have a desire to be better at your job and build truly helpful, effective tools, before any code is written, there are (at least!) three questions that must be answered.

1. Are we solving the right problem?

Unfortunately, this question almost never gets asked, and when it is asked, the answer is often “no”. A Dev or PM will send an email that says “We need to automate X.” and the DevOps engineer will ask a couple of technical questions before starting the build.

If you’re a DevOps engineer and you aren’t asking questions about process and goals, you are a bit-pusher, not an engineer. You have unique insight into how the systems you work on interoperate and feed one another, and you think in terms of systems and dataflows. You have a contribution to make that’s more than writing playbooks and scripts.

If the teams you’re working with aren’t volunteering their goals, you need to ask for them. “What are you trying to accomplish?” Having this context is critical if you want to provide real value. What you were asked to build is often not the right solution to the problem that needs solving.

Before you build tools, build relationships with the developers you work with and get in the habit of asking questions that aren’t just “What version of this plugin do you need?” Ask questions even when you’re offering feedback: “What if we did it this way? Would that work?”

2. Is building a tool the right solution?

If you’re confident in the process, the next question you should ask is “Is building a tool the solution to this problem?” In a lot of cases the answer is “yes”, but thousands of custom tech fixes have been built to solve problems that could have been solved more appropriately with an email, a Google form, or God forbid, two people actually talking to one another.

This is when it helps to get someone in the room who isn’t an engineer and can provide a sanity check against the “when you have a hammer, every problem looks like a nail” problem.

Being competent with Python or Ruby is not an excuse to try and solve every problem with a script. I don’t know how many discussions I’ve sat in on where people were arguing about how to build a custom tech fix and I wanted to scream “JUST HAVE THEM GO SIGN UP FOR A DOCUSIGN ACCOUNT! WE DON’T NEED TO SOLVE THIS PROBLEM WITH CODE!” or something similar.

Find someone who will call you on your BS and tell you when you’re being myopic. That person is your new best friend.

3. Who is this tool for?

Hopefully the answer to this question will come out of defining the process, but that’s not always the case.

If you find that you’re only building tools that you or your DevOps team can use, you have a problem. There are certainly circumstances where having an internal-only tool is appropriate, but the majority of what you build should be usable by the people you’re trying to help.

Don’t build a script that you run on your local machine to deploy servers that the developers ask for. Build a portal that allows the devs to deploy without your involvement. That local script may be a first step, but fight tooth and nail to remove you and your team as a dependency for getting something done.

Some engineers think they’re being helpful by responding to requests with “Sure, I’ll build that.” But if you’re doing the same thing over and over (deployments, updates, etc.) you are not being helpful. Likely, you are slowing everyone down and have the potential to bring the dev process to a complete halt.

If you want to be a SysAdmin, keep on keepin’ on. If you want to be a DevOps engineer, build tools that others can use, then get out of the way.

Originally posted on BestTech.io

Automation is not DevOps

A few years ago, if you had asked me to define DevOps, my answer would have sounded something like “Mumble mumble automation, mumble, automation, mumble mumble, infrastructure-as-code, mumble mumble, strategery.”

Thinking that DevOps equated to automation mostly came from the fact that the DevOps people I talked to and the articles I read spoke almost exclusively about automation, with only indirect references to anything else.

Implementing automation is definitely a part of what it means to be practicing DevOps, but it’s maybe 5-10%.

The reason that so much of what’s been written about DevOps is focused on automation is that it’s easy to write and talk about tools. Relative to implementing a true DevOps practice, automation is the easy part.

What DevOps actually represents

I’ll use Gartner’s definition to provide a common basis:

DevOps represents a change in IT culture, focusing on rapid IT service delivery through the adoption of agile, lean practices in the context of a system-oriented approach. DevOps emphasizes people (and culture), and seeks to improve collaboration between operations and development teams. DevOps implementations utilize technology — especially automation tools that can leverage an increasingly programmable and dynamic infrastructure from a life cycle perspective. — Gartner.com

Notice that automation is the last thing mentioned. The main components of DevOps are people and process.

This is the hard stuff, the boring stuff, the stuff no one likes talking about because it involves dealing with people and managing work. But it’s also the stuff that gets shit done.

Automation is a force-multiplier, a lever. That’s all computers and software are in general — levers. Without people and process, automation is just a rusted-up socket wrench — you can use it as a blunt instrument, but you’re not getting its full utility.

Automation alone can have an impact, but mostly in that it allows you to do stupid stuff faster.


At the heart of a successful DevOps practice is a culture that accepts failure, doesn’t focus blame, and demands collaboration and knowledge sharing.

If your team members are afraid of how their manager reacts when they make a mistake, they will not attempt new ways of doing things and the team and business will not move forward. Full stop.

Screwing up is a necessary part of learning and improving, but most companies have a culture that doesn’t allow even minor failure. I don’t know how many times I’ve heard executives earnestly say “Failure is not an option,” completely missing that they were driving their company into the ground as the industry changed and competitors sprinted past them, having embraced failure as an opportunity to learn and do something new.

Accepting failure doesn’t mean not having high expectations or accepting mediocrity. Accepting failure prevents mediocrity.

The expectation for a DevOps team member should be: “Mistakes happen, don’t try to hide your screw ups. Communicate what happened, work on fixing it, and most of all… learn from it.”

Accepting failure directly feeds into reducing the desire to place blame. Worrying about the repercussions of failure and working to deflect blame take away from doing actual work and fixing the original problem, aside from poisoning the work environment.

When co-workers trust that they’re not going to get thrown under the bus, they collaborate and are more productive. Playing the politics of avoiding and placing blame is a distraction that needs to be snuffed out. If you’re a manager and you allow your team to sit around pointing fingers at one another, shame on you. If you encourage it (and oh, God, have I met those managers), I hope your house gets invaded by bees.

If your goal in implementing DevOps is to speed up delivery and become more agile, you have to aggressively remove roadblocks. Anything that stands in the way of the team collaborating with the business, developers, and each other needs to be bulldozed. Fear, politics, mistrust — gone.

It might feel like wrangling a kindergarten class, but you have to get your team to share. There is no such thing as “too busy to train” or “too busy to document”. If only one person knows how to do a thing, they become a bottleneck that can shut down your DevOps factory.

Knowledge sharing has to be the expectation. If there is someone on your team who can’t be coached to not keep things secret, they need to work somewhere else.


You can have amazing automation and culture and still not be practicing DevOps, because practicing DevOps is almost entirely about process.

If you want to “do DevOps”, process is where you start and will give you a much higher return on investment than the automation that follows. Without process, there will be too much chaos to maintain a healthy culture and any tooling you build will likely be solving the wrong problems.

Start by getting visibility into the work your team is being asked to perform. This isn’t just for managers; the entire team needs insight into the work backlog, if for nothing else than to prevent duplication of work (“I already have a script for that.”) and provide time-saving context (“Doing X will cause Y to fail.”).

If there are team members who won’t share what they’re working on or only offer vague details, that has to end. Letting people slide by with “I’m working on server things…” during your daily standups (and yes, you should be doing standups) doesn’t fly. “I’m doing X, Y, and Z today.” is the answer you’re looking for.

Without that transparency, work goes into a vortex of suck. If work is (or isn’t) being done, the team needs to know, because every unknown status compounds delays, reduces quality, and plants the seeds of distrust. A successful DevOps practice requires accountability, not for the sake of holding someone’s feet to the fire, but just so everyone knows WTF is going on and can plan accordingly.

All this requires commitment and continual reinforcement from teammates and management. There is no “We set up a kanban and no one used it so we stopped.” Failure is an option. Not knowing the answer is an option. Working ad-hoc and being lazy about process is NOT an option.

To summarize, if you are not managing work in an Agile way, using Scrum or something similar, you are not practicing DevOps.

It turns out that there are lots of great resources on the non-automation facets of DevOps. Two good places to start are The Phoenix Project and Effective DevOps. The reviews for Effective DevOps are particularly fun because most of the complaints are “the author didn’t talk about automation”, which means she got it right.

The trick when you’re doing research on DevOps is to not search for “DevOps”, because you’ll mostly get 1.) articles that are about automation, and 2.) “DevOps” job postings that are really just sysadmin jobs.

Instead, read about Agile and Lean. Read about Personal Kanban. Read things that make you a better person who can treat other humans with empathy. That’s what will make you a “DevOps Ninja”. Focusing entirely on automation just makes you a code monkey.

Originally posted on BestTech.io

A week with Puppet

Prior to last week, I hadn’t done much with Puppet. Most of my config management experience is with Microsoft tools and Ansible.

Puppet was a contender the last time I was involved in picking a CM tool, but was ultimately ruled out. Compared to some of the newer CM tools, it felt clunky and, compared to Ansible specifically, the Puppet documentation sucks.

A week in, I can’t say that I’m a fan yet, but I’m starting to see some of Puppet’s strengths more clearly.

So far, the things I like:

Extensibility. It appears that you can integrate pretty much anything with Puppet (and that pretty much everything has been integrated with Puppet).

You don’t have to be a Ruby expert to use it. Enough said.

Model-driven. This is personal preference. I get why people like procedural config, but I feel like I have to spend way more time figuring out what is going on in a Chef cookbook or SCCM/SCOM task sequence than in Puppet or Ansible.

ERB templates. None of the jinja2 crap that Ansible uses.

Some things I don’t like:

No stop on failure. If a step in your Ansible playbook throws an error, the whole playbook stops. I like this, it gives me more confidence that the end state has actually been achieved. I’m sure you can probably integrate something with Puppet to mirror this behavior, but straight out of the box, if something errors, Puppet just keeps rolling.

Random ordering. Ansible plays run from the top of the YAML doc down. Puppet just tries everything in random order unless you explicitly chain tasks together.

Sub-par cloud modules. Ansible’s modules for AWS and Azure are easier to use and seem more mature, which is odd considering how much older Puppet is. Defining and configuring a cloud stack in Ansible is more intuitive to me than what I’ve found with Puppet.

Sometimes hard to follow. As long as you’re just referencing facter data (Puppet’s inventory of system facts) or variables within Puppet manifests, it’s pretty simple to figure out what’s going on. Throw in Hiera, Puppet’s hierarchical key/value lookup tool, which may in turn be referencing other data sources, and things start to get confusing.
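
The ordering complaint above is easier to see in a playbook. Here’s a minimal Ansible sketch (host group, package, and file names are made up for illustration):

```yaml
# Ansible runs these tasks top to bottom, exactly as written.
# In Puppet, the equivalent three resources would be applied in
# whatever order the catalog decides unless you explicitly chain
# them together with require/before or the -> arrow.
- hosts: webservers
  become: true
  tasks:
    - name: Install nginx
      package:
        name: nginx
        state: present

    - name: Deploy the site config
      template:
        src: site.conf.j2
        dest: /etc/nginx/conf.d/site.conf

    - name: Ensure nginx is running
      service:
        name: nginx
        state: started
```

The implicit top-down ordering is one less thing to reason about, which is a big part of why Ansible playbooks read so easily.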

If I were building something from scratch, I still think I’d use Ansible, but (again, only a week in) Puppet is starting to feel like a better option than it has in the past.

Reading things like Lyft’s experience with Puppet and moving away from it have dampened my expectations somewhat, but I’m hopeful I’ll find more to like than dislike as I get further along.

Originally published on BestTech.io

Three days, two tech conferences

It is 104 degrees, 120 on the sidewalk, but less humid than I am used to, which is nice.

As always, Las Vegas’ kaleidoscope of people is disorienting.

It’s one of the most interesting places in the world for people watching, as if the city is desperate to prove Bill Langewiesche’s line: “You should not see the desert simply as some faraway place of little rain. There are many forms of thirst.”

I am always uncomfortable here.

I know myself well enough to know I can’t go straight into a convention and not experience psychic pain. If I just leap into it, the ‘peopling’ parts of my brain throw sparks and scream like twisted steel. So I practiced being social from the time I left my house.

I chatted up the airport employees, my Lyft driver, the hotel staff — everyone who presented me with an opportunity for dialog. It gets easier with each person, but never frictionless.

Part of it is my personality. Part of it is in reaction to the empty (often passive-aggressive) small talk of the South I grew up surrounded by. Part of it is a battle between curiosity and a desire to “mind my own business”, both of which have served me well.

By the time I’ve got my badge and swag bag, I can approximate the social skills of a normal, functioning adult. This trip I actually have two badges, because I am attending two conventions at the same time.

This is stupid. Never do this. It will leave you exhausted and hurting, even without following the Hunter S. Thompson event playbook.

I spend three days bouncing back and forth between Mandalay Bay and the Aria for VMworld and Oktane, respectively.

Walking around VMworld’s vendor floor and listening in on keynotes confirms a thought I had on the plane ride — this will be my last VMworld, there’s nothing here for me anymore. That’s partially because of where I’m focusing my career (cloud) and partially because of VMware.

There are groups within VMware doing interesting things (or at least wanting to), but the company as a whole struggles to execute and is moving much too slowly (and randomly) to remain relevant. Their leadership communicates a new idea of “who VMware is” every year even as the company hasn’t meaningfully aligned to whatever identity they were supposed to be several years prior.

While Pat Gelsinger was telling his audience that the tipping point in enterprise cloud is still five years away, Google’s Diane Greene (one of the founders of VMware, ironically), was telling the Oktane crowd that the tipping point has already come.

Obviously they each have their reasons for spinning a specific vision of the market, but one of those visions is “come on everybody, it’s time to move”; the other is “we’ll catch up with you later”.

Watching other VMworld attendees furiously take notes about news and technical concepts that would be quaint or old hat somewhere like AWS re:Invent seems to support Gelsinger’s take, that VMware is right in slowly building bridges to the future. But they may be building the wrong bridges.

With all the talk of VMware enabling customers to migrate their existing VMs to the cloud, I can’t shake the sense that VMware management either really doesn’t understand cloud or is hoping customers don’t.

Moving VMs from on-prem to cloud or between clouds isn’t a thing people should be doing. It’s OK as a short-term tactic, but migrating VMs is really just moving old problems and creating new ones; yet VMware seems to have focused a significant portion of their latest strategy around the idea.

At this point, it feels like VMware is throwing spaghetti at the wall and hoping the long tail of legacy tech lasts longer than anyone expects. This isn’t just a VMware problem (look at the entire new Dell EMC federation, for example), but it does make me a little sad, because VMware had an opportunity to lead and be more than the shrinking funnel for hardware sales that they’ll become.

I spend most of my time at Oktane, talking to other customers and the more future-focused vendors there.

The first part of Oktane’s opening keynote runs long before they bring Malcolm Gladwell onstage, with what I assume is the hope that he will compress his talk into what remains of the keynote timeslot.

He does not. Malcolm Gladwell does not care about “only having 15 minutes left”. Malcolm Gladwell is honey badger, and provides the spectacle of watching hundreds of people who need to be somewhere else fidget and anxiously figure out what to do.

Gladwell gives a 30-minute talk that leads with a description of childhood leukemia in the 1950s and the explosively hemorrhagic deaths of small children. In this moment I forgive him for his past half-baked theories.

Those extra 15 minutes have the effect of throwing the entire rest of the morning off.

A customer panel I am part of starts with the presenters scrambling to set up their A/V. Nothing works right, and one of the presenters starts the talk only to get flustered and abandon the podium, looking desperately at his co-presenter to save him.

I feel bad for both of them. Fortunately, the heckling is kept to a minimum.

This is the fun stuff you see at smaller conferences.

Okta does a good job of building on the vision they shared at last year’s Oktane, where you could see the rough shape of something coming together.

They want to be the glue that ties SaaS services together and extend their platform further into devices and infrastructure. It’s a good plan, and no one else is really executing on it in a similar way. There are API integration platforms (MuleSoft, Apigee, whatever) that let companies easily plug all their apps together, but Okta is doing it with identity.

They’re becoming the Active Directory of the cloud, which is impressive considering that Microsoft literally makes an Active Directory product for the cloud.

Where VMworld felt like the past struggling to reach into the future, Oktane was the future.

After three days of having to be “on”, I am worn out. I make a last sprint of being social on the car ride to the airport, spending what is left of my socializing fuel. I can’t imagine what the people running vendor booths must feel like after a week of feigning interest and pitching their product.

I used to think the point of going to conventions was to learn things. Then I started going to conventions and figured out there really wasn’t much there to learn outside of customer-led sessions.

It’s easy to wander from session to session, never engaging with anyone, but there’s little value in that. As much as I hate the concept of “networking” as it relates to people, it is necessary.

Establishing relationships with other customers gives you resources to help solve problems and get advice. Strengthening relationships with vendors helps you get things done, especially with the big vendors to which you are by default just an account number.

If it weren’t for forcing myself to be social I wouldn’t know as many escalation managers, product managers, and engineers as I do now. These relationships are invaluable, because they’re the people who can actually help you if you’re trying to get traction with a support ticket or feature request.

Meeting these people is what makes going to cities you don’t particularly like and getting out of your social comfort zone worth it. In many cases, these aren’t just relationships of utility either. You’ll meet a lot of legitimately interesting people doing interesting things. Some of them may even become friends.

Originally published on BestTech.io

Automation isn’t just for scale

A few years back, in a sidebar discussion at a tech conference, one of Netflix’s engineering managers asked me if I was using any automation tools at work.

I said, “Not really. It’s a small environment and we’re not delivering any web apps that require automation for scale.”

She gave me an amused/sympathetic look and replied, “Dealing with scale isn’t the only reason to automate things.” It wasn’t condescension; she was being kind — dropping some knowledge, but I didn’t know how to respond.

A little embarrassed, I mumbled some other excuses for why automation wasn’t a good fit, said ‘nice to meet you’, and wandered off.

I cycled through my excuses, trying to figure out if they were valid. Most of the automation and config management stuff I had used in the past had been imperative, task-sequence-based stuff, like what you’d find in Microsoft System Center. When you have to play the “walk forward five steps, now extend left hand at 30 degrees, close fingers around peanut butter jar” programming game for smaller, legacy environments, it definitely feels “not worth it.”

Days after, the conversation still bugged me. “Why do people automate their infra? Why, really?” Even after reading a ton of articles, blog posts, and whitepapers, I still couldn’t come up with anything that wasn’t ultimately a scale use-case.

I had confirmed my bias, and normally I would have stopped there, but what the Netflix employee said had a ring of truth that I couldn’t let go of. I kept digging.

In order to understand the benefits and justification for automation, I started automating things.

Turns out, that engineering manager had a gift for understatement.

Livestock, not pets

I grew up in a culture of IT where servers, even PCs, were treated as special snowflakes. It took a long time to reinstall Windows + drivers + software, so you did a lot of care, feeding, and troubleshooting to make sure you didn’t have to start over from scratch.

We named servers after hobbits and constellations. We got attached to them and treated each like a pet.

“Bilbo-01 just crashed?! NOOOOOOO!”

In some ways, virtualization worsened that philosophy. Things were more abstracted, but not enough to force a mindshift. You could now move your pet servers between different hardware, reducing the reasons you would have to rebuild a particular server. At great cost, effort, and risk (“You can never patch my preciousssss.”), there are businesses running VMs that are old enough to drive.

So we ended up with thousands of VMs running thousands of apps that were set up by people who have since retired, switched jobs 10 times, or stayed and now act like fancy wizards, holding their knowledge tight to their chests.

Automation is the documentation

Let’s tackle the issue of tribal and secret knowledge first.

A big component of DevOps (and the Lean concepts that inspire it) is identifying and removing bottlenecks. Sometimes those bottlenecks are people. This doesn’t mean you have to get rid of people, but you do need to (where possible) remove any one individual as a core dependency for getting something done.

“Bob is the only person who knows how to install that app.”

“Those are Jane’s servers, you’ll have to check with her.”

“We can’t change any of this because no one knows how it works.”

At the end of the day, this is a scale problem. It’s scaling your IT to be larger than one person. Part of the solution to this problem is cross-training, but automation can also help (and prevent future stupidity).

If you use a configuration tool like Ansible or Chef, the playbooks/cookbooks become the documentation for the environment. They detail dependencies, configuration changes, and service hooks that were realistically never going to be documented otherwise. If you’ve subscribed to a declarative model of automation, the playbooks not only detail what the app stack should look like— if they’re run again, they can enforce that the stack matches what’s in the playbook.
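
As a sketch of what “the playbook is the documentation” looks like in practice (the app, user, and package names here are hypothetical):

```yaml
# Anyone reading this play learns that the app needs a Python runtime,
# a dedicated service account, and a systemd unit - dependencies that
# realistically would never have made it into a wiki page.
- hosts: app_servers
  become: true
  tasks:
    - name: Install runtime dependencies
      package:
        name: [python3, python3-pip, git]
        state: present

    - name: Create the service account
      user:
        name: reportsvc
        system: true

    - name: Deploy the systemd service unit
      template:
        src: reportsvc.service.j2
        dest: /etc/systemd/system/reportsvc.service
      notify: restart reportsvc

  handlers:
    - name: restart reportsvc
      systemd:
        name: reportsvc
        state: restarted
        daemon_reload: true
```

Run it again later and it doesn’t just describe the stack — it enforces that the stack still looks like this.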

Change control

Things generally break because something changed. Maybe it’s a hardware or network failure. Maybe the software is buggy and there was a memory overrun or a cascading service failure. Maybe somebody touched something they shouldn’t have.

In olden times, a sysadmin would be tasked with troubleshooting the broken thing, wasting hours on Google searches and trial and error. Meanwhile, the app is down.

If you’re automating your infrastructure, that’s less of a thing. App stopped working? Re-run the playbook for the stack. Want to know why the app stopped working? Look at your run logs. Troubleshooting is still needed sometimes, but there is a lot less fire fighting when you can push a simple reset button to get things back up and running. Turn it off and on again.

For approved changes, automation requires that the changes be well defined, which is a big positive that helps everyone know what’s happening and what to expect.

This type of state enforcement could equally be considered a security measure. Some people schedule plays that run through app stacks and repair/report anything that doesn’t match the expected norm.

NO MORE (or maybe less) PATCHING!

Not everyone is able to get there, but having fully automated stacks often means you can do away with OS patching. Just rebuild the stack once a month with the newest patched OS image. Boom!

If you do have to patch, you can significantly reduce your patching and service confirmation work by building the patch installs, reboots, and health checks into your automation. This helps prevent the post-patch-night “My app doesn’t work.” emails.
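
A patch play along those lines might look something like this (a sketch assuming Debian-based hosts; the host group and health-check URL are made up):

```yaml
# Hypothetical patch run: apply updates, reboot, then verify the app
# actually came back before moving to the next host.
- hosts: app_servers
  become: true
  serial: 1                  # patch one host at a time
  tasks:
    - name: Apply all pending updates
      apt:
        upgrade: dist
        update_cache: true

    - name: Reboot and wait for the host to come back
      reboot:
        reboot_timeout: 600

    - name: Health check - the app answers on its port
      uri:
        url: http://localhost:8080/health
        status_code: 200
      register: health
      until: health.status == 200
      retries: 5
      delay: 10
```

If the health check fails, the run stops before the next host gets touched — which is exactly the confirmation work you used to do by hand the morning after patch night.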

Fewer backups

Even with de-dupe, I can’t imagine how many petabytes of backup data are made up of OS volumes and full VMs. If you’re automating deployment and config management, the scope of what you need to back up is greatly decreased (so is your time to recover).

You’ll really just be concerned with backing up application data. Other than that, you can make compute and the VMs your app runs on disposable. So you’ll just have to worry about having your playbooks with configs in version control and some method to backup databases and storage blobs.

This rolls into DR and failover as well. In many instances, automation will enable you to do away with failover systems. Depending on your SLAs, a recovery plan could be as simple as “re-run the playbook with a different datacenter target.”
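
That kind of recovery can be as small as a variable swap. A sketch, assuming AWS and the amazon.aws collection (region names and the AMI ID are placeholders):

```yaml
# The same play builds the stack in whichever region you pass in.
# Failover is just: ansible-playbook site.yml -e "region=us-west-2"
- hosts: localhost
  vars:
    region: us-east-1        # overridden with -e during a failover
  tasks:
    - name: Ensure the app instances exist in the target region
      amazon.aws.ec2_instance:
        region: "{{ region }}"
        name: "app-{{ region }}"
        image_id: ami-0123456789abcdef0   # placeholder AMI
        instance_type: t3.small
        count: 2
```

Point the play at a different datacenter, restore the application data, and you’ve “failed over” without maintaining a warm standby.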

Integration tests… for infrastructure

If you truly are treating your infrastructure as code, you can write unit and integration tests for it that go past “well, the server responds to ping”. You can also deploy into test environments very easily and run those environments more cheaply, because you don’t have to maintain 1:1 infra full-time.

Turns out, if you make testing easier, people actually test things and you end up with better infrastructure.
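
A test that goes past “responds to ping” can live right in a play. A sketch (the service name and page content are hypothetical):

```yaml
# A smoke-test play: assert facts about the deployed stack instead of
# just pinging it.
- hosts: webservers
  tasks:
    - name: Gather service state
      service_facts:

    - name: Assert nginx survived the deploy
      assert:
        that:
          - ansible_facts.services['nginx.service'].state == 'running'

    - name: Fetch the homepage
      uri:
        url: http://localhost/
        return_content: true
      register: homepage

    - name: Assert the page isn't an error stub
      assert:
        that:
          - "'Welcome' in homepage.content"
```

Wire something like this into the end of every deploy and “it deployed but doesn’t work” stops being a thing you find out from users.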

This stuff is important

I get that none of these things feel very sexy, but in practice, they are game changing. As you start automating, you’ll discover that your infrastructure doesn’t work exactly like you thought it did, you’ll figure out what different apps actually need, and you’ll pull the weight of being the only person that knows something about a particular server/app off of your shoulders.

Some people like keeping secrets. They think being the only person who can do something gives them job security.

Those people are idiots. Maybe they will keep their job, but that’s not a good thing. They’ll never advance, never do anything more interesting than their current responsibilities.

Automating your infrastructure, opening up the secret knowledge to the entire team and doing away with the idea of being a hero who fights constant fires, is how you free yourself up to do better things. So build the robot, let it take over your job, and keep peeling all the layers of the onion to find work that’s more meaningful and interesting than installing patches, troubleshooting IIS, and getting griped at because “the server” is down.

You don’t have to work for a web company or be in the cloud to do this stuff (although some of the cloud toolsets are better). If you have even a small number of servers, it’s worth it. You don’t need “scale”, you just need a desire for your infrastructure not to suck.

Originally posted on BestTech.io