How to lower your AWS EC2 bill

Because spinning up VMs in the cloud is so easy, it’s equally easy for your monthly bill to scale up as well. What may have started as a charge of a few hundred dollars a month for a few VMs can quickly ratchet up into tens of thousands of dollars.

In larger businesses, I’ve seen this get so bad that management freaks out and moves everything back on-premises, even when the growth was legitimate and not the result of waste or mismanagement.

There is also a tendency for devs and ops folks to treat the financial aspects of their infrastructure and applications as “not my job”, which is an idea that makes my brain hurt from all the angry it creates.




Three tables flipped, and I still can’t write about the “not my job” angle. It will have to wait for another blog post. For now, let’s just assume that you, dear reader, are a minimally-functional adult who understands the concept of owning the things you build.


As you’re building your infrastructure, it’s always a good idea to bake in instrumentation to understand utilization. Most people view this as a must-have for troubleshooting and monitoring, but there’s also a cost angle.

Assuming you already have infrastructure, the first step in lowering your AWS EC2 cost is identifying what you actually need.

You can either roll your own solution for right-sizing your infrastructure, using the performance and inventory data you should already be collecting with AWS (or your cloud provider of choice), or you can use a service like CloudHealth Tech or CloudCheckr that is purpose-built for cost optimization.

The SaaS solutions do a lot of stuff related to cost optimization, but one of the main things they provide is utilization reporting and recommendations.

Maybe you thought you needed 50 m4.2xlarges for your app. Maybe what you really need are 25 t2.mediums (a $14k a month difference). I have seen gaps that large revealed when running utilization reports for the first time.
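As a rough sanity check on that kind of gap, here is a back-of-the-envelope sketch. The hourly rates are assumptions (roughly what on-demand Linux pricing in us-east-1 looked like around the time of writing), so plug in current numbers from the AWS pricing page before trusting the result.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month

# Assumed on-demand hourly rates in USD. These are illustrative only;
# check current AWS pricing for your region.
ASSUMED_HOURLY_RATE = {
    "m4.2xlarge": 0.40,
    "t2.medium": 0.0464,
}


def monthly_cost(instance_type, count):
    """Return the on-demand monthly cost for `count` instances of a type."""
    return ASSUMED_HOURLY_RATE[instance_type] * HOURS_PER_MONTH * count


before = monthly_cost("m4.2xlarge", 50)
after = monthly_cost("t2.medium", 25)

print(f"before: ${before:,.0f}/month")
print(f"after:  ${after:,.0f}/month")
print(f"saved:  ${before - after:,.0f}/month")
```

With those assumed rates, the savings land in the same ballpark as the figure above.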

You may be thinking “someone who understands their app would never let that happen”, and that is true most of the time. The problem is:

  1. Very few people actually understand the apps they run.
  2. Very few people consider the cost of running those apps as being something they own.
  3. Related to #1, many people don’t have the luxury of running apps that they have built themselves. They are handed someone else’s shitty app and told to keep the lights on.

Getting a good grip on actual utilization tends to also help with conversations like “maybe we should spend the time to make our app stateless so we can auto-scale servers” or “maybe it’s time to convert this app to PaaS.”

Note: If you do end up switching a bunch of servers to t2-tier, make sure you monitor CPU credits. This tier has a CPU-usage quota and if you go over that, your server is throttled until the credits rebuild. I’ve seen people flip over to t2s thinking “it’s got the same specs as my m4/5” and then wonder why their high-CPU-usage app is crawling a few hours later.
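To get a feel for how fast credits can drain, here is a minimal sketch. The function name is made up for illustration, and the t2.medium numbers (2 vCPUs, 24 credits earned per hour) are assumptions you should confirm against the AWS documentation. In a real setup you would watch the CPUCreditBalance metric in CloudWatch instead of estimating.

```python
def hours_until_throttled(credit_balance, credits_earned_per_hour, vcpus, utilization):
    """Estimate the hours until the CPU credit balance reaches zero.

    One CPU credit is one vCPU running at 100% for one minute, so an
    instance running at `utilization` (0.0 to 1.0 per vCPU) spends
    vcpus * utilization * 60 credits per hour.

    Returns None if credits are earned at least as fast as they are spent.
    """
    spent_per_hour = vcpus * utilization * 60
    net_drain = spent_per_hour - credits_earned_per_hour
    if net_drain <= 0:
        return None
    return credit_balance / net_drain


# Assumed t2.medium figures: 2 vCPUs, 24 credits earned per hour.
print(hours_until_throttled(credit_balance=72, credits_earned_per_hour=24,
                            vcpus=2, utilization=0.5))
```

An app that sits at 50% on both vCPUs burns through a small balance in a couple of hours, which matches the “crawling a few hours later” experience.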

Buy a baseline

An argument that is normally used to justify hybrid cloud is “we have a baseline on-prem, so we’ll just use the public cloud for burst elasticity”, which is reasonable. However, you can do the same thing in-cloud.

Once you get a grip on your baseline needs, reserved instances (RIs) become an option. It is entirely realistic to cut costs in half with RIs.

RIs take away some flexibility in return for lower instance prices and are similar to financial instruments you see elsewhere in the world, like cell phone plans.

“You can pay month-to-month with no contract, or you can save X% by signing up for a minimum term.”

Unlike a cell phone plan, there is an RI market where you can resell your RIs if you need to get out of them, and there are also lots of options around converting them to different instance sizes. Converting RIs to different instance families is also possible, but a bit trickier.

Note: There are limits around selling your RIs and the marketplace doesn’t have a high volume in many regions, so selling opportunities may be limited. Don’t purchase RIs on the assumption that you’ll definitely be able to resell them.

RI discounts are primarily tied to term (the length of the contract) and percentage paid up front (none, partial, or full). Three-year full-upfront RIs are going to be the cheapest option.

A trick I’ve learned in the past year is that many resellers like SHI and CDW offer financing for AWS RIs. So instead of paying for 1 or 3 years fully upfront, you can amortize that “upfront” cost over a term with your reseller. Obviously, you’re paying interest as part of the financing, but if the decision was between a 10%-off no-upfront RI purchase and a 60%-off full-upfront purchase, paying a few points in interest is a no-brainer.
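Here is a rough sketch of that comparison. The discount percentages are the hypothetical ones from the paragraph above; the on-demand spend, the 6% annual rate, and the 36-month term are assumptions for illustration, not quotes from any reseller.

```python
def financed_total(principal, annual_rate, months):
    """Total paid on a fully amortized loan with fixed monthly payments."""
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        return principal
    payment = principal * monthly_rate / (1 - (1 + monthly_rate) ** -months)
    return payment * months


on_demand_cost = 100_000                       # hypothetical 3-year on-demand spend
no_upfront = on_demand_cost * (1 - 0.10)       # 10% off, paid as you go
full_upfront = on_demand_cost * (1 - 0.60)     # 60% off, paid at purchase
financed = financed_total(full_upfront, annual_rate=0.06, months=36)

print(f"no-upfront RI:          ${no_upfront:,.0f}")
print(f"full-upfront RI:        ${full_upfront:,.0f}")
print(f"full-upfront, financed: ${financed:,.0f}")
```

Even with interest added, the financed full-upfront option comes out far below the no-upfront one under these assumptions.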

Fun fact: RIs can also be purchased for RDS instances and this is almost always an easy win because of how “permanent” your database servers likely are in comparison to the rest of your infrastructure.

Spot instances are your friend

Assuming you’ve purchased a baseline of RIs, any additional instances above that will be either at the “on-demand” or “spot instance” price.

With spot instances, you are effectively bidding on AWS’ excess capacity, so pricing is highly dynamic based on demand. Because ‘spots’ are excess, they will be pulled back into AWS’ inventory when that capacity is needed by other users paying the on-demand or RI rate.

Spot instances are awesome when they’re feasible for what you’re doing (test environments, fault-tolerant apps, EMR processing, like Spark). Where you might get an RI for 50% off, I’ve seen spot instance discounts as high as 80-90%.

Some folks have done an amazing job of analyzing historical spot prices for the types of instances they need and have purchasing algorithms that help them both track price trends and buy & run their instances when it’s cheapest to do so.

Note: A word of caution here – spot price purchasing is something you want to keep an eye on. Because it’s demand-driven, spot pricing can sometimes (though it’s not often) spike above the price of on-demand pricing. So you’ll want to make sure you’ve accounted for that in any automation you build.
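A minimal sketch of that kind of guard is below. The function and parameter names are hypothetical, and the prices are made-up numbers; the only point is that your automation compares the current spot price against a cap tied to the on-demand price before it buys.

```python
def spot_or_on_demand(current_spot_price, on_demand_price, max_fraction=1.0):
    """Pick a purchasing option without ever overpaying for spot.

    `max_fraction` is the most you are willing to pay for spot, expressed
    as a fraction of the on-demand price (1.0 means "never pay more than
    on-demand").
    """
    cap = on_demand_price * max_fraction
    return "spot" if current_spot_price <= cap else "on-demand"


# Made-up prices in USD per hour.
print(spot_or_on_demand(current_spot_price=0.08, on_demand_price=0.40))
print(spot_or_on_demand(current_spot_price=0.55, on_demand_price=0.40))
```

Setting `max_fraction` below 1.0 lets you fall back to on-demand earlier, before the discount gets too thin to be worth the interruption risk.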

You’re not stuck

A common assumption that people make with their EC2 environments is that the cost “is-what-it-is” until the apps running there can be sunset or migrated to PaaS. Fortunately, that is rarely true. By leveraging some financial tools in your config, it’s highly likely you’ll be able to bring the cost of your EC2 environment down.

If you’re working in enterprise, this stuff will make the biz folks love you, and they’re likely the ones who control your salary.

If you’re running a startup, this stuff can make the difference between you being able to hire more people, or even stay in business at all.

None of it’s hard, it just needs to be accounted for.


How to leave a job the right way

Leaving a job is hard, and oddly enough, it gets harder each time you do it.

When you’re new to the workforce, you don’t know enough to appreciate what you might be leaving behind or worry about what you might be going to.

As you get older, that changes. You know how hard it is to build relationships and find your place. You know how stressed out you’ll be for the first six months in the new job, fighting against imposter syndrome and the uncertainty of new territorial politics and personalities.

If you’ve been a manager, you know the heavy feeling in your gut when someone gives you their notice and it’s impossible not to think about making someone else feel that.

You’ll lie awake at night trying to navigate to the “right” decision. You’ll worry about being happy, about your family being happy, about your coworkers and their families being happy. You’ll worry and stress over things you’ve never even thought of before.

It’s painful. It sucks, but hopefully, it makes you handle things the right way. It may even change how you approach jobs altogether because leaving a job the right way starts on your first day of work.

Say “no” to ego

I know people who have given months’ worth of notice when quitting. Most of the time, this was coming from a good place. They cared about how their coworkers would be affected and wanted to make sure business could go on as usual with minimal pain.

It’s impossible for me to not see the role of ego in giving that much notice though. Even if it’s coming from a good place, there’s an element of “The work I do is so important/complex/special/whatever, that it requires weeks/months of additional work to get someone else even minimally ready to take over.”

I say this as someone who has given a month’s notice before. If I’m honest about it in hindsight, that’s exactly the reason I gave so much notice. Well, that, and I hadn’t done the right things to prepare for me leaving along the way.

Generally, if your leaving a job requires weeks’ worth of knowledge transfer sessions and documentation work, you have done something wrong. You haven’t made yourself dispensable.

Prepare to leave every day

Don’t wait until two weeks before you leave to start removing yourself as a critical path to getting things done. Even if you never plan to leave, do you really want that anyway? The only award you can count on getting from being indispensable is being permanently on-call.

  • When you design something, document the high-level design and keep those docs updated when you make changes.
  • When you write code, write it as if you are immediately handing it over to someone else to maintain. And then actually hand it off to someone else instead of making it your baby.
  • When you set up apps and infrastructure, and anything else that has “access”, make sure someone other than you has access too.
  • Communicate. Make sure the team knows what you’re working on and how it works. You’re not bragging, you’re keeping them informed. They’ll appreciate it when they get tasked with taking on the things you’ve worked on.
  • If you’re worried about job security, focus on being good at what you do, not keeping secrets. Being “good” is way more effective at securing your job than acting like a hermit wizard.

The last one is a big one. When you do eventually leave, what feels better? That others miss your input and capabilities, or that they think “I sure wish they would have written more stuff down”?

Where you’ll end up

Doing these things will make your day-to-day job better (you’ll have more time to work on interesting things instead of solving the same problems) and it will make your eventual departure a lot more pleasant.

Imagine a scenario where you give two weeks’ notice and, aside from having meetings with managers who don’t want you to leave, you spend the rest of your time just doing your regular job.

You’re not doing knowledge transfer. You’re not scrambling to document things. You’ve already done all that. So, instead, you get to write a little more code and focus on saying goodbye.

This may not leave you feeling completely at peace with leaving for a new job, but it will take away a lot of anxiety. You’ll leave feeling accomplished and confident that you handled things “the right way”. You’ll have maintained good relationships with your coworkers and bosses. You may even have built a safety net, a place you could go back to if you needed or wanted to.


Hype isn’t a use case

A few months ago, a recruiter sent me a LinkedIn message with a link to a recruitment video. I’ve been seeing more of these lately, but this one was particularly impressive in how quickly it turned me off from the company.

It described their idea of a tech utopia, a place where developers could use whatever technology they desired.

“Read a blog post about something new in the morning. Deploy it to prod in the afternoon.” said a developer vibrating from over-caffeination.

“Pick the tools and languages you want. Every team is using something different.”

“We’re on the cutting edge, innovating like crazy, using tech you won’t see anywhere else.”

Everything about the culture they described seemed terrible. Sure, the people looked happy and engaged, but what the video communicated to me was “This is a portal into a chaotic hellscape of pain and suffering. God help their Ops team.”

These were kids playing with toys, not developers. Look closely, and you’ll see one or two of these folks in most dev groups. They pick their tools based on the latest blog post they’ve read, chasing technologies like a brain-damaged squirrel.

They’re either not in the on-call rotation or are OK with not having a life outside of work. The things they build are always fragile or in some way broken.

It’s a new package manager!

Corralled by a good manager, these guys (and they are always guys) can be solid members of the team who do, in fact, drive innovation and get people thinking about new ways of doing things. Left to their own devices, though, as they often are, pretty soon the building will be on fire. Granted, it may be a spectacular fire.

Sometimes, this shiny-penny chasing is actively encouraged (as at the business in the recruitment video). These are places where leadership and staff have confused tooling innovation with business innovation. They have a culture where no one ever asks “Will rewriting our entire app in Haskell give us a competitive advantage?” (The answer to this is always no.)

Nor do they ask:

  • Who else is using this?
  • Where would we get support?
  • Is it well documented?
  • Does this actually do what we think it will do?

And so on.

I’m not arguing that people shouldn’t get excited about new technologies, just that there needs to be some prudence when picking tools. It’s better to innovate in the business, in the product and processes, not the tooling.

For the business to succeed (and for you to have a job long-term), it’s better to use proven technologies that have had time to bake and develop community support, and whose production problems the crazy early adopters have already figured out. “Well-documented and supported” is way more important than shiny.

So is “right”. Established tools with a lot of hype that don’t fit the problem being solved are just as dangerous as the new-born ones.

If you’re on a team with or managing some tinkerers, ask questions and help guide them towards good decisions. Make them think through scenarios like “I just read about MongoDB and it’s awesome. We should use Mongo!” and arrive at “All of our data is relational. We should NOT use Mongo.”

Help them think like a carpenter. Using an old hammer doesn’t keep you from building an awesome house. In fact, having tools you know and can trust will let you take risks elsewhere and push boundaries you might not have otherwise been able to if all your tools are in 15 different variations of “on fire”.

Play with new tech. I’d even argue that there needs to be sprint time dedicated to testing and playing with new stuff. Keep pushing forward and learning, but for what you put in production, pick the boring stuff, the tech you know you can count on.

Resources like Thoughtworks’ Tech Radar are really useful for figuring out which technologies are in the sweet spot of new enough to be relevant but established enough that you won’t be out on an island. If you’re in the enterprise space, resources like Gartner are also handy.

This is the stuff that separates kids from adults, and the successful from the burnt-out husks. Choosing technologies responsibly isn’t sexy or exciting, but it’s wise, which is discounted far too often. It also shows kindness to your teammates and co-workers – the people who are ultimately going to have to deal with the downstream effects of the decisions you make.

Tech TIL

TIL: How to use NumPy

I’ve been trying to flesh out my Python knowledge and learn more about machine learning (ML) in the process. Most of my day-to-day Python use is focused on text manipulation, API calls, and JSON parsing, so leveling up on ML is more math (specifically stat-related) than I’m used to.

Today I played around with the NumPy Python package a bit and figured out some simple things.

For example, if I wanted to multiply the numbers in two lists with vanilla Python, like this:

a = [1, 2, 3, 4] 
b = [4, 3, 2, 1]

print(a * b)

I’d get TypeError: can't multiply sequence by non-int of type 'list'. I’d have to write something to iterate through each list.
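For comparison, the vanilla “iterate through each list” version might look something like this sketch, which pairs up the elements with the built-in zip():

```python
a = [1, 2, 3, 4]
b = [4, 3, 2, 1]

# Walk both lists in lockstep and multiply each pair of elements.
products = []
for x, y in zip(a, b):
    products.append(x * y)

print(products)  # [4, 6, 6, 4]
```

It works, but it’s a lot of ceremony for something this simple.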

NumPy, on the other hand, can handle this like a champ. And this is probably the simplest thing you could use it for.

import numpy

a = [1, 2, 3, 4]
b = [4, 3, 2, 1]

new_a = numpy.array(a)
new_b = numpy.array(b)

print(new_a * new_b)

> [4 6 6 4]

NumPy really shines when you start dealing with multidimensional lists and stat work.

numpy.array([a, b])

> array([[1, 2, 3, 4],
       [4, 3, 2, 1]])

And then it’s just turtles all the way down. You can slice intersections, calculate standard deviations, and so on. It’s a handy Python package that I knew literally nothing about prior to today and a nice tool to add to the toolbox.

Tech TIL

TIL: How to use list comprehensions in Python

In the past, if I wanted to make a new list by pulling values out of an existing list based on a condition I would have done something like:

def listItems():
    a = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
    new = []
    for num in a:
        if num % 2 != 0:
            new.append(num)  # keep only the odd values
    print(new)

But I figured out that list comprehensions can dramatically compress these functions while still maintaining readability. Here’s an example of a list comprehension that would render the same output as the expanded function above:

def listComp():
    a = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
    new = [num for num in a if num % 2 != 0]
    print(new)


The syntax is a little weird because Python has so little structure to it (“num for num in a…”), but it makes more sense once you notice that the first “num” is the value that lands in the new list. That’s easier to see if you’re building a tuple, where it would be “(1, 2) for num in a…”
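To make the tuple case concrete, here’s a small sketch using the same list. The pairing of each value with its index is just something I picked for illustration:

```python
a = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

# The expression before "for" is what ends up in the new list.
# Here it is a tuple of (position in a, odd value).
pairs = [(index, num) for index, num in enumerate(a) if num % 2 != 0]

print(pairs)  # [(0, 1), (2, 9), (4, 25), (6, 49), (8, 81)]
```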



Design your systems as if they’re already on fire

You’re walking everyone through the postmortem for last night’s outage. You’ve described the issue, you’ve detailed the root cause and remediation, and the team is about to talk through lessons learned and action items.

Someone (probably your manager) asks, “How can we make sure this never happens again?”

Pause here before you respond. There are, at a minimum, two ways to interpret this question:

  1. The literal interpretation – How can we prevent this specific failure from reoccurring?
  2. The pessimistic interpretation – How can we prevent the impact of this failure from reoccurring?

Each interpretation requires a different answer and approach.

The literal interpretation is what I would call the “TSA Strategy”. You’ll be front-loading all your counter-measures in an attempt at preventing the very specific failure that’s just occurred from reoccurring. It’s a rigid approach that doesn’t adapt easily and tends to sprawl, but it may pacify some questioners. “No one will bring a shoe bomb through security ever again! Or at least very rarely!”

The pessimistic interpretation is more in line with the “NASA Strategy”. It assumes some level of failure instead of trying to achieve 100% prevention. You’ll be implementing counter-measures that are more focused on recovery than prevention. Because of that, this approach is adaptable to many different situations, but it may make people uncomfortable. “The rocket engine may stall, but it recovers quickly.”

These approaches also focus on different metrics. The literal interpretation should track mean time between failures (MTBF). In turn, the pessimistic interpretation should track mean time to recovery (MTTR).
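If you haven’t calculated these before, here’s a minimal sketch of both. The incident data is invented, times are plain hour offsets to keep it short, and MTBF is defined here as uptime divided by the number of failures in the window, which is one common convention but not the only one.

```python
def mttr(incidents):
    """Mean time to recovery: average outage length in hours."""
    durations = [end - start for start, end in incidents]
    return sum(durations) / len(durations)


def mtbf(incidents, window_hours):
    """Mean time between failures: uptime in the window per failure."""
    downtime = sum(end - start for start, end in incidents)
    return (window_hours - downtime) / len(incidents)


# Invented data: (outage start, outage end) as hours into a 100-hour window.
incidents = [(10, 12), (50, 51), (90, 93)]

print(f"MTTR: {mttr(incidents):.1f} hours")
print(f"MTBF: {mtbf(incidents, window_hours=100):.1f} hours")
```

The “TSA Strategy” tries to push the second number up; the “NASA Strategy” tries to push the first number down.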

Which interpretation are you going with?

If you’re an NT4 sysadmin in 1997, option 1 all the way! The systems you manage may be large, but they’re not necessarily that complex. Server clustering tech is fairly new, so you’re likely not running any clusters, and your network probably isn’t even connected to the internet. There’s a technological limit to how complex your systems can be.

If you’re an engineer trying to manage cloud-scale infrastructure and applications in 2017, the literal interpretation of your co-worker’s question simply doesn’t work. The systems you manage are too complex to keep piling on preventative measures; you’ll never be able to wrap your arms around it all. There are too many variables to account for.

So you need to be pessimistic. You need to assume that anything and everything can and will fail. Get comfortable with that idea and change your approach to failure.


First, you need to work on building an organizational, or at a minimum, a team consensus that trying to prevent all failure in your systems is insane. Trying to do this all on your lonesome will be crazy-making. So start the internal sales-pitch as quickly as you can.

Second, identify what the “impact” of a system or component failure means. Does it mean 404s for the end-user? Missing transactions? Data loss? What is it you are actually trying to prevent?

Third, you need better visibility into your systems and applications than you likely already have, because the required toolsets have only recently become available (unless you work somewhere that is big enough that you’ve built your own tooling for this). Products like New Relic Insights and Honeycomb are purpose-built for the analysis of complex systems.

Traditional logging and metrics only track the failure conditions you know about. The newer generation of tooling is built around data discovery and anomaly detection, helping you find patterns you wouldn’t have seen otherwise and build mitigating controls and automated adaptations.

Fourth, if you’re not already comfortable with automation (and coding), you need to get there. The infrastructure of complex systems needs to monitor itself and self-heal, but it also needs to be somewhat aware of the applications that are running on top of it so that it can help them self-heal as well. You likely won’t be able to achieve this fully with only the parts that come in the box.

Now that you’ve at least got these things in flight…

An example

If you’ve shifted your thinking from “How do we prevent a 404 from ever occurring?” to “How should the system respond when it begins to detect 404s?”, you’re on the right track. Let’s keep going with this simple scenario.

  • Maybe your system steers users to healthy nodes where 404s aren’t occurring and spins up replacement capacity.
  • Maybe the system detects something downstream is overloaded so it throttles API calls while the downstream component recovers.
  • Maybe the system knows the last version of code deployed wasn’t generating 404s, so it rolls back to that version and sends an alert.
  • Maybe the system sheds load until it recovers as a whole.
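Those options can be sketched as a tiny decision routine. Everything here is hypothetical: the class, the function, the 5% threshold, and especially the order in which responses are tried. It’s only meant to show the shape of “detect, then react”.

```python
from collections import deque


class NotFoundMonitor:
    """Track the share of 404s over the most recent responses."""

    def __init__(self, window=100, threshold=0.05):
        self.statuses = deque(maxlen=window)  # old entries fall off the end
        self.threshold = threshold

    def record(self, status_code):
        self.statuses.append(status_code)

    def error_rate(self):
        if not self.statuses:
            return 0.0
        return sum(1 for s in self.statuses if s == 404) / len(self.statuses)

    def degraded(self):
        return self.error_rate() > self.threshold


def choose_response(degraded, healthy_nodes, downstream_overloaded,
                    last_release_was_clean):
    """Pick a reaction; the priority order is an assumption, not a rule."""
    if not degraded:
        return "do nothing"
    if downstream_overloaded:
        return "throttle API calls"
    if last_release_was_clean:
        return "roll back and alert"
    if healthy_nodes > 0:
        return "steer users to healthy nodes and add capacity"
    return "shed load"


monitor = NotFoundMonitor()
for code in [200] * 90 + [404] * 10:
    monitor.record(code)

print(choose_response(monitor.degraded(), healthy_nodes=3,
                      downstream_overloaded=False,
                      last_release_was_clean=False))
```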

As your org matures in this approach, it might make sense to leverage chaos engineering and intentionally introduce failures into your systems to observe how the system reacts and build preventative and reactive controls to address what you find. “Do we get 404s if we take this service offline? No? Well, it’s the core Nginx router, so that’s weird.”

Note that some of these failure responses are just extensions of traditional prevention tactics. And, to be clear, preventative measures are still needed, they’ve just become more foundational. There should be a base-level of redundancy, but not a never-ending tower of it, which is what most legacy IT departments are currently trying to manage.

There’s a tipping point in systems complexity where reactive measures are a lot more practical and effective than prevention. This has always been true, it’s just that cloud-scale has forced people to actually acknowledge it.

Tech TIL

TIL: How to pass MySQL queries as variables in Bash

Note: I also learned how to handle query blocks with Bash here documents. Most of what I’ve done in the past with MySQL and Bash has been limited to single-line selects, so Yay!, learning.



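# -N drops the column header so only the value is captured.
# -h is needed for the host; a bare argument would be read as a database name.
user_id=$(mysql -N -u"$db_user" -p"$db_pass" -h"$db_server" <<GET_USER_ID
USE main;
SELECT user_id FROM users WHERE username="dave";
GET_USER_ID
)

echo "Dave's User ID is $user_id"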

It’s the small things in life…

Tech TIL

TIL: How to disable core dumps on AWS Linux

I ran across a situation where a previous sysadmin had enabled core dumps on a server, and they were filling up the root volume. The dumps weren’t needed, so I decided to disable them. Problem is, I’ve never really dealt with core dumps because I’ve never had to, so I had to do some Googling.

Here’s the result:
# Only tested with AWS Linux but should work on RHEL, CentOS, Fedora, etc.

echo '*     hard    core    0' >> /etc/security/limits.conf
echo 'fs.suid_dumpable = 0' >> /etc/sysctl.conf
sysctl -p

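The first echo sets a hard core file size limit of 0 for all users, which disables core dump creation (it takes effect at the next login session). The second echo keeps setuid programs from generating core files, and sysctl -p applies the sysctl change to the running kernel.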
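If you only need this for a single Python process, not the whole box, the standard library’s resource module can set the same limit from inside the process. This is a sketch of a per-process alternative, not a replacement for the system-wide settings above, and the function name is made up.

```python
import resource


def disable_core_dumps_for_this_process():
    """Set the soft and hard core file size limits to 0 for this process."""
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
    return resource.getrlimit(resource.RLIMIT_CORE)


print(disable_core_dumps_for_this_process())  # (0, 0)
```

Child processes inherit the limit, and because the hard limit is lowered too, an unprivileged process can’t raise it again.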


How to get started with DevOps

The question that comes up the most when I talk to people who are interested in DevOps is “Where do I start?”

Most of the time, the answer they’re looking for is a list of tools and technologies to learn, so they usually look disappointed when they hear my answer.

If you want to get started in DevOps, if you want to deliver faster and keep up with everything that’s being asked of you without killing yourself in the process, tools and technology are the wrong places to start.

First, you need to change the way you view and manage your work.

Step 1 – Define what matters

What’s the purpose of your work?

Care and feeding of servers, or “developing” might be what you do, but they aren’t reasons in themselves. “Keeping everything working” is almost a reason but mostly in the vein of “I eat to live.”

Chances are, the purpose of your work is a little meta – it’s to support the needs and goals of your employer, to provide value in some way. So what does your employer value? If you don’t know, it’s time to ask and figure out how your work relates.

You’ll arrive at some ideas you can measure and center your work around – things like systems uptime, mean-time-to-resolution, code quality, customer satisfaction, and so on.

This is where you start. Everything going forward is tied to the baselines you set and how you measure your progress against the new metrics you’re now tracking. You’re about to bring some science to your job – observing, measuring, and iterating.

Step 2 – Kanban ALL the things!

Before you start automating and building robots to do all the work, you need to gain visibility to that work.

You don’t have to use kanban, although I would highly recommend it as an easy place to start. (Also, personal kanban is my spirit animal.) But you do need some sort of methodology that gives the entire team visibility to the work backlog and highlights bottlenecks that slow work down or otherwise impact the performance indicators you defined in step one.

This will require that your team starts collaborating, both internally and with other groups.  They may not be used to this. You may not be used to it either, but people have to talk to one another if they aren’t already.

Once the team starts talking about what they’re working on you’ll quickly discover opportunities to quash duplicate work, misplaced work (“Why is our group even handling this?”), and problems that were otherwise going untracked or getting piled on one person.

If you’re a developer and have already been using Scrum or some sort of Agile methodology, some of these concepts might not be completely new to you. Even if that is the case, you may not be accounting for all of the backlog or might only be focused very narrowly on project work.

Step 3 – Automate and iterate

By now you should have a better idea of your team and employer’s goals as well as the work the team needs to accomplish. After dropping items from the backlog that are 1.) of little to no value, and/or 2.) never actually going to be worked on (it’s important to be honest about this), it’s time for some experiments.

Use your kanban to identify the biggest bottlenecks that slow down processing of the backlog and attack them with process and automation. It will be tempting to attack low hanging fruit for easy wins, but it’s important to treat the bottlenecks as if nothing else matters (because it kind of doesn’t).
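If your board can export its cards, finding the worst bottleneck can be as simple as adding up how long cards have been waiting in each column. This is a sketch with invented column names and numbers; “total days waiting” is just one assumed way to score a bottleneck.

```python
from collections import defaultdict


def biggest_bottleneck(cards):
    """Return the column where cards have piled up the most waiting days.

    `cards` is a list of (column name, days the card has sat there).
    """
    waiting = defaultdict(int)
    for column, days_waiting in cards:
        waiting[column] += days_waiting
    return max(waiting, key=waiting.get)


# Invented board export.
cards = [
    ("Code review", 4),
    ("Waiting on Ops", 9),
    ("Code review", 2),
    ("Waiting on Ops", 12),
    ("Testing", 3),
]

print(biggest_bottleneck(cards))  # Waiting on Ops
```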

Note: Quality issues are bottlenecks and should be treated as such. If people have to constantly spend time fixing crappy code, failed builds, or similar issues – that’s getting in the way of other work.

It’s usually OK to throw some smaller improvements into your sprints that fit around other work but automating away the problems that significantly delay work (like waiting for Ops to manually build out environments) is where the focus should be. Most of the other stuff is just distraction.

As you put different processes and automation in place, make note of it and track those events against your performance indicators. Measure. Iterate. Measure again.

Step 4 – Keep learning

It’s hard for me to imagine anyone succeeding with DevOps without regularly dedicating some time to reading. For most, working via DevOps practices will feel a little foreign at first, so the more you know about what others are doing and the ideas that inspired DevOps, the more comfortable you’ll be.

There are a ton of great books that cover DevOps concepts and related topics like lean manufacturing, interpersonal relationships(!), and automation.

Here are a few:

DevOps is a buzzword that gets thrown around to the point that a lot of people think it’s just a fad or marketing-speak. It definitely can be, depending on what the person saying “DevOps” means.

In my experience though, the people and process components of DevOps are really powerful and can change your day-to-day work and make things better. Unfortunately, these are the components of DevOps that usually get ignored. People tend to chase the shiny penny of automation without all the front end work that ensures they’re automating the right things.

If you’re interested in “DevOpsing”, I can’t stress enough how important it is to start with steps 1 and 2 before doing anything else. If you do, I can almost guarantee you’ll achieve some level of success, even if it’s just in how you handle your own work.

Tech TIL

TIL: How to get Redshift to access S3 buckets in a different region

While trying to get a Spark EMR job running, I encountered this error on a Spark step that copied data from Redshift to S3.

error: S3ServiceException:The bucket you are attempting to access must be addressed using the specified endpoint.

I’ve seen issues in the past with S3 buckets outside of US-East-1 needing to be targeted with region-specific URLs for REST, but had not seen anything similar for s3:// targeted buckets.

This got me looking at Hadoop file system references, none of which are helpful because EMR rolls its own proprietary file system for Hadoop S3 access. So Hadoop’s recommended s3a:// (which is fast and resilient – and supports self-discovering cross-region S3 access!) does not work on EMR. Your only option is s3://, which appears to be region-dumb.

The fix turns out to be simple: you just have to pass the bucket region (i.e. us-west-2) to the Spark step as a separate argument.

… simple, but annoying, because the steps worked in a pre-prod environment (in a different region), so it wasn’t immediately apparent what was causing the failure, which was buried in the logs.