Categories
Tech TIL

TIL: How to use NumPy

I’ve been trying to flesh out my Python knowledge and learn more about machine learning (ML) in the process. Most of my day-to-day Python use is focused on text manipulation, API calls, and JSON parsing, so leveling up on ML involves more math (specifically stats) than I’m used to.

Today I played around with the NumPy Python package a bit and figured out some simple things.

For example, if I wanted to multiply the numbers in two lists with vanilla Python, like this:

a = [1, 2, 3, 4]
b = [4, 3, 2, 1]

print(a * b)

I’d get TypeError: can’t multiply sequence by non-int of type ‘list’. I’d have to write something to iterate through each list.

NumPy, on the other hand, can handle this like a champ. And this is probably the simplest thing you could use it for.

import numpy

a = [1, 2, 3, 4]
b = [4, 3, 2, 1]

new_a = numpy.array(a)
new_b = numpy.array(b)

print(new_a * new_b)


> [4 6 6 4]

NumPy really shines when you start dealing with multidimensional lists and stat work.

numpy.vstack((new_a, new_b))

> array([[1, 2, 3, 4],
       [4, 3, 2, 1]])

And then it’s just turtles all the way down. You can slice intersections, calculate standard deviations, and so on. It’s a handy Python package that I knew literally nothing about prior to today and a nice tool to add to the toolbox.
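For example, here’s a quick sketch of the slicing and standard deviation bits, using the same stacked array from above (the exact slice is just for illustration):

import numpy

new_a = numpy.array([1, 2, 3, 4])
new_b = numpy.array([4, 3, 2, 1])
stacked = numpy.vstack((new_a, new_b))

# Grab the middle two columns of both rows
print(stacked[:, 1:3])

# Standard deviation across the whole array, then per column
print(stacked.std())
print(stacked.std(axis=0))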

Categories
Tech TIL

TIL: How to use list comprehensions in Python

In the past, if I wanted to make a new list by pulling values out of an existing list based on a condition, I would have done something like:

def listItems():
    a = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
    new = []
    for num in a:
        if num % 2 != 0:
            new.append(num)
    print(new)

But, I figured out via https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions that list comprehensions can dramatically compress these functions while still maintaining readability. Here’s an example of a list comprehension that would render the same output as the expanded function above:

def listComp():
    a = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
    new = [num for num in a if num % 2 != 0]
    print(new)

Boom!

The syntax reads a little oddly at first because there’s so little structure to “num for num in a…”, but it makes more sense when the expression builds a tuple, where it would read something like “(num, num ** 2) for num in a…” (the parentheses around the tuple are required).
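Here’s a quick sketch of that tuple form, reusing the list from above (the (num, num ** 2) pairing is just an illustration):

def listTuples():
    a = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
    # Parentheses are required when the expression is a tuple
    pairs = [(num, num ** 2) for num in a if num % 2 != 0]
    print(pairs)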

 

Categories
Tech

Design your systems as if they’re already on fire

You’re walking everyone through the postmortem for last night’s outage. You’ve described the issue, you’ve detailed the root cause and remediation, and the team is about to talk through lessons learned and action items.

Someone (probably your manager) asks, “How can we make sure this never happens again?”

Pause here before you respond. There are, at a minimum, two ways to interpret this question:

  1. The literal interpretation – How can we prevent this specific failure from recurring?
  2. The pessimistic interpretation – How can we prevent the impact of this failure from recurring?

Each interpretation requires a different answer and approach.

The literal interpretation is what I would call the “TSA Strategy”. You’ll be front-loading all your countermeasures in an attempt to prevent the very specific failure that’s just occurred from recurring. It’s a rigid approach that doesn’t adapt easily and tends to sprawl, but it may pacify some questioners. “No one will bring a shoe bomb through security ever again! Or at least very rarely!”

The pessimistic interpretation is more in line with the “NASA Strategy”. It assumes some level of failure instead of trying to achieve 100% prevention. You’ll be implementing countermeasures that are focused more on recovery than prevention. Because of that, this approach adapts to many different situations, but it may make people uncomfortable. “The rocket engine may stall, but it recovers quickly.”

These approaches also focus on different metrics. The literal interpretation should track mean time between failures (MTBF). In turn, the pessimistic interpretation should track mean time to recovery (MTTR).
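To make the difference concrete, here’s a quick sketch using the standard definitions of both metrics (all the numbers are made up):

total_hours = 30 * 24  # a 30-day window
incident_downtime_hours = [0.5, 2.0, 0.25, 1.25]  # four made-up incidents

uptime_hours = total_hours - sum(incident_downtime_hours)

# MTBF: how long, on average, things run between failures
mtbf = uptime_hours / len(incident_downtime_hours)

# MTTR: how long, on average, it takes to recover from a failure
mttr = sum(incident_downtime_hours) / len(incident_downtime_hours)

print(f"MTBF: {mtbf:.1f} hours, MTTR: {mttr:.2f} hours")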

Which interpretation are you going with?

If you’re an NT4 sysadmin in 1997, option 1 all the way! The systems you manage may be large, but they’re not necessarily that complex. Server clustering tech is fairly new, so you’re likely not running any clusters, and your network probably isn’t even connected to the internet. There’s a technological limit to how complex your systems can be.

If you’re an engineer trying to manage cloud-scale infrastructure and applications in 2017, the literal interpretation of your co-worker’s question simply doesn’t work. The systems you manage are too complex to keep piling on preventative measures; you’ll never be able to wrap your arms around it all. There are too many variables to account for.

So you need to be pessimistic. You need to assume that anything and everything can and will fail. Get comfortable with that idea and change your approach to failure.

Prerequisites

First, you need to build an organizational consensus, or at a minimum a team consensus, that trying to prevent all failure in your systems is insane. Trying to do this all on your lonesome will be crazy-making, so start the internal sales pitch as quickly as you can.

Second, identify what the “impact” of a system or component failure means. Does it mean 404s for the end-user? Missing transactions? Data loss? What is it you are actually trying to prevent?

Third, you need better visibility into your systems and applications than you likely have today. The required toolsets have only recently become available (unless you work somewhere big enough to have built your own tooling for this). Products like New Relic Insights and Honeycomb are purpose-built for the analysis of complex systems.

Traditional logging and metrics only track the failure conditions you know about. The newer generation of tooling is built around data discovery and anomaly detection, helping you find patterns you wouldn’t have seen otherwise and build mitigating controls and automated adaptations.

Fourth, if you’re not already comfortable with automation (and coding), you need to get there. The infrastructure of complex systems needs to monitor itself and self-heal, but it also needs to be somewhat aware of the applications running on top of it so that it can help them self-heal as well. You likely won’t be able to achieve this fully with only the parts that come in the box.

Now that you’ve at least got these things in flight…

An example

If you’ve shifted your thinking from “How do we prevent a 404 from ever occurring?” to “How should the system respond when it begins to detect 404s?”, you’re on the right track. Let’s keep going with this simple scenario.

  • Maybe your system steers users to healthy nodes where 404s aren’t occurring and spins up replacement capacity.
  • Maybe the system detects something downstream is overloaded so it throttles API calls while the downstream component recovers.
  • Maybe the system knows the last version of code deployed wasn’t generating 404s, so it rolls back to that version and sends an alert (see the sketch after this list).
  • Maybe the system sheds load until it recovers as a whole.

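As a rough sketch of that rollback bullet (the threshold and the hooks into metrics, deploys, and alerting are all hypothetical stand-ins, not a real design):

import time

ERROR_RATE_THRESHOLD = 0.05   # made-up threshold: 5% of requests 404ing
CHECK_INTERVAL_SECONDS = 30

def current_404_rate():
    # Hypothetical hook into your metrics pipeline; stubbed out here
    return 0.0

def roll_back_last_deploy():
    # Hypothetical hook into your deploy tooling
    pass

def alert_humans(message):
    # Hypothetical hook into your paging system
    print(message)

while True:
    rate = current_404_rate()
    if rate > ERROR_RATE_THRESHOLD:
        # Assume the failure has already happened and recover automatically,
        # then tell a human what the system did about it
        roll_back_last_deploy()
        alert_humans(f"404 rate hit {rate:.1%}; rolled back last deploy")
    time.sleep(CHECK_INTERVAL_SECONDS)
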
As your org matures in this approach, it might make sense to leverage chaos engineering and intentionally introduce failures into your systems to observe how the system reacts and build preventative and reactive controls to address what you find. “Do we get 404s if we take this service offline? No? Well, it’s the core Nginx router, so that’s weird.”

Note that some of these failure responses are just extensions of traditional prevention tactics. And, to be clear, preventative measures are still needed; they’ve just become more foundational. There should be a base level of redundancy, but not the never-ending tower of it that most legacy IT departments are currently trying to manage.

There’s a tipping point in systems complexity where reactive measures are a lot more practical and effective than prevention. This has always been true; it’s just that cloud-scale has forced people to actually acknowledge it.

Categories
Tech TIL

TIL: How to pass MySQL queries as variables in Bash

Note: I also learned how to handle query blocks with Bash here documents. Most of what I’ve done in the past with MySQL and Bash has been limited to single-line selects, so Yay!, learning.

#!/bin/bash

db_server=servername.local
db_user=millhouse
db_pass=derpderpderp

# -h targets the DB server (instead of passing it as the database name),
# and -N strips the column-name header from the query output.
user_id=$(mysql -h"$db_server" -u"$db_user" -p"$db_pass" -N <<GET_USER_ID
USE main;
SELECT user_id FROM users WHERE username="dave";
GET_USER_ID
)

echo "Dave's User ID is $user_id"

It’s the small things in life…

Categories
Tech TIL

TIL: How to disable core dumps on AWS Linux

I ran across a situation where a previous sysadmin had enabled core dumps on a server, and they were filling up the root volume. The dumps weren’t needed, so I decided to disable them. Problem is, I’d never really dealt with core dumps because I’d never had to, so I had to do some Googling.

Here’s the result:

disableCoreDumps.sh
#!/bin/bash
# Only tested with AWS Linux but should work on RHEL, CentOS, Fedora, etc.

echo '*     hard    core    0' >> /etc/security/limits.conf
echo 'fs.suid_dumpable = 0' >> /etc/sysctl.conf
sysctl -p

Line 4 disables core dump creation for all users, line 5 keeps setuid processes from generating core files, and line 6 applies the changes to the running kernel.

Categories
Tech

How to get started with DevOps

The question that comes up the most when I talk to people who are interested in DevOps is “Where do I start?”

Most of the time, the answer they’re looking for is a list of tools and technologies to learn, so they usually look disappointed when they hear my answer.

If you want to get started in DevOps, if you want to deliver faster and keep up with everything that’s being asked of you without killing yourself in the process, tools and technology are the wrong places to start.

First, you need to change the way you view and manage your work.

Step 1 – Define what matters

What’s the purpose of your work?

Care and feeding of servers, or “developing,” might be what you do, but they aren’t reasons in themselves. “Keeping everything working” is almost a reason, but mostly in the vein of “I eat to live.”

Chances are, the purpose of your work is a little meta – it’s to support the needs and goals of your employer, to provide value in some way. So what does your employer value? If you don’t know, it’s time to ask and figure out how your work relates.

You’ll arrive at some ideas you can measure and center your work around – things like systems uptime, mean-time-to-resolution, code quality, customer satisfaction, and so on.

This is where you start. Everything going forward is tied to the baselines you set and how you measure your progress against the new metrics you’re now tracking. You’re about to bring some science to your job – observing, measuring, and iterating.

Step 2 – Kanban ALL the things!

Before you start automating and building robots to do all the work, you need to gain visibility into that work.

You don’t have to use kanban, although I would highly recommend it as an easy place to start. (Also, personal kanban is my spirit animal.) But you do need some sort of methodology that gives the entire team visibility into the work backlog and highlights bottlenecks that slow work down or otherwise impact the performance indicators you defined in step one.

This will require that your team starts collaborating, both internally and with other groups. They may not be used to this. You may not be used to it either, but people have to talk to one another if they aren’t already.

Once the team starts talking about what they’re working on you’ll quickly discover opportunities to quash duplicate work, misplaced work (“Why is our group even handling this?”), and problems that were otherwise going untracked or getting piled on one person.

If you’re a developer and have already been using Scrum or some sort of Agile methodology, some of these concepts might not be completely new to you. Even if that is the case, you may not be accounting for all of the backlog or might only be focused very narrowly on project work.

Step 3 – Automate and iterate

By now you should have a better idea of your team and employer’s goals as well as the work the team needs to accomplish. After dropping items from the backlog that are 1.) of little to no value, and/or 2.) never actually going to be worked on (it’s important to be honest about this), it’s time for some experiments.

Use your kanban to identify the biggest bottlenecks that slow down processing of the backlog and attack them with process and automation. It will be tempting to go after low-hanging fruit for easy wins, but it’s important to treat the bottlenecks as if nothing else matters (because it kind of doesn’t).

Note: Quality issues are bottlenecks and should be treated as such. If people have to constantly spend time fixing crappy code, failed builds, or similar issues – that’s getting in the way of other work.

It’s usually OK to throw some smaller improvements into your sprints where they fit around other work, but automating away the problems that significantly delay work (like waiting for Ops to manually build out environments) is where the focus should be. Most of the other stuff is just distraction.

As you put different processes and automation in place, make note of it and track those events against your performance indicators. Measure. Iterate. Measure again.

Step 4 – Keep learning

It’s hard for me to imagine anyone succeeding with DevOps without regularly dedicating some time to reading. For most, working via DevOps practices will feel a little foreign at first, so the more you know about what others are doing and the ideas that inspired DevOps, the more comfortable you’ll be.

There are a ton of great books that cover DevOps concepts and related topics like lean manufacturing, interpersonal relationships(!), and automation.

DevOps is a buzzword that gets thrown around to the point that a lot of people think it’s just a fad or marketing-speak. It definitely can be, depending on what the person saying “DevOps” means.

In my experience, though, the people and process components of DevOps are really powerful and can change your day-to-day work for the better. Unfortunately, these are the components of DevOps that usually get ignored. People tend to chase the shiny penny of automation without the up-front work that ensures they’re automating the right things.

If you’re interested in “DevOpsing”, I can’t stress enough how important it is to start with steps 1 and 2 before doing anything else. If you do, I can almost guarantee you’ll achieve some level of success, even if it’s just in how you handle your own work.

Categories
Tech TIL

TIL: How to get RedShift to access S3 buckets in a different region

While trying to get a Spark EMR job running, I encountered this error on a Spark step that copied data from RedShift to S3.

error: S3ServiceException:The bucket you are attempting to access must be addressed using the specified endpoint.

I’ve seen issues in the past with S3 buckets outside of us-east-1 needing to be targeted with region-specific URLs for REST (https://bucketname.s3.amazonaws.com vs. https://bucketname.s3-us-west-2.amazonaws.com) but had not seen anything similar for s3:// targeted buckets.

This got me looking at Hadoop file system references, none of which are helpful because EMR rolls its own proprietary file system for Hadoop S3 access. So Hadoop’s recommended s3a:// (which is fast and resilient – and supports self-discovering cross-region S3 access!) does not work on EMR. Your only option is s3://, which appears to be region-dumb.

The fix turns out to be simple: you just have to pass the bucket region to the Spark step as a separate argument (e.g., us-west-2).

… simple, but annoying, because the steps worked in a pre-prod environment (in a different region), so it wasn’t immediately apparent what was causing the failure, which was buried in the logs.
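For context, here’s a hypothetical sketch of what passing that argument might look like if you submit steps via boto3 (the cluster ID, script name, and --bucket-region flag are stand-ins for whatever your job actually accepts):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # your cluster ID
    Steps=[{
        "Name": "redshift-to-s3-copy",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit", "copy_redshift_to_s3.py",
                "--bucket-region", "us-west-2",  # the S3 bucket's home region
            ],
        },
    }],
)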

Categories
Tech TIL

TIL: How to get JBoss AS to gracefully handle database server failovers

Today I learned how to get JBoss AS (WildFly) to not lose its mind during a database server failover event (as occurs in the cloud quite often).

The config is simple, but finding comprehensive documentation is a bit challenging since most current JBoss docs require a Red Hat subscription. So you’re stuck piecing things together from multiple sites that contain tidbits of what’s needed.

Note: This is for a DBaaS scenario (think AWS RDS or Azure SQL Database), where the DB load balancing is done downstream of your app server. If you’re specifying multiple servers per connection, you’ll have to do some Googling of your own.

Otherwise, you’ve probably got a datasource (DS) connection defined in standalone.xml (or elsewhere depending on your deployment) that looks sort of like this:

<datasource jndi-name="java:jboss/datasources/defaultDS" enabled="true" use-java-context="true" pool-name="defaultDS" use-ccm="true">
    <connection-url>jdbc:mysql://SQL_DB_SERVER_URL:3306/DB_NAME</connection-url>
    <driver>mysql_version_derp_derp_derp.jar</driver>
    <security>
        <user-name>USERNAME</user-name>
        <password>PASSWORD</password>
    </security>
</datasource>

Adding these options makes JBoss handle failovers a bit better.

<datasource jndi-name="java:jboss/datasources/defaultDS" enabled="true" use-java-context="true" pool-name="defaultDS" use-ccm="true">
    <connection-url>jdbc:mysql://SQL_DB_SERVER_URL:3306/DB_NAME?autoReconnect=true</connection-url>
    <driver>mysql_version_derp_derp_derp.jar</driver>
    <security>
        <user-name>USERNAME</user-name>
        <password>PASSWORD</password>
    </security>
    <validation>
        <check-valid-connection-sql>SELECT 1</check-valid-connection-sql>
        <background-validation>true</background-validation>
        <background-validation-millis>30000</background-validation-millis>
    </validation>
    <pool>
        <flush-strategy>IdleConnections</flush-strategy>
    </pool>
</datasource>

Check out lines 2 and 8-15.

On line 2 we’ve added the autoReconnect=true option to the connection-url. This does exactly what it says. If a database connection attempt fails, JBoss will attempt to re-establish the connection instead of sulking in a corner like it does by default. But it needs a way to know that it should reconnect…

On lines 8-12, we’ve added connection validation. I believe some DS drivers handle this on their own, but the MySQL JDBC drivers I’ve tested appear not to. This seems to be the standard workaround from what I could find, but it does have the downside of issuing wasteful queries against the DB. The background-validation setting helps a little by issuing the validation checks in a separate thread, and background-validation-millis controls how often those checks run (from what I can tell, the background checks don’t actually fire without an interval set).

This section should force JBoss to drop dead DB connections instead of letting them gum up the pipes.

Lines 13-15 help with the same problem. By default, JBoss is supposed to flush stale connections (that’s what the docs say, at least), but this doesn’t always seem to happen in practice. Using IdleConnections should take care of any failed connections that aren’t getting flushed, or EntirePool can be used if you want to be really aggressive.