All your data is garbage

Storage is cheap, especially in the cloud where you can pay pennies per geo-redundant gigabyte. For someone who once paid $200+ to replace their 400 megabyte drive with a 1000 megabyte drive, that’s almost unbelievable.

No more expensive tape libraries. No more hiring couriers to ship data to an offsite vault. No more budget-busting SANs. No more uninstalling 1001 Amazing Fonts to free up space for Warcraft: Orcs & Humans.

Cheap led to less worry and it changed the way businesses approached data. Keep it all. Keep it forever. We might need it later.

Oh, what a terrible idea.

Building a liability factory

“We’re keeping everything to protect ourselves” is a flawed approach to data retention. Even if your business is 100% ethical, the idea that “we’ve got nothing to hide and all this data will only help” is misguided and naive.

  1. No one has complete control of their narrative. Sliced one way (your way), all that data may prove you’re in the right. But the other side’s lawyers are incentivized to bring their own narrative and context to the data. It’s often better to have the legal minimum of data than a rich data set that someone can twist, reshuffle, and turn against you.
  2. If you have it, you’ve got to provide it. The issue isn’t that you’ve got something to hide; it’s that producing the data becomes costly and unwieldy. Storage is cheap, but managing it is not. Every retention and recovery process you put in place takes time and adds a burden on IT staff. If you’re trying to keep IT cheap, having 10 FTEs just to process data requests from Legal is not a good place to start.
  3. If you have it, you’ve got to protect it. Consumer-facing businesses are terrible with this one. It might be really awesome to know the hair color of all your customers, but every bit of data you collect about your customers, employees, and business partners needs to be stored as securely as possible, at significant cost and risk. Less data, less risk.

Caveat: I’m no lawyer, but I used to watch a lot of Law & Order and know several words in Latin. Habeas Corpus Callosum.

99% of your data is completely worthless

Unless you measure your corporate data in petabytes and employ your own data scientists, Big Data isn’t a thing for your business.

On the other hand, Business Intelligence might be, and a lot of businesses get great insight from analyzing their data and building dashboards. But the successful ones all have one thing in common – they know what questions they want to ask before they start collecting data.

The answers to those questions may lead to new questions that require more data to be collected and retained, but it’s better to start with the goal of “Here’s what we need to know” than with “Gee, I wonder what all this data could tell us.” The first is a path to making practical decisions; the second leads to pretty but useless charts.

The pitfall, even with BI, is making the assumption that you are gaining predictive power. Gaining insight into What Has Happened doesn’t necessarily unlock the ability to accurately predict What Will Happen (although it is generally helpful). It all depends on the data model, and good models are ridiculously hard to create and maintain.

There are instances where massive data sets make prediction even more difficult, especially when you don’t have the right people in place who understand statistical baselining and the effects of specificity. A smaller, intelligently analyzed data set may be far more useful to you.
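To make the baselining point concrete, here is a minimal sketch of the base-rate arithmetic (Bayes' rule). The numbers are hypothetical and chosen only for illustration: a model that is 99% sensitive and 99% specific, hunting for an event that occurs in 1 out of every 1,000 records.

```python
def positive_predictive_value(base_rate, sensitivity, specificity):
    """Fraction of flagged records that are genuine hits (Bayes' rule)."""
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# Hypothetical inputs, not taken from any real data set.
ppv = positive_predictive_value(
    base_rate=0.001,    # the event you care about is rare
    sensitivity=0.99,   # the model catches 99% of real events
    specificity=0.99,   # the model clears 99% of non-events
)
print(f"{ppv:.1%} of flagged records are real")
```

Under these assumptions only about 9% of the flagged records are real. The rest are false alarms, and the absolute number of false alarms grows with every record you add to the pile.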

Ask “Why?” and push back

Just because you can do something doesn’t mean you should, and there are only a few good reasons for businesses to retain data. Three, in particular, serve as a good sniff test.

  1. Regulation & Compliance – There is a specific legal requirement to store a specific data set for a specific amount of time.
  2. Real Need – The business literally needs the data to function – think payroll and financials.
  3. Quantitative Value – You already have a set of questions you want to ask your data.
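If you want the sniff test to be more than a slogan, it can be written down as a checklist that every retention request has to pass. The sketch below is one possible shape, not a standard: `RetentionRequest`, its fields, and `sniff_test` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class RetentionRequest:
    # All fields are illustrative; adapt them to your own intake form.
    dataset: str
    legal_requirement: Optional[str] = None  # the rule that mandates retention, if any
    needed_to_operate: bool = False          # e.g. payroll and financials
    questions: Tuple[str, ...] = ()          # questions you already plan to ask

def sniff_test(request: RetentionRequest) -> list:
    """Return the reasons that justify keeping the data; empty means push back."""
    reasons = []
    if request.legal_requirement:
        reasons.append("Regulation & Compliance")
    if request.needed_to_operate:
        reasons.append("Real Need")
    if request.questions:
        reasons.append("Quantitative Value")
    return reasons

payroll = RetentionRequest("payroll", needed_to_operate=True)
hair_color = RetentionRequest("customer_hair_color")

print(sniff_test(payroll))     # ['Real Need']
print(sniff_test(hair_color))  # []
```

An empty list is the cue to ask “Why?” before anyone provisions storage.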

Requests for data retention that don’t pass the sniff test tend to lead down nonsensical rabbit holes. Unless your business is, literally, data, you should be keeping as little data as possible.

Cheap as storage might be, between liability, opportunity cost, and management cost, all the data you’re storing could wind up being very expensive for your business.

Image Credit: United Nations