5 Keys to Data Success: Preparation

Mik Data Science April 3, 2016

Data preparation is a critical step prior to analysis.

Get your data ready to work for you.

In our first two articles of the series “5 Keys to Data Success” we introduced the initial steps to getting a data initiative started: collecting and managing data. In this article we’ll cover one of the most crucial, and difficult, steps before you can really turn data into insight: preparing your data for analysis.

It goes by many names (e.g., data cleaning, data wrangling, data munging), but here we’ll simply refer to all of those processes as data preparation. Proper data preparation is a tricky matter, tightly coupled with both collection and management. Many simply leave it to data science experts. In fact, some estimate that as much as 80% of a data scientist’s work is preparation. But with data scientists commanding high salaries, do you know what to consider to keep your preparation costs to a minimum? Can you skip it entirely? (Spoiler: probably not!) Here are 5 questions you need to consider before diving in:

1. Does my data need to be transformed and combined?

Many companies are collecting data from various sources, in various formats. If you are one of these companies and intend to analyze all the data together, you’ll need to transform and combine the data into consistent, usable formats.
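To make that concrete, here is a minimal sketch in Python of what "transform and combine" can look like. Everything in it is an assumption for illustration: the two sources (a CRM export and a web log), their field names, and the sample values are hypothetical. The idea is simply that each source gets its own small converter into one shared record layout, and only then are the records put together.

```python
from datetime import datetime

# Hypothetical records from two sources with different field names and formats.
crm_rows = [
    {"Customer ID": "17", "Signup": "04/03/2016", "Spend": "$1,250.00"},
]
web_rows = [
    {"customer_id": 17, "signup_date": "2016-04-03", "spend_cents": 4200},
]

def from_crm(row):
    """Convert a (hypothetical) CRM export row to the shared layout."""
    return {
        "customer_id": int(row["Customer ID"]),
        "signup_date": datetime.strptime(row["Signup"], "%m/%d/%Y").date(),
        # Strip the currency symbol and thousands separator before parsing.
        "spend": float(row["Spend"].replace("$", "").replace(",", "")),
        "source": "crm",
    }

def from_web(row):
    """Convert a (hypothetical) web log row to the shared layout."""
    return {
        "customer_id": int(row["customer_id"]),
        "signup_date": datetime.strptime(row["signup_date"], "%Y-%m-%d").date(),
        # This source stores money in cents; convert to the same unit as the CRM.
        "spend": row["spend_cents"] / 100,
        "source": "web",
    }

# Once every record shares one layout, combining is just concatenation.
combined = [from_crm(r) for r in crm_rows] + [from_web(r) for r in web_rows]
```

After this step both records carry the same customer ID type, the same date type, and spend in the same unit, so they can be analyzed side by side.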

2. Who should be in charge of transforming/combining the data?

Who is the ultimate stakeholder of the data and insights? Be sure to involve them when deciding what should be combined and how it’s transformed. Don’t leave the data scientists to their own devices without business insight.

3. Does my data need cleaning?

It’s rare for data collection to be perfect. Whether data points are garbled, untrustworthy, or simply missing, they must be cleaned properly prior to use. Life can be dirty, but your data is almost guaranteed to be.
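As a rough illustration of those three kinds of problems, the Python sketch below sorts raw values into usable numbers and rejected entries, with a reason attached to each rejection. The function name, the sample ages, and the 0 to 120 plausibility range are all assumptions made up for this example; real cleaning rules should come from whoever understands the data.

```python
def clean_values(raw_values, low=0.0, high=120.0):
    """Split raw values into usable floats and rejected (value, reason) pairs."""
    cleaned, rejected = [], []
    for value in raw_values:
        # Missing: nothing was recorded at all.
        if value is None or str(value).strip() == "":
            rejected.append((value, "missing"))
            continue
        # Garbled: something was recorded, but it is not a number.
        try:
            number = float(value)
        except (TypeError, ValueError):
            rejected.append((value, "garbled"))
            continue
        # Untrustworthy: a number, but outside the plausible range.
        if not (low <= number <= high):
            rejected.append((value, "out of range"))
            continue
        cleaned.append(number)
    return cleaned, rejected

# Hypothetical survey ages, as they might arrive from a form.
ages = ["34", " 41 ", "", None, "4l", "-5", "230", 58]
cleaned, rejected = clean_values(ages)
```

Keeping the rejected entries together with their reasons, rather than silently dropping them, makes it easier to judge later whether the problem lies in the collection step.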

4. How much do I care about understanding the preparation (and what’s the budget)?

Obviously, you care about what goes into the preparation. But this is about how much time and money is spent on characterizing the data for the “perfect” preparation. Do you want data preparation that simply works, preparation that is rigorously scientific, or something in between?

5. What tools can lower my costs?

From visualization to transformation, lots of tools are out there to help. It might even be cost-effective to build your own, if preparation procedures are repeatable across datasets. But keep in mind, tools can get very expensive very fast.

There’s an understandable desire to jump straight from collection to analysis, but don’t forget to think about data preparation, as unglamorous as it is. Without it, your analysis will likely end up useless. And at an estimated 80% of the labor in data science, you can bet preparation is one of the most costly and difficult steps to take. So keep these 5 questions in mind when you get into it. Whether it’s your data or your strategy, preparation is key!