Analysis can be intense. Lots of data and metrics. Big data. Lots of tools and data sources. Analysis can be time-consuming and demand Excel gymnastics. Worst case, lots of time is spent but no useful insights emerge. Some of this is expected because doing analysis is like a detective finding clues. Nevertheless, it’s useful to have a plan. Here’s my plan – what I am calling a trail map – for data analysis (if this post is too basic, read from the greats* listed at the end).

First, a few caveats – (1) this trail map is a continual work in progress, and I will update it here periodically; (2) it assumes an internet startup context, where velocity matters and experimentation is possible; (3) the majority of my analysis is in Excel, but I occasionally use the command line and Python/SQL to cut big data down to size.

With that, here’s my trail map for data analysis:

  1. Marinate in the question (or problem statement or goals)
  2. Play with the raw data. Play in the real world.
  3. Expect to clean the data … a lot
  4. Produce the headline stats – overall and for segments
  5. Put numbers in perspective – baselines, normalizing, comparables, changes, and trends
  6. Mock up charts that answer the question. Iterate. Make the point.

More on each topic below.

 

1. Marinate in the question (or problem statement or goals)

Someone smart said “the best solution lives in the best definition of the problem” (Tell me if you know who). So marinate in the question till it is fully understood.

  • What outcomes does the business want? What is the O in OKR?  
  • What is the context of this question?
  • What sub-questions can you ask? Asking the 5 Whys works here.
  • What questions are a proxy to this question? Can this question be refined or reduced to an easier (or different) question?
  • What’s a guess – ahem, hypothesis – for the answer? What might a chart that answers the question look like? (See section #6 on mocking up charts below.)

 

2. Play with the raw data. Play in the real world.

Play with the raw data in the Excel file.

The same can be done with TSV files or SQL result sets. Doing so gets you closer to the data, and seeing all its beauty and all its warts can open up creative possibilities. (If you prefer code for this first pass, see the pandas sketch after this list.)

  • Columns: What are the column headings? Which columns have category data, which have segments or dimensions, and which have metrics? Are columns independent, or are some derived or calculated from other columns?
  • Rows: What could make a row unique? Are there duplicate rows? How many rows in total?
  • Values: Any blank or empty values? What’s the range of values in a numerical column? What is the list of categorical values? What’s most frequent?
  • Filters: Filter columns to get some idea of segments.
  • Sort: This helps to get an idea of the size and order in the data. Important – first make a copy, or add a column that hard-codes the original row order, because there could be a clue in that order.
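
Here’s a minimal pandas sketch of that first pass – the file name and the "channel" column are placeholders for your own data:

```python
import pandas as pd

# Load the raw export; pd.read_csv also handles TSVs via sep="\t"
df = pd.read_csv("raw_export.csv")  # hypothetical file name

# Preserve the original row order before any sorting -- there could be a clue in it
df["original_order"] = range(len(df))

# Columns: headings and inferred types (dimensions vs. metrics)
print(df.dtypes)

# Rows: total count and duplicates
print(len(df), "rows,", df.duplicated().sum(), "duplicates")

# Values: blanks, numeric ranges, most frequent categorical values
print(df.isna().sum())
print(df.describe())
print(df["channel"].value_counts().head())  # "channel" is a placeholder column
```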

Get out of Excel. And play in the real world.

This is to understand how real users (or devices) interact with the website or app or physical location that is producing all the raw data. The reason to do this is to understand the context of the data and build hypotheses on why something is happening – just like a detective inspects the crime scene for clues.

  • Open websites in a browser and notice the elements on the page.
  • Go through the purchase process. Buy something that’s typical for the website and cheap and useful for the office – you can generally expense it.
  • Download and play with apps.
  • Sign up as an Uber driver.
  • Go to the physical store and walk the aisles.

 

3. Expect to clean the data … a lot

First check if the data is utter garbage

Big data is mostly noise and little signal, and the fun is in getting to the signal. However, you could get data that is utter garbage due to a botched SQL query, or an erroneous segment, or – wait for it – the wrong data store. So, first thing, think about the one quick analysis (or a small handful, and I mean a handful) that will tell you whether the data is utter garbage. Often, checking whether simple summary stats such as total revenue and total customers make sense will tell you. For time-series data, consider trending a few metrics to see whether they look reasonable given seasonality or other business context.
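
As a sketch, that garbage check can be a few lines of pandas – the file and column names (order_date, revenue, customer_id) are assumptions for illustration:

```python
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_date"])  # hypothetical file/columns

# Quick summary stats -- do these match what the business expects?
print("Total revenue:", df["revenue"].sum())
print("Total customers:", df["customer_id"].nunique())

# Trend a metric over time -- does it look reasonable given seasonality?
weekly_revenue = df.set_index("order_date")["revenue"].resample("W").sum()
print(weekly_revenue)
```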

Consider measurement error or bias in the data

For example, if a pixel tracker stops recording or records duplicate events, there is measurement error. Measurement bias could come from the type of people who respond to a survey and how the survey was conducted. Ideally, never delete or update the raw data; instead, add columns to the far right that flag rows to exclude or hold corrected values. Finally, consider whether the data is likely to be consistently error-prone or consistently biased. If so, it may still be OK to use it for comparisons and trending.
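
Continuing the sketch above, one way to follow the "never delete raw data" rule in pandas is a flag column – the duplicate-event check on event_id is a made-up example:

```python
# Flag suspect rows instead of deleting them -- the raw data stays intact
df["exclude"] = df.duplicated(subset=["event_id"])  # hypothetical duplicate-event check

# Analyze the clean subset; excluded rows remain available for auditing
clean = df[~df["exclude"]]
```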

How clean is clean

Even Purell® says it kills 99.99% of germs, not all of them. In our case, the question is how much bad data you can kill before there is nothing left for a thorough analysis. Outside the extremes, it becomes a judgment call. To give some idea of how I think about it: if I have 80-90% or more of the data, I go ahead with the analysis. With 50-80% of the data, I treat the resulting insights as directional, or look for workarounds to get to an answer. At less than 50%, I go back and thoroughly understand the data collection process.

What might change if you had perfect data

Suppose the analysis was done on the cleaned data – what would happen if you actually had perfect data? Would your analyses or summary change? If so, in which direction? And most importantly, would that change the recommendations for achieving business outcomes? Thinking through these what-if scenarios can boost confidence in the recommendation despite the lack of perfect data.

 

4. Produce the headline stats – overall and for segments

Have you noticed that general news articles often put a single number in the headline? The benefit is that it is memorable. It is also easier to appreciate the number’s size and, in some cases, its shock value. These headline stats (a.k.a. descriptive statistics) come in handy later for sanity checking, for prioritizing, and for avoiding rabbit holes in the pursuit of minutiae.

Do this for the overall totals, e.g., Total Revenue, Total Population, etc. Also partition the data into segments in a MECE (mutually exclusive, collectively exhaustive) way. The segments could be traffic sources, page types, queries, stores, metro areas, verticals, departments, user personas, etc. Excel pivot tables and Python pandas are good for this. One thing to remember about segments such as user personas: personas may not fit neatly into segments, or may overlap – such assumptions can be tested.
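
A minimal pandas version of the pivot-table step might look like this – the "traffic_source" and "revenue" columns are placeholders:

```python
# Overall headline stat
total_revenue = df["revenue"].sum()

# The same metric partitioned by a segment, largest first
by_source = df.groupby("traffic_source")["revenue"].sum().sort_values(ascending=False)
print(by_source)

# Sanity check: if the segments are MECE, they should sum back to the overall total
assert abs(by_source.sum() - total_revenue) < 1e-6
```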

 

5. Put numbers in perspective – baselines, normalizing, comparables, changes, and trends

Putting numbers in perspective is popular. For example, volume specified in Olympic swimming pools, area in football fields, data size in Libraries of Congress, etc. The reason to do this is to make numbers more understandable and accessible to the reader.

Transferring this concept to data analysis – What does traffic lift mean in terms of days of revenue? What does productivity improvement mean in terms of satisfied customers? What share of total size is our headline number? There are trustworthy data sources for putting numbers in perspective – Google Trends, US Census, Eurostat, McKinsey Global Institute, United Nations, etc.

Year on year is quoted often. So, is last year always a good baseline for comparison? How did the market do year on year? Would you expect your site to do better or worse than the market year on year? (By the way, what is the market?) Is your site doing worse only because last year was an inflated baseline – say, because of a major promotion or a product launch?

A great example of normalization is the Big Mac index. Then there are the statistical normalizations – z-score and coefficient of variation. And then there are the simpler ones. Comparing different time periods? Normalize to a daily average. Comparing cities? Normalize to population or income level.
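
As a sketch, the z-score and daily-average normalizations might look like this – the revenue figures and month lengths are illustrative:

```python
# z-score: how many standard deviations each value sits from the mean
df["revenue_z"] = (df["revenue"] - df["revenue"].mean()) / df["revenue"].std()

# Daily average: put a 31-day month and a 28-day month on the same footing
monthly_revenue = {"Jan": 310_000, "Feb": 294_000}  # hypothetical totals
period_days = {"Jan": 31, "Feb": 28}
daily_avg = {m: monthly_revenue[m] / period_days[m] for m in monthly_revenue}
print(daily_avg)  # {'Jan': 10000.0, 'Feb': 10500.0} -- Feb actually did better per day
```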

A few other things: When comparing, calculate whether the difference is statistically significant – extra credit for doing this. I will find some online tools for this, or put up my Excel sheets for hypothesis testing. When showing percentages, also show the sample size n or the absolute values. If the percentage is a share, then show the 100% total so it is clear what sums to 100%.
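
Until those are up, here’s a hedged sketch of one common case – comparing two conversion rates with a chi-squared test via scipy (the counts are made up):

```python
from scipy.stats import chi2_contingency

# Conversions vs. non-conversions for variants A and B (illustrative counts)
table = [[120, 880],   # A: 120 of 1,000 visitors converted
         [150, 850]]   # B: 150 of 1,000 visitors converted

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"p-value: {p_value:.4f}")  # below 0.05 is conventionally "significant"
```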

 

6. Mock up charts that answer the question. Iterate. Make the point.

Mocking up a chart first helps you visualize the end product and decide which analysis to prioritize to deliver it.

  1. Start with a drawing on a whiteboard or a sheet of paper (or mock it up in code – see the sketch after this list). Look at the common chart types in Excel. For location or geographic data, strongly consider maps.
  2. Build the data table and fill in as many values as possible. If you don’t have data, use something like t-shirt sizes to estimate relative sizes.
  3. Try to make the mock graph as much to scale, and with as much real data, as possible.
  4. Add annotations, labels, and footnotes.
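
If you’d rather mock up in code than on paper, here’s a minimal matplotlib sketch – the numbers are t-shirt-size placeholders, not real data:

```python
import matplotlib.pyplot as plt

# Placeholder values standing in for real data (relative sizes only)
segments = ["Search", "Social", "Email", "Direct"]
revenue_share = [40, 25, 20, 15]

fig, ax = plt.subplots()
ax.bar(segments, revenue_share)
ax.set_title("Revenue share by traffic source (mock data)")
ax.set_ylabel("Share of revenue (%)")
ax.set_ylim(0, 50)
ax.annotate("Search dominates", xy=(0, 40), xytext=(1.5, 45),
            arrowprops={"arrowstyle": "->"})
fig.savefig("mock_chart.png")
```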

Iterations are needed because, as the graph gets built up with real data, more questions (see #1 and #2) or more clean-up (#3) may surface. Be sure to save copies between iterations. I also recommend stepping away from Excel and going for a walk to clear the head.

Finally, to make the point, figure out the story from the graphs. Then tell the story and let the graphs be the exhibits. The point needs to be in plain English, straightforward enough for any professional to understand. Good examples are the science, tech, business, and economics articles in The Economist. Is the data or analysis incomplete? State the hypotheses and why they are the ones worth looking into further. Have a recommendation? State the recommendation along with its trade-offs, and whether a real-world experiment can settle the question.

* Avinash Kaushik, DJ Patil, William Chen, Nate Silver, Steven Levitt, and more greats.

PS: Check out more articles on building products. I write to pay it forward and to sharpen my thinking.