Analytics: Signal Detection – Part 01

Data analysis often involves creating pivot tables in an effort to identify the drivers of some important observation. A typical scenario is growth of one metric over time, where the task is to identify the possible causes of that growth. Among the simplest cases is a relation of type (time, category, metric), where for every time point and every category there is exactly one value of the metric. A straightforward thing to do is to create what analysts call a ‘pivot table’, a built-in tool readily available in Excel that lets you create cross-tabulations. Thus, from a three-column dataset describing the relation above, you get a large table whose columns are the time points and whose rows are the categories, with each cell holding the value for that combination of time and category (like coordinates on a Cartesian plane).
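For readers who prefer code to Excel, here is a minimal Python sketch of the same cross-tabulation. The records, the county names and the integer month labels are made-up illustrations, and `pivot` is a hypothetical helper name, not anything from a library.

```python
from collections import defaultdict

# Hypothetical (time, category, metric) records: one value per (time, category) pair.
records = [
    (1, "White Pine", 5),
    (1, "Lincoln", 3),
    (1, "Clark", 2),
    (2, "White Pine", 7),
    (2, "Lincoln", 8),
    (2, "Clark", 5),
]

def pivot(records):
    """Return (sorted time points, {category: {time: metric}})."""
    table = defaultdict(dict)
    for time, category, metric in records:
        table[category][time] = metric
    times = sorted({time for time, _, _ in records})
    return times, dict(table)

times, table = pivot(records)

# One row per category, one column per time point; missing cells shown as 0.
for category in sorted(table):
    print(category, [table[category].get(t, 0) for t in times])
```

Each printed row corresponds to one row of the pivot table, with the columns in time order.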

It’s tempting to sort the categories in descending order of total value (i.e. the sum across time), look at the top ones and call them the drivers of the trend. I’ve seen that happen many times and for a good reason: often, it’s the logical thing to do (sometimes it’s called Pareto analysis). If your dataset represents growth in clientele by geography, and your customer base is not geographically uniform, you would expect to see differences among individual geographic units. Some will have larger numbers than others, and the ones at the top are taken to be the drivers of the trend. The next step in a typical analysis is to bring in additional dimensions and situational knowledge to explain why it makes sense to see these specific categories at the top. The story often ends there…
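The ranking step described above might look like the following sketch. The table contents are illustrative counts only, and `rank_by_total` and `cumulative_share` are hypothetical helper names.

```python
# Illustrative pivot table: {category: {time point: metric}}; the counts are made up.
table = {
    "White Pine": {1: 5, 2: 7},
    "Lincoln": {1: 3, 2: 8},
    "Clark": {1: 2, 2: 5},
}

def rank_by_total(table):
    """Sort categories by their total across time, largest first."""
    totals = {category: sum(row.values()) for category, row in table.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

def cumulative_share(ranked):
    """Running share of the grand total, in ranked order (the Pareto view)."""
    grand_total = sum(total for _, total in ranked)
    shares, running = [], 0
    for category, total in ranked:
        running += total
        shares.append((category, running / grand_total))
    return shares

ranked = rank_by_total(table)
for category, share in cumulative_share(ranked):
    print(f"{category}: {share:.0%} of the total so far")
```

Note that this ranking only looks at totals; it says nothing about how each category behaves over time.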

But what if there is more to the data, and figuring out the pattern requires no knowledge of situational specifics? What if no additional data is necessary and the pattern is there, yet a simple look at the top categories in the pivot table would lead to an erroneous conclusion about the root causes of growth? Is that even possible in real life?

Well, let’s suppose the root cause of growth is complex and manifests itself in a multitude of ways. Following the example above, your clientele is growing because a certain controlling agency assigns customers to you, and customers belong to their own categories, which do not entirely correspond to geography. While there is growth across all customer types, one type is actually the top reason, i.e. the driver. As a data analyst, you’re not aware of any of this: the dataset you’re given breaks down customer counts by geography and time point, not by customer type. Also, for reasons of its own, the agency’s policy dictates that customers of certain types may be assigned only to subsets of geographic units. This roughly translates into saying that a customer of a given type may reside only in certain geographic units and not in others. But the policy doesn’t assign a single geographic unit to any given customer type, so there is no functional dependency. Critically, the distribution of customers of the same type among the different geographic units follows a pattern that appears random to you. Here is an example: 10 customers of type A are assigned to you by the agency this month, 5 of them in White Pine, 3 in Lincoln and 2 in Clark counties. Next month, you’re assigned 20 customers, 7 of them in White Pine, 8 in Lincoln and 5 in Clark counties. The prior month’s assignments have no bearing on the situation in the current month.
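The assignment process in this scenario can be mimicked with a few lines of Python. This is only a sketch under loud assumptions: the `ALLOWED` mapping, type B and Nye county are invented for illustration, and a uniform random choice stands in for whatever the agency actually does (the text only says the split looks random).

```python
import random

# Hypothetical policy: the counties each customer type may be assigned to.
ALLOWED = {
    "A": ["White Pine", "Lincoln", "Clark"],
    "B": ["Clark", "Nye"],
}

def assign(customer_type, n_customers, rng):
    """Split n_customers of one type at random across that type's allowed counties."""
    counts = {county: 0 for county in ALLOWED[customer_type]}
    for _ in range(n_customers):
        # Assumption: every allowed county is equally likely for each customer.
        counts[rng.choice(ALLOWED[customer_type])] += 1
    return counts

rng = random.Random(0)  # fixed seed so the sketch is repeatable
month_1 = assign("A", 10, rng)  # 10 customers of type A this month
month_2 = assign("A", 20, rng)  # 20 the next month, split independently
print(month_1)
print(month_2)
```

Each month’s split is drawn independently, so a county’s share in one month tells you nothing about its share in the next, while the total across the allowed counties always equals the number of customers assigned.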

Now, you probably see what I’m leading up to. The dataset by time point, geography and customer count can be used to figure out the customer type that is driving the growth. If you’re given this much information already and you’re an analyst operating in a real business setting, you’d probably look up (from historical data) the geographies in which certain customer types are known to reside, and you’d wind up knowing which combinations of categories (geographic units) to look for in your source data.

But what if you’re not given any of this information and are just given the dataset? Is there a way to figure out the actual grouping of categories responsible for growth? For one, the important thing is to suspect that such a grouping could be responsible for the observed trend.

This is what I loosely refer to as signal detection (not to be confused with the traditional meaning of the term as used in physics). The data you’re given may not contain the dimension that is actually responsible for the growth pattern. It is as if the data points are splinters of a larger whole. The signal (i.e. growth) can be very small in some seemingly unimportant categories, but if you combine certain small categories, they become one large combination showing a far stronger, unmistakable pattern (for example, exponential growth). Of course, your task then is to explain why such a grouping makes sense, but that is specific to the business you’re in.
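To make the idea of combining categories concrete, here is a small Python sketch. The four-month series are invented for illustration (only the first two months echo the type A example above), and the sketch simplifies by assuming these three counties receive type A customers only. It is not the solution from the follow-up post, just a demonstration that a sum can be much cleaner than its parts.

```python
# Invented monthly counts per county; each column sums to 10, 20, 40, 80.
series = {
    "White Pine": [5, 7, 22, 25],
    "Lincoln": [3, 8, 6, 40],
    "Clark": [2, 5, 12, 15],
}

def growth_ratios(values):
    """Month-over-month ratios: value in month t+1 divided by value in month t."""
    return [later / earlier for earlier, later in zip(values, values[1:])]

# Combine the categories by summing them month by month.
combined = [sum(month) for month in zip(*series.values())]

for county, values in series.items():
    print(county, [round(r, 2) for r in growth_ratios(values)])
print("combined", growth_ratios(combined))
```

Taken one at a time, the county ratios jump around, while the combined series doubles every month, which is the kind of exponential pattern that stays hidden as long as the categories are examined separately.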

Attached is the dataset I have in mind. Given all the information above, you’re already at least 50% done. The next post will contain a solution to the problem using standard and not-so-standard statistical tools.

Dataset

~ by Monsi.Terdex on January 18, 2013.
