Analytics: Signal Detection – Part 02

In last week’s post I introduced the problem of signal detection – again, the term is not to be confused with its traditional interpretation in physics and higher mathematics. I proposed a dataset in which growth in a variable over time was broken down by categories. In a professional analytical setting, the exercise of identifying the root cause of growth often consists of creating a pivot table and sorting the categories in descending order of total value across all time points (the Pareto principle); the top several categories (possibly with a cumulative-percent cutoff) are then treated as the leading causes. This approach is appropriate because in many situations you have no reason to suspect anything cryptic in the data. Searching for hidden patterns must be warranted, since it can consume a lot of an analyst’s time.
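In code terms, that conventional first pass might look like the following sketch, assuming the tab-separated time/category/value layout produced by the generator at the end of this post (the variable names are mine):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using namespace std;

// Conventional first pass: total each category across all time points,
// then rank the categories in descending order of total (Pareto style),
// printing a cumulative-percent column alongside.
int main(){
    ifstream inFile("out.txt");
    map<string, long> totals;
    int t, value;
    string name;
    while (inFile >> t >> name >> value)
        totals[name] += value;

    vector<pair<long, string> > ranked;
    long grand = 0;
    for (map<string, long>::iterator it = totals.begin(); it != totals.end(); ++it){
        ranked.push_back(make_pair(it->second, it->first));
        grand += it->second;
    }
    sort(ranked.rbegin(), ranked.rend());   // descending by total

    long running = 0;
    for (size_t i = 0; i < ranked.size(); i++){
        running += ranked[i].first;
        cout << ranked[i].second << '\t' << ranked[i].first << '\t'
             << 100.0 * running / grand << "%" << endl;
    }
    return 0;
}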

The answer to the problem in the last post was actually rather simple, and what is even better is the simple way it can be found. If you look at the correlogram showing the Pearson coefficient between the different categories, you’ll see (perhaps with the help of a highlighter) that category ‘f’ correlates negatively with categories ‘h’, ‘i’ and ‘j’. Not only that, but these are the only pairs of categories in the entire correlogram with negative correlations, so they should immediately attract attention. As my analytics experience shows, analysis is most often ‘simply’ about maniacal attention to detail and much less about knowing sophisticated techniques.
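For reference, here is a minimal sketch of the Pearson coefficient computation the correlogram is built from (the helper function and its name are mine, not part of the generator code below):

#include <cmath>
#include <vector>

using namespace std;

// Pearson correlation coefficient between two equal-length series,
// e.g. the 100-point time series of two categories.
double pearson(const vector<double>& x, const vector<double>& y){
    int n = x.size();
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++){
        sx  += x[i];        sy  += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    // r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
    return (n * sxy - sx * sy) /
           sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
}

Computing this for every pair of the ten categories and scanning for negative values reproduces the reading of the correlogram described above.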

Analytically, the result makes sense if the following assumption is made: when the signal is broken down by categories, the total is divided sequentially, with each category taking a random share of what remains and one fixed ‘left-over’ category, the same at every time point, receiving the rest. For example, suppose that at time point 30 the total value of the signal is 3,500, broken down between the four categories ‘f’, ‘h’, ‘i’, ‘j’ as the tuple (57, 932, 2,240, 271). The breakdown could have proceeded as follows: first, category ‘h’ was assigned a random value between 1 and 3,500 (here 932). That value was subtracted from 3,500, forming a new maximum (in this case 2,568). Next, category ‘i’ was assigned a random value between 1 and 2,568: in this case 2,240, forming a new maximum of 328. Next, category ‘j’ was given 271. The remainder, 57, was assigned to the left-over category ‘f’. Now the negative correlations observed in the correlogram below make sense: a rise in ‘f’ would likely correspond to a decline in ‘h’, ‘i’ and ‘j’, and vice versa.
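A minimal walk-through of that sequence in code, with the example’s fixed numbers standing in for the random draws:

#include <iostream>

int main(){
    // Time point 30: total signal value 3,500, split sequentially.
    int maximum = 3500;
    int h = 932;  maximum -= h;   // draw in [1, 3500]; new maximum 2568
    int i = 2240; maximum -= i;   // draw in [1, 2568]; new maximum 328
    int j = 271;  maximum -= j;   // draw in [1, 328]
    int f = maximum;              // left-over category 'f' receives 57
    std::cout << f << '\t' << h << '\t' << i << '\t' << j << std::endl;
    return 0;
}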

It should be clear that the breakdown of the signal value between ‘f’, ‘h’, ‘i’, ‘j’ described above is not truly random: had you remodeled it with a genuinely random process, you would not have seen the negative Pearson coefficients between ‘f’, ‘h’, ‘i’ and ‘j’.

So, in the end, why was it called a signal? Well, if you add the ‘f’, ‘h’, ‘i’ and ‘j’ categories together, you get a clear linear pattern in the data, and the combined category bubbles up to the top of the Pareto chart, making it a much stronger (pretty much indubitable) candidate for the driver of the trend than category ‘a’.
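To illustrate, here is a minimal sketch that sums the four signal categories per time point from the generated file (same assumed layout as before); the totals should trace the clear linear pattern:

#include <fstream>
#include <iostream>
#include <map>
#include <string>

using namespace std;

// Combine categories f, h, i and j at each time point; with the
// generator below, the sum works out to exactly 500 + 100 * t.
int main(){
    ifstream inFile("out.txt");
    map<int, int> combined;              // time point -> f + h + i + j
    int t, value;
    string name;
    while (inFile >> t >> name >> value){
        if (name == "f" || name == "h" || name == "i" || name == "j")
            combined[t] += value;
    }
    for (map<int, int>::iterator it = combined.begin(); it != combined.end(); ++it)
        cout << it->first << '\t' << it->second << endl;
    return 0;
}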

What if the breakdown of the signal between categories ‘f’, ‘h’, ‘i’ and ‘j’ had been genuinely random? Then we would not have observed the negative Pearson coefficients in the correlogram and would not have been able to figure out which categories make up the signal. Let’s see what other approaches can be used to attack this problem. Below is the link to a revised dataset, very similar to the one in the previous post, where the value of the signal at every time point actually is randomly spread between the same four categories: ‘f’, ‘h’, ‘i’ and ‘j’.

[Figure: correlogram of Pearson coefficients between categories]

[Figure: signal chart]

And finally, here is the C++ code which I used to generate the result set.

#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>
#include <vector>

using namespace std;

// Split the value s randomly into n parts that sum to s: each of the
// first n - 1 parts takes a random share of what remains, and the
// last part receives the remainder.
vector<int> split(int s, int n){
    vector<int> v;
    for (int i = 0; i < n - 1; i++){
        int s1 = rand() % s;
        v.push_back(s1);
        s -= s1;
    }
    v.push_back(s);

    return v;
}

int main(int argc, char ** argv){
    ofstream outFile("out.txt");

    // The first four categories carry the signal; the remaining six
    // are noise.
    vector<string> names;
    names.push_back("f");
    names.push_back("h");
    names.push_back("i");
    names.push_back("j");

    names.push_back("a");
    names.push_back("b");
    names.push_back("c");
    names.push_back("d");
    names.push_back("e");
    names.push_back("g");

    // rand() is left unseeded, so every run reproduces the same dataset.
    for (int i = 0; i < 100; i++){
        int s = 500 + i * 100;              // total signal value grows linearly

        // Spread the signal value randomly between the four signal categories.
        vector<int> v = split(s, 4);
        for (int j = 0; j < 4; j++){
            outFile << i << '\t' << names[j] << '\t' << v[j] << endl;
        }

        // The six noise categories get independent random values.
        for (int j = 4; j < 10; j++){
            outFile << i << '\t' << names[j] << '\t' << rand() % s << endl;
        }
    }
    return 0;
}

Signal Detection: Dataset: Part 02

~ by Monsi.Terdex on January 25, 2013.
