Analytics: Signal Detection – Part 03

In the previous post on signal detection in data mining (part 02), I showed how a simple correlogram can be used to work out the composition of a signal that is recorded across multiple categories. Again, the idea is that at each time point there are signals of multiple types, and the overall trend in signal value is an increase with time. The analyst’s challenge is to find the root cause of the growth. This is typically done with a Pareto-like approach: sort the signal types in descending order of magnitude and call the top ones the drivers of the trend (which is easy to explain to an audience). Such an approach is potentially erroneous, since the driver of the growth can be recorded across multiple signal types, and the set of types that record the signal can even change from one time point to the next. I guess the main challenge is to stay vigilant, even paranoid, about cryptic features in the data, as opposed to the traditional business analyst’s gut reaction: invoking other dimensions, business knowledge and situational specifics in order to stick with the simpler, albeit more defensible, explanation. This, in my opinion, is the key difference between a data scientist/miner and a business analyst.

Back to signal types: suppose I am given three tuples corresponding to three successive time points:

(16, 35, 5, 18, 30)
(2, 27, 3, 16, 42)
(17, 33, 17, 16, 18)

In the first tuple, the signal is recorded in the 2nd and 3rd components, so the signal value is 35 + 5 = 40. The second tuple contains signal recorded in the 3rd and 5th components; its value is 3 + 42 = 45. The third tuple has signal in the 1st and 2nd components; its value is 17 + 33 = 50. So in every tuple the value of the signal exceeds that of any single category, and the signal varies linearly with time.
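The arithmetic above can be checked with a few lines of R. This is only a sketch: the names x, signal_idx and signal are my own, and the list of signal-carrying components is simply transcribed from the text, since in a real dataset it is exactly the thing we do not know.

```r
# The three tuples from the text, one row per time point
x <- rbind(c(16, 35, 5, 18, 30),
           c(2, 27, 3, 16, 42),
           c(17, 33, 17, 16, 18))

# Components that carry the signal at each time point, as stated in the text
signal_idx <- list(c(2, 3), c(3, 5), c(1, 2))

# Sum the signal-carrying components row by row
signal <- sapply(seq_len(nrow(x)), function(t) sum(x[t, signal_idx[[t]]]))
print(signal)                      # 40 45 50

# The signal exceeds the largest single component in every tuple
print(signal > apply(x, 1, max))   # TRUE TRUE TRUE

# Constant first differences confirm the linear trend
print(diff(signal))                # 5 5

# A Pareto-style view: the largest single component per time point
print(apply(x, 1, which.max))      # 2 5 2
```

Note that the largest single component moves from the 2nd to the 5th and back to the 2nd, which is why ranking categories by magnitude alone can point at the wrong driver.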

In a more advanced example, the number of such categories can be very large and the trend, while linear, could be obscured by small random error. While traditional signal detection concerns itself chiefly with time-series analysis without differentiating the signal across multiple categories (i.e. a single dimension), the scenario outlined above is closer to real-world data-mining challenges, where the number of dimensions is really high.

The previous post contained a dataset with the signal broken out across categories ‘f’, ‘h’, ‘i’ and ‘j’, where the values in each category are randomly distributed. As a result, the correlogram view doesn’t show negative Pearson correlations between ‘f’ and the other three categories, as it did with the initial dataset.

Another way to identify ‘f’, ‘h’, ‘i’ and ‘j’ is through pair-wise one-way ANOVA. The signals first have to be normalized with respect to the maximum magnitude in each category. Excel has a data analysis tool that includes an option for one-way ANOVA, but you’d have to do quite a bit of manual work to get a pair-wise analysis for each combination of categories. So instead, the following R script calculates the p-value for every combination of categories. For this script to work, the data loaded into R must be in two-column form (Value_ and Category_) as opposed to the cross-tabulated form found in the spreadsheet.

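library(psych)  # read.clipboard(); recent releases ship it in psychTools instead

# Two columns expected: Value_ (normalized signal) and Category_ (coded 1 to 10)
d <- as.data.frame(read.clipboard(header = TRUE))
n <- 10  # number of categories
for (i in 1:(n - 1)) {  # stop at n - 1: (i+1):10 would count down once i = 10
     for (j in (i + 1):n) {
          pair <- d[d$Category_ == i | d$Category_ == j, ]
          # factor() makes aov treat the numeric codes as groups
          test <- aov(Value_ ~ factor(Category_), data = pair)
          # Pr(>F) holds the p-value followed by NA for the residuals row
          print(c(i, j, summary(test)[[1]][["Pr(>F)"]][1]))
     }
}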
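The script assumes the data is already normalized and in long form. A minimal sketch of getting there from a cross-tabulated layout is shown here, using base R only. The data frame wide and its numbers are made up for illustration, and the names norm and long are my own.

```r
# Made-up cross-tabulated data: one column per category, one row per time point
wide <- data.frame(a = c(10, 12, 11), b = c(4, 8, 6), c = c(50, 25, 40))

# Normalize each category by its own maximum magnitude
norm <- as.data.frame(lapply(wide, function(v) v / max(abs(v))))

# stack() turns the columns into a two-column (values, ind) long form
long <- stack(norm)
names(long) <- c("Value_", "Category_")

# Recode the category labels as integers, as the loop above expects
long$Category_ <- as.integer(long$Category_)
print(long)
```

After this step every category peaks at 1, so categories with very different raw magnitudes can be compared on the same scale.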

The p-values printed by the script can be easily transformed into a form compatible with an Excel spreadsheet (i.e. tab-separated), and the result is below.

Pair-wise ANOVA

ANOVA is supposed to show us whether the means of two categories are significantly different, based on variance. The above table shows the p-value of the one-way ANOVA test for every combination of signal categories in the data. If we take 0.05 to be the cutoff value at which the null hypothesis is rejected, the above results suggest a significant difference between the normalized mean of category ‘a’ and those of ‘f’, ‘g’, ‘h’, ‘i’ and ‘j’. Effectively, the categories appear to be split into two groups. While, of course, ‘g’ was not part of the signal, you can see that the technique effectively narrowed the signal categories down to those that we need. This speaks to the power of ANOVA.
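For what it’s worth, the tab-separated table and the 0.05 cutoff can both be handled inside R. The sketch below is one possible way, not the procedure used for the table above. The data is synthetic (four hypothetical categories, two with a lower mean and two with a higher one), and the names k, p and sig are my own.

```r
set.seed(1)  # made-up data, for illustration only
k <- 4       # hypothetical number of categories
d <- data.frame(Value_ = c(rnorm(20, 0.5, 0.1), rnorm(20, 0.5, 0.1),
                           rnorm(20, 0.8, 0.1), rnorm(20, 0.8, 0.1)),
                Category_ = rep(1:k, each = 20))

# Upper-triangular matrix of pair-wise one-way ANOVA p-values
p <- matrix(NA_real_, k, k, dimnames = list(letters[1:k], letters[1:k]))
for (i in 1:(k - 1)) {
  for (j in (i + 1):k) {
    pair <- d[d$Category_ %in% c(i, j), ]
    fit <- aov(Value_ ~ factor(Category_), data = pair)
    p[i, j] <- summary(fit)[[1]][["Pr(>F)"]][1]
  }
}

# Tab-separated text that pastes cleanly into a spreadsheet
write.table(signif(p, 3), sep = "\t", quote = FALSE, na = "", col.names = NA)

# Pairs whose means differ at the 0.05 cutoff
sig <- which(p < 0.05, arr.ind = TRUE)
print(data.frame(first = rownames(p)[sig[, "row"]],
                 second = colnames(p)[sig[, "col"]]))
```

Collecting the p-values in a matrix rather than printing them one by one keeps the category pairing explicit, which makes the split into two groups easier to spot.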

The next post in the signal detection rubric will explore a pattern that is concealed in the dataset below.

Dataset_ 4


~ by Monsi.Terdex on February 15, 2013.
