## Analytics: Signal Detection – Part 04

So, the last post in this rubric introduced a new dataset (Dataset 4), which I thought was, in some sense, radically different from the ones I proposed previously. This time, the idea is 'simply' to figure out its structure. We have five attribute columns, and they all seem to increase with time. A natural first reaction is to look at the behavior of the combined signal: simply add up all five categories and see how the sum varies with time (Figure 1). An astute observer will already notice a mountainous silhouette that seems to be 'hidden' in the overcast sky enveloping the horizon. If one were to approximate the outline of a mountain range with some kind of mathematical pattern, there would be cyclicality somewhere, right? Hmmm, first suspicions. OK, moving forward…
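The combined-signal step is trivial, but for completeness here is a minimal sketch. The real Dataset 4 values are not reproduced in this post, so the data below is a made-up stand-in, and the column names A through E are my assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Dataset 4: five attribute columns that all
# trend upward with time (the real values are not shown in the post).
rng = np.random.default_rng(0)
t = np.arange(100)
df = pd.DataFrame({c: t * k + rng.normal(0, 5, t.size)
                   for k, c in enumerate("ABCDE", start=1)})

# Combined signal (Figure 1): simple row-wise sum of the five categories.
df["total"] = df[list("ABCDE")].sum(axis=1)
print(df["total"].head())
```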

We already have the correlogram as a first line of attack on datasets with multiple signal categories (that follows from its use in previous posts in this rubric). The correlogram (Figure 2) on these six columns (the sixth being the total) shows that signal 'A' correlates very strongly with the total, so we shall look further into it (a deductive approach). As expected (see Figure 3 below), 'A' also shows a 'mountainous' pattern when plotted against time, perhaps an even stronger one than the total. Einstein said that imagination is more important than knowledge. The leap of imagination here is to look at the cumulative value of the signal over time (Figure 4). Actually, this is one of the usual things to do when looking at a time series (other 'standard' tools include moving averages, variance, autocorrelation, etc.).
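A correlogram in this sense is just the pairwise correlation matrix of the columns. A minimal sketch, again on made-up stand-in data rather than the real Dataset 4:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the six columns (five categories plus total).
rng = np.random.default_rng(0)
t = np.arange(100)
df = pd.DataFrame({c: t * k + rng.normal(0, 5, t.size)
                   for k, c in enumerate("ABCDE", start=1)})
df["total"] = df[list("ABCDE")].sum(axis=1)

# Pairwise Pearson correlations; the "total" row shows which single
# category tracks the combined signal most closely.
corr = df.corr()
print(corr["total"].sort_values(ascending=False))
```

In the stand-in data every column trends with time, so all correlations with the total come out high; in the real dataset it was 'A' that stood out.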

This is essentially the 'aha' moment in this challenge. An experienced scientist would be immediately struck by the apparent near-perfect polynomial growth of the cumulative signal with time. In Excel, there is a way to quickly fit an order-2 polynomial trendline and get the R² value and the equation. Needless to say, the R² is almost one in this case, confirming the visual guess. Such a strong fit cannot be dismissed as noise: something is going on. Typically, one would use a time series like this to build a forecasting model, and in that case analysis of residuals is one of the critical steps to ensure the model assumptions were correct. The differences between the model's output and the actual observed data are always of interest. You, dear reader, probably wonder why there is even a mention of forecasting techniques. After all, the grand challenge is to find some kind of robust structural feature in the data, not to build a forecasting model on top of it. But the techniques employed in constructing a predictive model may reveal hidden patterns in the data, which is the rationale for using them here.
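The same order-2 fit Excel's trendline does can be reproduced with `numpy.polyfit`. The toy signal below is my assumption, built to mimic the structure described later in the post (block sums growing linearly, which makes the cumulative sum roughly quadratic); the real values differ:

```python
import numpy as np

# Toy signal mimicking 'A': block averages step up every 10 ticks, so
# the cumulative sum grows approximately quadratically (assumed shape,
# not the real data).
rng = np.random.default_rng(1)
t = np.arange(200)
a = 10.0 * (t // 10 + 1) + rng.normal(0, 1, t.size)

cum = np.cumsum(a)

# Order-2 polynomial fit, as with Excel's trendline, plus R^2 by hand.
coeffs = np.polyfit(t, cum, 2)
fitted = np.polyval(coeffs, t)
ss_res = np.sum((cum - fitted) ** 2)
ss_tot = np.sum((cum - cum.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.4f}")
```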

So, if we were to go on with a forecasting model, we would take a closer look at the residuals. Remember the 'mountainous' silhouette and the suspected cyclicality? Well, when you plot the residuals against time, the periodicity is more than apparent: it is literally staring at you. From the first few data points, it looks like a period of 10 has some importance, right? The values seem to cluster every 10 ticks or so. Well, why don't we see what happens with signal 'A' when we look at its values in groups of 10? First thing: sum them up.
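One way to confirm the period by eye-independent means is the autocorrelation of the residuals: a repeating pattern shows up as elevated autocorrelation at the period's lag. A sketch on the same toy signal as above (assumed, not the real data):

```python
import numpy as np

# Same toy stand-in for 'A' as before (assumed structure).
rng = np.random.default_rng(1)
t = np.arange(200)
a = 10.0 * (t // 10 + 1) + rng.normal(0, 1, t.size)

# Residuals of the order-2 polynomial fit to the cumulative signal.
cum = np.cumsum(a)
coeffs = np.polyfit(t, cum, 2)
resid = cum - np.polyval(coeffs, t)

# Autocorrelation of the residuals; a period of 10 shows up as a
# peak at lag 10 (and its multiples).
r = resid - resid.mean()
acf = np.correlate(r, r, mode="full")[r.size - 1:]
acf /= acf[0]
print("acf at lags 5 and 10:", acf[5], acf[10])
```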

Bingo!

The signal in category 'A' turns out to increase linearly with time when you add up the values of each consecutive 10 time points. So, if I add points 1 to 10, I get 500. If I then add points 11 to 20, I get 1000, and so on. Perfect linear growth. The next step (induction) is to verify the same pattern for the other signal attributes. Needless to say, a similar pattern is found by simple trial and error (a brute-force approach). In the end, column A has an interval of 10, B of 17, C of 24, D of 31 and E of 38.
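The block-sum check and the brute-force period search can both be sketched in a few lines. The signal below is a made-up, noiseless version of 'A' constructed so its blocks of 10 sum to 500, 1000, 1500, … as in the post; `best_period` is a hypothetical helper that scores each candidate window length by how close its block sums are to perfectly linear:

```python
import numpy as np

# Noiseless toy 'A' (assumed): every consecutive block of 10 points
# sums to 500, 1000, 1500, ... as described in the post.
t = np.arange(100)
a = 50.0 * (t // 10 + 1)

block_sums = a.reshape(-1, 10).sum(axis=1)
print(block_sums[:4])  # 500, 1000, 1500, 2000

def best_period(x, max_p=39):
    """Brute force: the window length whose block sums have the most
    nearly constant first differences (i.e. the most linear growth)."""
    scores = {}
    for p in range(2, max_p + 1):
        n = (len(x) // p) * p
        s = x[:n].reshape(-1, p).sum(axis=1)
        if len(s) < 3:
            continue
        d = np.diff(s)
        scores[p] = np.std(d) / (abs(np.mean(d)) + 1e-12)
    return min(scores, key=scores.get)

print(best_period(a))  # 10
```

The same search run on columns B through E would, per the post, land on 17, 24, 31 and 38 respectively.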

Let me know if you find any other structurally robust features in this dataset; those were the ones I intended to include, but I would love to know if there is anything else to it.