Analytics: Signal Detection – Part 07 – Data Wrinkle

One of the earliest posts in this blog had to deal with signal detection when the actual signal was concealed and had to be amplified. More specifically, suppose we have a dataset with three columns: signal name, time and signal value. We then create  a crosstabular display, looking at how each signal behaves with respect to time. The question is: which signal is the driver of the trend, and the typical analyst solution is to sort signal name column in descending order of overall magnitude to determine the top signal – that is declared the driver of the trend. I challenge such approach by noting that it would be worth to create a correlation triange (correlogram), which would show Pearson coefficient for every combination of signals. We may then identify strongly correlated signals which appear on different levels of the pareto and, when combined together, may form an even stronger candidate for the driver of the trend. This idea can be easily implemented provided that time series for every pair if signals are comparable. Unfortanetely in the real world things are not always that smooth: you may have a pair of signals which have values for the earlier timepoints but not for the later, yet the simplistic approach is to compute Pearson using every data point. Thus one may need an algorithmic twist to select only that part of the time series which can be compared against the other signal.

Here is a simple solution: below is a C++ code which loads several timeseries from flat file (one timeseries per row, name is in left-most column). The timeseries tableau is tattered, i.e. every signal’s timeseries contains intervals which are abnormally low in signal value and cannot be used as valid data points in deriving Pearson coefficient. The strategy is to zero-in on that interval of timeseries where both signals contain values that are larger than a certain threshold (in the code below it’s 5). The ‘out’ file contains Pearson for every pair of timeseries calculated for those longest extracted intervals.

/*
========================================================================
Author: Monsi Terdex
Date: 07/27/2013

Description:

– Pearson coefficient calculator
– loads time series from flat ASCII file
– for every pair of time series, extracts longest interval which
contain every data point higher than threshold value
– calculates Pearson on the pair of these extracted intervals
========================================================================
*/

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <sstream>
#include <math.h>
#include <vector>

using namespace std;

double magnitudeThreshold = 5.0;
int minimalLength = 5;

typedef struct {
    double * a;
    double * b;
    int len;
} ArrayRecord;

/*
=======================
Maps first and second timeseries to new versions
which which both contain data poitns
above threshold level

Parameters:
n – count of data points in both series
a – first timeseries
b – second time series

Returns: record with pointers to modified series
containing only the longest interval where
both time series have acceptable threshold values
=======================
*/

ArrayRecord compress (int n, double * a, double * b){
    int min = 0;
    int max = 0;
    int f = 1;
    int maxLen = 0;
    
    int minFinal = 0;
    int maxFinal = 0;
    
    for (int i = 0; i < n; i++){
        if (a[i] > magnitudeThreshold && b[i] > magnitudeThreshold){
            if (f) {
                min=i;
                max=i;
                f=0;
            } else {
                max = i;
            }
        } else {
            if (f) {
                
            } else {
                max = i-1;
                
                if (max – min + 1 > maxLen){
                    minFinal = min;
                    maxFinal = i – 1;
                    maxLen = max – min + 1;
                }
                
                f= 1;
            }        
        }
    }
    
    if (!f){
        max = n-1;
        if (n-min > maxLen){
            minFinal = min;
            maxFinal = n-1;
            maxLen = n – min;
        }    
    }
    
    
    ArrayRecord ar;
    if (maxLen == 0) {
        ar.a = NULL;
        ar.len = 0;
    } else {
        ar.a = new double [maxLen];
        ar.b = new double [maxLen];
        
        int k = 0;
        for (int i = minFinal; i <= maxFinal; i++){
            ar.a[k] = a[i];
            ar.b[k++] = b[i];
        }
        
        ar.len=maxLen;
    }
    return (ar);
}

/*
=======================
Extraction function

Parameters:
x – first series
y – second series
n – number of data points in both series (assumed the same for both series)

Returns: Pearson coefficient between two series
=======================
*/

double pearson(double * x, double * y, int n ) {    
    ArrayRecord ar;
    ar = compress (n, x, y);
    
    int i;    
    double r = 0.0, xbar = 0.0, ybar = 0.0, sx = 0.0, sy = 0.0;
    
    
    double * xx = ar.a;
    double * yy = ar.b;
    n= ar.len;
    
    if (n < minimalLength) return 0.0;
    
    
    for(i = 0; i < n; ++i) {
      xbar += xx[i];
    }
    xbar /= n;

    for(i = 0; i < n; ++i) {
      ybar += yy[i];
    }
    ybar /= n;

    for(i = 0; i < n; ++i) {
      sx += (xx[i] – xbar) * (xx[i] – xbar);
    }
    sx = sqrt((sx / n));

    for(i = 0; i < n; ++i) {
        sy += (yy[i] – ybar) * (yy[i] – ybar);
    }
    
    sy = sqrt(sy / n);

    for( i = 0; i < n; ++i ) {
      r += (((xx[i] – xbar) / sx) * ((yy[i] – ybar)/sy));
    }
    r /= (n);

    return r;
}

/*
=======================
TimeSeries holds timeseries with associated name,
length and other parameters
=======================
*/

class TimeSeries{
    public:    
        string name;
        double * values;        
        int n ;        
        TimeSeries(int n){
            
            this->n  = n;
            values = new double[n];
            
        }
        
};

/*
=======================
Program entry point

=======================
*/

int main(int argc, char ** argv) {
    srand(time(NULL));
        
    int n = 36;
    int i = 0;
    int j = 0;
    double threshold = 0.95;

    ifstream inFile (“input.txt”);
    vector<TimeSeries *> timeSeriesCollection;
    
    while (!inFile.eof()){
        TimeSeries * t = new TimeSeries(n);
        
        string s;
        getline(inFile,s);
        stringstream ss(s.c_str());
        
        ss >> t->name;
        i =0 ;
        while (!ss.eof()){
            ss >> t->values[i++];
        }        
        
        
        timeSeriesCollection.push_back(t);
        
    }
    
    ofstream outFile(“out.txt”);
    outFile << “————————————–” << endl << endl << endl;
    
    for (i = 0; i <  timeSeriesCollection.size(); i++){
        for (j = i + 1; j <  timeSeriesCollection.size(); j++){
            double p = pearson(timeSeriesCollection[i]->values, timeSeriesCollection[j]->values, n);             
                    
            if (p > threshold) {
                outFile << timeSeriesCollection[i]->name
                        << ‘\t’
                        << timeSeriesCollection[j]->name
                        << ‘\t’
                        << p                        
                        << endl;
            }
        }
        
    }    
    
    outFile << “————————————–” << endl << endl << endl;
        
    
    return(0);
}

signal_data.txt

pearson_interval_selector.cpp

Advertisements

~ by Monsi.Terdex on July 27, 2013.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Normal Boy

Nothing out of the ordinary

Data Engineering Blog

Compare different philosophies, approaches and tools for Analytics.

%d bloggers like this: