Analytics: Signal Detection – Part 05 – Solution (Benford’s Law)

There is a phenomenon called Benford’s law: the distribution of digits in accounting figures and many other real-world measurements is not uniform (the leading digit 1 turns up far more often than one would expect). So I decided to model a dataset with a synthetic distribution of digits. You would think that every digit has an equal chance of appearing. But if you combine all values for any particular signal into one string and look at the digit distribution within that string, you’ll notice that one (and only one) digit stands out from the rest. The tableau below shows this analysis for the previously posted dataset. It should immediately yield the correct answer: signal T is associated with Q, R with K, etc.

http://en.wikipedia.org/wiki/Benford%27s_law

[Figure: digit frequency tableau]

Digit frequency by signal type. Columns correspond to signals, rows to particular digits.

Now, the question is: how are you supposed to figure out this hidden feature? Looking at the raw numbers, I certainly can’t spot an anomalous distribution of digits with the naked eye. What makes it so hard to see is that we rarely think about digit distribution (hence the mysterious overtone of the previous post). Your boss sends you a dataset and asks you to figure out what is wrong with it, and your first instinct (as a business analyst) is to look at the digit distribution? Ha ha, unlikely, unless you’re reading books like ‘Alien IQ Test’ by Clifford Pickover.

So, is it possible to arrive at the solution through one of the standard analysis routes? I don’t know, you tell me…

/*
************************************************************
Author:		Monsi Terdex;
Date:		05/18/2013

Description:	
	- Benford's law puzzle
************************************************************
*/
#include <iostream>
#include <fstream>
#include <sstream>
#include <map>
#include <cstdlib>
#include <ctime>
 

using namespace std; 

map<int, int> frequencyTableau; 

/*
========================================
Returns a floating-point number with abnormally
high probability for the specified digit
The number will have one leading digit followed by 3 decimals

Parameters:
j      - the digit to have abnormally high probability
========================================
*/

double generateNumber(int j ){
	double r; 
	stringstream ss;

	ss << rand() % 10; 
	ss << '.'; 

	for (int i = 0; i < 3; i++){
		int m = rand() % 10;
		// In 2 cases out of 10, force the favoured digit j;
		// otherwise draw a digit uniformly from 0..9
		if (m < 2){
			frequencyTableau[j]++;
			ss << j;
		} else {
			int k = rand() % 10;
			frequencyTableau[k]++;
			ss << k;
		}
	}
	ss >> r;

    return (r);
}

/*
============================
Entry point
============================
*/

int main (int argc, char ** argv){
	srand(time(NULL));
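	ofstream outFile("out.txt");
	// Fixed notation with 3 decimals, so trailing zeros (e.g. 3.120) are not dropped from the output
	outFile.setf(ios::fixed, ios::floatfield);
	outFile.precision(3);
	ofstream outFileFrequency("out_frequency.txt");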

	for (int j = 0; j < 20; j++) {
		for (int i = 0; i < 1000; i++) 
			outFile << j << '\t' << i << '\t' << generateNumber(j % 10) << endl;	

		for (int i = 0; i < 10; i++){
			outFileFrequency << j << '\t' << i << '\t' << frequencyTableau[i] << endl;
			frequencyTableau[i]  = 0;
		}

	}

	return (0);
} 
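
Reading generateNumber, and assuming rand() % 10 is close enough to uniform, the size of the planted bias can be worked out by hand. Each decimal is forced to j in 2 cases out of 10, and in the remaining 8 it is a uniform draw that can still land on j:

```latex
\begin{aligned}
P(\text{decimal} = j) &= 0.2 + 0.8 \times 0.1 = 0.28 \\
P(\text{decimal} = k),\ k \neq j &= 0.8 \times 0.1 = 0.08
\end{aligned}
```

With 1000 numbers of 3 decimals each, that is 3000 decimals per signal, so out_frequency.txt should show roughly 840 hits for the favoured digit and roughly 240 for each of the other nine.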

~ by Monsi.Terdex on May 31, 2013.
