Decimal Expansion of Pi: Interesting Observations, Part I

•January 1, 2014 • Leave a Comment

Let’s read through the decimal expansion of pi until every digit from 0 to 9 has appeared at least once, and note how many digits we’ve read up to that point (call that count a completion run). The first such position in the decimal expansion of pi (including the leading 3) is 33. That is, starting from the digits 3, 1, 4, 1, … it takes 33 digits before every digit from 0 to 9 has been seen at least once. In parallel, let’s do the same for a completely random sequence of digits. Each time all ten digits have been encountered, we reset the count, so reading a stream of digits produces a sequence of completion runs. Reading pi from 3, 1, 4, 1, 5, … (until 1,000,000 digits have been read), we get completion runs of lengths 33, 18, 19, 27, 15 and so on. If the first 1,000,000 digits of pi are written out to pi.txt, the following C++ code accomplishes the task.
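The completion-run idea is easy to prototype before touching pi itself. Below is a short Python sketch (my own illustration, not from the original post) that splits any digit stream into completion runs:

```python
import random

def completion_runs(digits):
    """Split a digit stream into completion runs: each run ends the
    moment every digit 0-9 has been seen at least once."""
    runs, seen, length = [], set(), 0
    for d in digits:
        seen.add(d)
        length += 1
        if len(seen) == 10:
            runs.append(length)
            seen, length = set(), 0
    return runs

# Sanity check on a hand-made stream: 0..9 completes a run of length 10,
# then 1, 0, 1, 2, ..., 9 completes a second run of length 11.
stream = list(range(10)) + [1] + list(range(10))
print(completion_runs(stream))  # [10, 11]

# On random digits, run lengths hover around the coupon-collector
# expectation of about 29.3 draws.
random.seed(1)
runs = completion_runs(random.randrange(10) for _ in range(10**5))
```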

#include <iostream>
#include <fstream>
#include <cstdlib>
#include <ctime>
using namespace std;

int main(){
    ifstream inFile ("pi.txt");
    ofstream outFile ("out.txt");
    char c;
    int a[10];
    int g = 0;
    srand(time(NULL));
    for (int i = 0; i < 10; i++) a[i] = 0;

    while (inFile.read(&c, 1)) {           // read() as the loop condition avoids
                                           // processing the last byte twice on eof
        if (c < '0' || c > '9') continue;  // skip newlines and other non-digit bytes
        int j = c - '0';
        // j = rand() % 10; // uncomment to simulate a truly random sequence instead
        a[j]++;
        int k = 0;
        for (int i = 0; i < 10; i++) if (a[i] == 0) k = 1;

        if (k) {
            g++;
        } else {
            outFile << g + 1 << endl;      // the completing digit itself counts, hence g + 1
            g = 0;
            for (int i = 0; i < 10; i++) a[i] = 0;
        }
    }

    return 0;
}

What does the data show? Well, it’s interesting. Assuming the random number generator used in the code above produces a truly random sequence, the maximal completion-run length in the first million digits of pi is larger than that in the control random sequence. The table below shows counts of completion runs for pi and for the control sequence; every completion run length from pi is paired with the corresponding value from the control. For example, in my data, 33 (the length of the first completion run of pi) is paired with 15 (the length of the first completion run of the control sequence). Reading the table, the first million digits of pi contain a completion run of over 130 digits (157, to be exact), and no such run is observed in the control sequence. There is also a completion run of length 126 (between 120 and 130) with no counterpart in the first million ‘truly’ random digits.
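Before reading too much into the 157, it is worth estimating how unusual such a run would be under true randomness. A completion run is a coupon-collector experiment over 10 digits, and the probability that collection is still incomplete after n draws has a closed form by inclusion-exclusion. The sketch below (my own back-of-the-envelope check, not from the original post) computes that tail and the expected number of 157+ runs among the roughly 1,000,000 / 29.3 runs in a million digits:

```python
from math import comb

def tail(n, d=10):
    """P(coupon collection over d digit types is still incomplete
    after n draws), by inclusion-exclusion over the missing digits."""
    return sum((-1) ** (k + 1) * comb(d, k) * ((d - k) / d) ** n
               for k in range(1, d))

# A single completion run exceeds 157 digits only very rarely...
p = tail(157)

# ...and about 1e6 / 29.29 ~ 34,000 completion runs fit in a million
# digits, so the expected count of such runs under true randomness is small:
expected = (10**6 / 29.29) * p
print(round(expected, 3))  # ~0.022
```

So a 157-digit run has roughly a 2% chance of showing up in a million truly random digits: rare, but not astonishing on its own.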

Do you feel thrilled? Is this a glimpse of ’empirical’ evidence that the decimal expansion of pi is not truly random? Or a manifestation of imperfections in random number generators?

Completion run comparison_ pi vs true random sequence


Recursive Partitioning in R: Example of Customer Loyalty Data Mining

•December 1, 2013 • Leave a Comment

Suppose you have data for customers, the items they shop for, and a loyalty index for every customer. Your task is to make sense of the data: is there a way to conclude anything meaningful about customer loyalty from the items customers are shopping for?

There could be several ways. In the end, the findings of your research will have to be accessible to a non-technical, decision-oriented audience. That matters because many data-mining techniques involve esoteric methodology requiring considerable exposure to mathematics, statistics and computer science. Decision trees come to the rescue: they are easy to understand and there are freely available implementations. I’m going to cover decision tree construction in R using the rpart package, together with some primitive tree plotting.

The data comes from a three-column dataset: column C is the customer id, column I is the name of an item the customer shopped for, and column L is the loyalty index (one value per customer, although it repeats on every row for that customer). The task is to use the rpart package to create a regression tree predicting the loyalty index from the items a customer shopped for. First, we need to transform the dataset into usable form, i.e. cross-tabulate with respect to items, so that every column corresponds to an item and the values are 0 or 1 indicating whether the customer shopped for that item. The last column is the loyalty index.

require(rpart); # package which will perform regression tree analysis

require(rpart.plot); # fancy plotting for trees

d = read.table('clipboard', header=T, colClasses="character");

names(d) = c('C', 'I', 'L');

d$L = as.numeric(d$L);

t = as.data.frame.matrix(table(d$C, d$I)); # cross-tabulate customers vs items

t$C = rownames(t); # t lacks the customer id as a column, needed for the merge below

rr = aggregate(d$L, by=list(d$C), max); # making a separate dataset with one line per customer

colnames(rr) = c('C', 'L');

tm = merge(t, rr, by=c('C')); # inner join

# most convoluted piece of code: building the formula for the regression tree -
# necessary because the count of items could be very large
items = colnames(tm)[2:(length(colnames(tm)) - 1)]; # everything between C and L
p = paste0("L ~ `", paste0(items, collapse="` + `"), "`"); # backticks guard non-syntactic item names

tr = rpart(as.formula(p), method="anova", data=tm); # loyalty index is floating point, hence "anova"

prp(tr); # fancy tree plotting
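For readers outside R, the cross-tabulation-and-merge step above can be sketched in plain Python. This is a minimal illustration with made-up data (the customer ids, items and loyalty values below are my own placeholders, not from the post):

```python
# Build a 0/1 customer-by-item matrix plus a loyalty column from
# (customer, item, loyalty) rows -- mirroring the R crosstab + merge above.
rows = [
    ("c1", "milk", 0.9), ("c1", "bread", 0.9),
    ("c2", "milk", 0.4), ("c2", "eggs", 0.4),
]

customers = sorted({c for c, _, _ in rows})
items = sorted({i for _, i, _ in rows})
basket = {(c, i) for c, i, _ in rows}
loyalty = {c: l for c, _, l in rows}  # one value per customer

table = {
    c: {**{i: int((c, i) in basket) for i in items}, "L": loyalty[c]}
    for c in customers
}
print(table["c1"])  # {'bread': 1, 'eggs': 0, 'milk': 1, 'L': 0.9}
```

The resulting rows (item indicators plus the loyalty column) are exactly the shape the regression tree expects.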

Market Basket Analysis: Apriori Algorithm in R, Part 01

•November 1, 2013 • 3 Comments

At work, I like the idea of unconventional data explorations. Market basket analysis comes to mind, perhaps because it’s simple to explain to your sponsors/superiors, and also due to the relative technical ease of implementation. The steps outlined below are for R users and will get you from Excel-like data (i.e. text input copied to the clipboard) to transactions and rules.

The arules package in R has an apriori algorithm which allows one to construct implication rules for itemsets. Suppose you have some kind of itemset and items can show up in various groupings. You’d like to know if there is any co-occurrence between items or sets of items. For example, if many groupings that contain item A also contain item B, we can take that as an implication rule (association rule). So suppose your data consists of two columns in Excel: one column contains groupings and the other contains items observed in those groupings. Here is an example:

Groupings Items

A A
A E
A H
A I
A J
A K
A M
A Q
A U
A V
A X
B B
B E
B F
B I
B K
B L
B N
B O
B Q
B R
B V
B Y

So we have two groupings, A and B, and a number of items. For R’s apriori algorithm to work, it must see the data as a set of “transactions”: each transaction corresponds to a grouping above, and each column corresponds to an item from the dataset. Whenever an item is present within a grouping/transaction, we record 1, else 0 (a so-called binary matrix). In our example, we have:

Grouping A B E F H I J K L M N O Q R U V X Y
A 1 0 1 0 1 1 1 1 0 1 0 0 1 0 1 1 1 0
B 0 1 1 1 0 1 0 1 1 0 1 1 1 1 0 1 0 1

Again, columns are items (that’s why there are so many of them in this example), rows are groupings (only two, just as in the data above) and values are binary indicators, e.g. is item Y present in grouping B? Now the question is: given the initial columnar data format, how do we get it to look like the one above? Copy the data in its original form to the clipboard. On the R console type:

d = read.table('clipboard', header=T); # [1]

t = table(d$Groupings, d$Items); # [2]

d = as.data.frame.matrix(t); # [3]

tr = as(as.matrix(d), "transactions"); # [4]

rules = apriori(tr, parameter=list(supp=0.95, conf=0.95, target="rules")); # [5]

inspect(rules); # [6] example output: {I, K, V} => {Q}

The above instructions cross-tabulate the initial input (line [2]), construct the matrix (line [3]), convert it to transactions (an input the arules package understands, line [4]) and generate the rules. As you can see, the parameters are a support (i.e. frequency within the input dataset) of 0.95 or higher and a confidence of 0.95 or higher; I am not specifying the lift parameter. Next time I will discuss the conceptual background behind association rule mining, in the context of R.
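To make support and confidence concrete, here is a small hand-rolled Python check of the definitions on the two example groupings (my own illustration, not the arules implementation):

```python
# The two example groupings/transactions from the post, as item sets.
transactions = [
    set("AEHIJKMQUVX"),   # grouping A
    set("BEFIKLNOQRVY"),  # grouping B
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

# The rule {I, K, V} => {Q} reported by inspect(rules):
lhs, rhs = {"I", "K", "V"}, {"Q"}
print(support(lhs | rhs), confidence(lhs, rhs))  # 1.0 1.0

# With only two transactions, only items present in both can clear
# a 0.95 support threshold:
frequent = sorted({i for t in transactions for i in t if support({i}) >= 0.95})
print(frequent)  # ['E', 'I', 'K', 'Q', 'V']
```

This explains why every rule apriori reports at supp=0.95 here is built from the handful of items shared by both groupings.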

Maximal Surpasser Count Problem (Binary Search Tree Solution)

•September 23, 2013 • Leave a Comment

I really like Pearls of Functional Algorithm Design by Richard Bird. The book just has that unique quality to it: it makes you want to think. One problem got my attention: the maximal surpasser count. You have an array of integers, and for every element of the array you count the number of elements to the right that are strictly greater than that element. For example, the array may be:

82    74    17    93    96    20    25    55    15    24    25    56

and the resultant surpasser count array is:

2    2    8    1    0    5    2    1    3    2    1    0

and the maximal surpasser count is 8. Note that the problem asks to find the 8, not to construct the array in full, but we’ll do that anyway.
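As a baseline (my own sketch, not from the book), the surpasser counts can be computed by the obvious quadratic scan, which is handy for checking any faster solution against the example above:

```python
def surpasser_counts(a):
    """O(n^2) baseline: for each element, count the strictly greater
    elements to its right."""
    return [sum(y > x for y in a[i + 1:]) for i, x in enumerate(a)]

a = [82, 74, 17, 93, 96, 20, 25, 55, 15, 24, 25, 56]
counts = surpasser_counts(a)
print(counts)       # [2, 2, 8, 1, 0, 5, 2, 1, 3, 2, 1, 0]
print(max(counts))  # 8
```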

So the challenge is to figure out an O(n log n) time algorithm to solve this problem. The space complexity is not specified. The book outlines a convoluted (and hence interesting) solution in Haskell, which is absolutely a must-learn language for any computer-science-minded individual. But let’s try to solve it using a binary search tree only (O(n log n) on average; a balanced tree would be needed for a worst-case guarantee). If we succeed, we can code the problem in an imperative language using familiar algorithms, also in O(n log n) time.

The idea is simple. Every node in the tree has three fields: the key, i.e. the number in the array that the node corresponds to; the number of nodes inserted to its right (A); and the maximal surpasser count so far (B). There is one node for every distinct element of the array. We read the array from right to left. The first node gets a key of 56 and both counters set to 0. The next number is 25, which is inserted as the left child of 56; because we made a move to the left, the new node adds 1 to its count of surpassers. Next is 24, and since it requires two moves to the left, it gets 2 as its count of surpassers. Skipping 15, when we get to 55 we make one move to the left but then one to the right, so node 55 gets a surpasser count of 1 (correct). But what about node 25? Remember, 55 is inserted to the right of 25, so node 25 has its B incremented, and the next node inserted through 25 will pick up that B.

Let’s take a look at what the solution looks like in C++:

#include <iostream>
#include <vector>
#include <algorithm>

using namespace std; 

class Node;
vector<int> vc; // surpasser counts, in reverse reading order

class Node {
  public:
    Node * left;
    Node * right; 
    int k;     // key
    int c;    // count of duplicate nodes
    int surpasserCount, largerThanCount; 
    Node(){    
      right = left = NULL;
      surpasserCount = largerThanCount = k = c= 0 ;
    }
};

void insert(const int k, Node * r, int m){
  if (k == r->k){
    r->surpasserCount = r->largerThanCount + m ;     
    r->c++;
    vc.push_back(r->surpasserCount);

  } else if (k < r->k) {
    if (r->left){
      insert (k, r->left, m+1 + r->largerThanCount+r->c);
    } else {

      r->left = new Node; 
      r->left->k = k;       
      r->left->surpasserCount =  m+1 + r->largerThanCount + r->c;
      vc.push_back(r->left->surpasserCount);
    }

  } else {
    r->largerThanCount ++; 
    if (r->right){

      insert (k , r->right, m); 
    } else {

      r->right = new Node; 
      r->right->k = k;       
      r->right->surpasserCount = m + r->c;
      vc.push_back(r->right->surpasserCount);
    }  
  }
}

int main(int argc, char ** argv) {
  int k[] = {82, 74, 17, 93, 96, 20, 25, 55, 15, 24, 25, 56};
  int l = 12; 

  Node * r = new Node;
  r->k = k[l-1]; 
  vc.push_back(0);

  for (int i = l-2; i >= 0; i--){
    insert (k[i], r, 0);
  } 
  for (int i = vc.size()-1; i>=0;i--){
    cout << vc[i] << " ";
  }
  cout << endl;
  cout << "Maximal surpasser count is: " << * max_element(vc.begin(), vc.end()) << endl;
  return(0); 
}

http://www.amazon.com/Pearls-Functional-Algorithm-Design-Richard/dp/0521513383#reader_0521513383

Pearls of Functional Algorithm Design

In addition, below is a zip file (renamed as .doc for upload purposes) which contains a number of PDFs with Small Programming Exercises by Rem (from the Science of Computer Programming journal, dating back to the 1990s or so). The maximal surpasser count problem is contained in one of these.

Small Programming Exercises.zip

The Truth is Out There: Probability of Upcoming Earthquake in New York City – First Look at Data

•September 8, 2013 • Leave a Comment

Recently it has been two years since August 23rd, 2011, when a relatively strong earthquake shook New York City and the surrounding boroughs. The shock was strong enough to hit major media outlets. Of course it was nothing compared to what other regions of the world and of the US get much more often, but fellow New Yorkers were shaken; I remember rushing down the stairs of my house (I was on vacation that week). Many wondered at the time whether there had been any signs that an earthquake was approaching. In this day and age data is all the rage, so if someone attempts a forecast, it should be based on seismic recordings, many of which are now available on the net.

So, continuing the rubric of real-world data exploration (as opposed to synthetic datasets contrived to hone analytical skills and conduct interviews), this time I’m ‘diving’ into seismic data analysis. It won’t be overly detailed, as I’m not a seismologist, but it should awaken your interest in the topic and, hopefully, motivate you to go out there, ‘grab’ the data (which sounds easier than it actually is) and mine it. Share with me whatever you find as well.

As for my sources, I used this:

http://www.iris.edu/SeismiQuery/sq-eventsmag.htm

Region 34, code 472 (New York) is narrow enough to return enough data points.

First look? Well, let’s just say the August 2011 earthquake wasn’t the first one as far as magnitude is concerned. It seems we had comparable earthquakes back on April 20th, 2002 (magnitude 5.3, depth 11 km) and August 4th, 2004 (magnitude 3.8, depth 4 km).

What is more curious: when I look at the count of events registered by the seismic networks monitoring New York, the past 3 years seem abnormally low with respect to the previous five. Looking at the time series plot of the magnitudes of earthquakes registered in the New York region, it seems there has been a considerable ‘slowing down’ in activity.

What could that mean? Well, perhaps fellow New Yorkers are due for another earthquake soon… Or, maybe, there is a lag in data collection?

New York Region_ seismic events by year

New York Region_ maximal magnitude of seismic events by date

Attached is an ASCII tab-delimited format ready for import into Excel as a relational table.

New York Region_ earthquake data_ ASCII.txt

Fuzzy String Matching in Excel and Access: Longest Common Subsequence

•August 29, 2013 • 4 Comments

You’ve already seen a post on fuzzy string matching. As is intuitively clear, there is no right or wrong solution to this problem, because there is no universal definition of string similarity.

Computer scientists often speak of subsequences of a sequence: take a subset of the indices of a sequence, sort them in ascending order and map them to the corresponding elements of the sequence, and you’ve got yourself a subsequence (a proper subset of indices gives a proper subsequence). How can we use this in name matching, for example?

Suppose we have two cells in Excel, one with value ‘John’ and one with ‘Joohan’ (two erroneous insertions). The longest common subsequence is then ‘John’ and has length 4. This length equals the length of the first string but not the second, so it’s not a perfect match. One possible way to account for the disparity is to divide the length of the common subsequence by the length of the first string, and add to this ratio the same ratio for the second string. In the case of a perfect match, both ratios are 1 and the resulting metric is 2. We need to normalize the metric so that a perfect match scores 1; a straightforward way is to weigh each component by its share of the total length of both strings. The following manipulation shows what the metric reduces to:

lcs / len1 * len1 / (len1 + len2) + lcs / len2 * len2 / (len1 + len2) = 2 * lcs / (len1 + len2)
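The score is easy to cross-check outside VBA. Here is a Python sketch (my own illustration) of the standard dynamic-programming LCS length and the normalized score derived above:

```python
def lcs_length(s1, s2):
    """Classic dynamic-programming longest common subsequence length."""
    m, n = len(s1), len(s2)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c[m][n]

def similarity(s1, s2):
    """2 * lcs / (len1 + len2): equals 1.0 only for identical strings."""
    return 2 * lcs_length(s1, s2) / (len(s1) + len(s2))

print(lcs_length("John", "Joohan"))  # 4
print(similarity("John", "Joohan"))  # 0.8
```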

The algorithm below is directly from Excel’s VBA editor but could be used in Access (as usual):

' ======================
' Provides length of longest common subsequence
' (note: the fixed array bounds below limit inputs to 100 characters)
'
' Parameters:
' s1, s2 - sequences to be compared
'
' ======================
Public Function getLongestCommonSubsequenceLength(ByVal s1 As String, ByVal s2 As String) As Integer
	Dim m As Integer
	Dim n As Integer
	m = Len(s1)
	n = Len(s2)
	Dim b(0 To 100, 0 To 100) As Integer
	Dim c(0 To 100, 0 To 100) As Integer
	Dim i As Integer
	Dim j As Integer
	For i = 1 To m
		c(i, 0) = 0
	Next i
	For j = 1 To n
		c(0, j) = 0
	Next j
	For i = 1 To m
		For j = 1 To n
		  If Mid(s1, i, 1) = Mid(s2, j, 1) Then
			 c(i, j) = c(i - 1, j - 1) + 1
			 b(i, j) = 1

		  ElseIf c(i - 1, j) >= c(i, j - 1) Then
			 c(i, j) = c(i - 1, j)
			 b(i, j) = 2
		  Else
			 c(i, j) = c(i, j - 1)
			 b(i, j) = 3
		  End If
		Next j
	Next i
'
	getLongestCommonSubsequenceLength = c(m, n)
End Function
' ======================
' Provides fuzzy comparison score for string matching
' using longest common subsequence
'
' Parameters:
' s1, s2 - sequences to be compared
' ======================
Public Function compareTwoSequences(ByVal s1 As String, ByVal s2 As String) As Double
	Dim l1 As Integer
	Dim l2 As Integer
	Dim l3 As Integer
	l1 = Len(s1)
	l2 = Len(s2)
'	
	l3 = getLongestCommonSubsequenceLength(s1, s2)
'	
'	
	compareTwoSequences = 2 * l3 / (l1 + l2)
End Function

LongestCommonSubsequence.bas

Android Application: Harnessing Text-To-Speech to Create an Audio Math Gym

•August 3, 2013 • 1 Comment

You might have noticed that I place a great premium on basic arithmetic facility. I genuinely believe it to be a cornerstone of an analyst’s competencies. This summer I am losing weight, and am faced with dozens of minutes away from a computer, a book or anything otherwise meaningful (so, intellectually, I am in the doldrums). Having exhausted my auditory and mental ability to withstand the same musical rhythms over and over again, I finally did it: I wrote a math gym application in Java for my Android phone. It reads out a sequence of 10 random operations (additions or subtractions) on random numbers and keeps track of the sum. After the 10th number it pauses and reads the correct answer. The idea is that while it reads the numbers, you compute the result mentally and then compare your answer with the correct one. The beauty is that you don’t need to do anything with the device (doing anything with your hands while on a glider is not really comfortable). The program will start reading you the next sequence, and so on ad infinitum until you terminate it. There are settings to change the speed of the voice.
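The quiz-generation logic described above is simple enough to sketch in a few lines. This is a Python illustration of the idea (the actual app is in Java, and the parameter names here are my own):

```python
import random

def make_quiz(n_ops=10, lo=1, hi=99, rng=random):
    """Generate a spoken-style arithmetic quiz: a starting number followed
    by n_ops random plus/minus operations, plus the running-total answer."""
    total = rng.randrange(lo, hi + 1)
    script = [str(total)]
    for _ in range(n_ops):
        op = rng.choice(["plus", "minus"])
        x = rng.randrange(lo, hi + 1)
        total += x if op == "plus" else -x
        script.append(f"{op} {x}")
    return " ".join(script), total

random.seed(0)
text, answer = make_quiz()
print(text)    # the sequence the app would read aloud
print(answer)  # the sum the listener should arrive at
```

In the app, each generated string would be handed to the text-to-speech engine, with a pause before the answer is spoken.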

Below is the zip file with the app and source code (it’s named HelloWorld but it’s an actual app). I really hope some dedicated Android developers will pick up on this idea and make a more solid application out of it.

AudioMathLearn.zip
