‘A User’s Guide to Principal Component Analysis’ by Edward Jackson comes to the rescue. It offers a detailed exposition of how to perform PCA, along with intermediate-stage results so that the learner can confirm they’re on the right track. It is an invaluable work for a student of such an important technique.

Perhaps I should make a brief note on why PCA is important: it is one of the most popular methods of multivariate data analysis. Suppose you have a dataset describing the daily closing prices of a selection of stocks. You can create a correlation matrix of the stocks and perform PCA on it (since different stocks have different average prices, the correlation matrix is a better choice than the covariance matrix). The principal components will show the most important dimensions; each component will have negative and positive contributions from each stock. Create a scree plot to see how much variance is accounted for by each eigenvector, select a cutoff point, and approximate your daily data using just a few eigenvectors. Then see which of your data points (days, in this example) are outliers, i.e. cannot be explained by the selected eigenvectors.

I would recommend reading chapter 5; it requires an audiometry dataset, which I include below. In addition, I include my R code, which should enable you to produce the y scores, x^ (i.e. the original data approximated using a selection of PCs), the residuals, and the scatter plot on page 120.

```
###############################################
# Author: Monsi Terdex
# Date: 07/27/2014
#
# Descr:
#
# This is an R implementation of PCA
# using the audiometry dataset in Chapter 5 of
# A User's Guide to Principal Component Analysis
# by Edward Jackson (arguably the most thorough
# practical introduction to PCA)
#
# Warning: this code is not optimized,
# and is, in fact, deliberately verbose
# and inefficient to show how to perform
# these computations step by step in R
#
# The results match those reported by Jackson
###############################################

# dSource is the audiometry data object;
# load it using the RData file
d = dSource;
for (i in 1:9) {
  d[, i] = as.numeric(d[, i]);
}

# tb0 is the 'final' data source used below, i.e. you can,
# in theory, substitute your own data source into tb0 and the
# downstream logic should work just fine
tb0 = d[, 2:9];

# Creating a correlation matrix.
# This is due to unequal variance amongst the variables;
# otherwise you may want to use the covariance matrix.
# Note that the logic below IS specific to the correlation
# matrix and will not work for the covariance matrix.
cr = cor(tb0);

# Computing eigenvectors and eigenvalues of the correlation matrix
evv.cr = eigen(cr)$vectors;
evr.cr = eigen(cr)$values;

# Computing the covariance matrix just in case, plus the
# corresponding eigenvectors and eigenvalues
cv = cov(tb0);
evv.cv = eigen(cv)$vectors;
evr.cv = eigen(cv)$values;

# Scree plot matches Figure 5.1 on page 112:
plot(evr.cr, pch=15, cex=1, xlab='Root number', ylab='Characteristic root');

# Computing the D^-1 matrix as defined in the text:
# this is a diagonal matrix of standard deviations of the
# original data. For D^-1, the reciprocal of each sd is used.
stDev = diag(1.0 / apply(tb0, 2, sd));
# Computing the D matrix
stDevInv = diag(apply(tb0, 2, sd));

# Will be adjusting the original eigenvectors of the correlation
# matrix, so saving them separately
evv.cr.adj = evv.cr;

# Taking the count of columns in the original dataset tb0;
# this is 8 for the audiometry example
nn = ncol(tb0);

# The count of eigenvectors to be used to approximate the original
# data. This count is 4 for the example in the text.
nns = 4;

# Dividing each eigenvector of the correlation matrix by the
# square root of its eigenvalue
for (i in 1:nn) {
  evv.cr.adj[, i] = evv.cr.adj[, i] / sqrt(evr.cr[i]);
}

# Doing the same for the inverse case (this is to compute the V*
# matrix in the text)
evv.cr.adj.inv = evv.cr;
for (i in 1:nn) {
  evv.cr.adj.inv[, i] = evv.cr.adj.inv[, i] * sqrt(evr.cr[i]);
}

# Erasing the eigenvectors we do not need.
# For the audiometry example, only the first 4 vectors are needed;
# the rest are set to 0.
evv.cr.adj[, (nns+1):nn] = 0;

# Computing the W and V matrices as per the text
wr = stDev %*% evv.cr.adj;
vr = stDevInv %*% evv.cr.adj.inv;

# This is the adjustment of the original data for mean values.
# There is a more elegant way to do this in R using the 'scale'
# function, but going column by column makes the mechanics clear.
d.adj = tb0;
for (i in 1:nn) {
  d.adj[, i] = d.adj[, i] - mean(d.adj[, i]);
}

# y scores computed using the W matrix; they match those on page 117
# (Table 5.13) except for sign, but the author cautioned not to panic
# since this is a result of the algorithm employed
yScores = t(wr) %*% t(d.adj);

# xhat will have the same shape as the y scores, so just copying
# them here as a placeholder
xhat = yScores;
for (i in 1:nrow(tb0)) {
  xhat[, i] = (vr %*% yScores[, i]) + as.matrix(apply(tb0, 2, mean));
}
# Transposing the matrix
xhat = t(xhat);

# Adjusting for standard deviation: as before, there is a more
# elegant method in R to do this, but for now just showing the
# mechanics. Copying the original data as placeholders.
rr1 = tb0;
rr2 = tb0;

# rr1 holds the residuals in standard units, i.e. divided by the
# standard deviations of the original data columns
for (i in 1:nrow(tb0)) {
  rr1[i, ] = (tb0[i, ] - xhat[i, ]) / as.matrix(apply(tb0, 2, sd));
}

# Predicted (i.e. xhat) values in standard units
for (i in 1:nrow(tb0)) {
  rr2[i, ] = xhat[i, ] / as.matrix(apply(tb0, 2, sd));
}

# The scatter plot below matches Figure 5.2 on page 120 of the text
plot(rr1[, 8] ~ rr2[, 8], cex=1.0, pch=19, xlab='x8^ / sd(x8)', ylab='(x8 - x8^) / sd(x8)');

# Q values computation; the residuals are in rr1
q = numeric(nrow(tb0));
for (i in 1:nrow(tb0)) {
  rr5 = as.matrix(rr1[i, ]);
  q[i] = rr5 %*% t(rr5);
}

# Now computing the critical Q value, page 36
theta1 = sum(evr.cr[(nns+1):nn]);
theta2 = sum(evr.cr[(nns+1):nn]^2);
theta3 = sum(evr.cr[(nns+1):nn]^3);
h0 = 1 - 2 * theta1 * theta3 / (3 * theta2 * theta2);
cAlpha = qnorm(0.95, mean=0, sd=1);
q.critical = (cAlpha * sqrt(2.0 * theta2 * h0 * h0) / theta1 + theta2 * h0 * (h0 - 1) / (theta1^2) + 1);
q.critical = q.critical ^ (1.0 / h0);
q.critical = q.critical * theta1;
# q.critical is 2.47, matching page 116 for the audiometry example

# Computing the Hotelling T^2 values using a predetermined number of
# eigenvectors to approximate the data (i.e. nns), which is 4 for
# the audiometry example
tsq = diag(t(yScores[1:nns, ]) %*% yScores[1:nns, ]);

# Computing the critical T^2 value: if an observation exceeds this
# critical T^2 value, it's an outlier. Page 116 of the text indicates
# the value should be 10.23, but the calculation below gives 10.17.
# This could be due to a data entry error on my part, or perhaps the
# author used data with higher precision than what is presented in
# the text (he seems to indicate that on page 105: 'for simplicity
# in display, fewer digits will be presented than would normally
# be required').
df1 = nns;
df2 = nrow(tb0);
tsq.critical = qf(0.95, df1, df2 - df1) * df1 * (df2 - 1) / (df2 - df1);

# Creating a binary vector of observations which exceed the critical
# T^2 value
tsq.compare = ifelse(tsq > tsq.critical, 1, 0);

# Creating the final dataset comparable to that on page 117
final = cbind(t(yScores), tsq, tsq.critical = rep(tsq.critical, df2), tsq.compare, Q = q, Q.critical = q.critical);

# Selecting outlier observations based on T^2
f = final[final[, which(colnames(final) == 'tsq.compare')] == 1, ];
```

# use opinionitis encryptor post for the following

Tobk qk i dgmz imbj qr oox xnaw, ujpphrelz tmlo KL4 cjpphrevd (bma awruxwn yzeb) wlsaesdm ‘Grciiwz’. Gox bddt ek znjnmsz xs ‘llo’ ctwljfqgz sjd dvwdmd qv fox xkds.

Audiometry RData (Chapter 5)


This latter technique got my attention because the output of a neural network is generally scaled to some finite interval (like -1.0 to 1.0), whereas a stock price can vary outside those boundaries and has no theoretical limit. One could say that only the direction of change is important, so -1 would mean the price going down and +1 that it goes up, but what if the goal is to predict an actual value (for whatever reason) and not the direction of change?

Genetic algorithms, while providing for arbitrarily scaled output, suffer from another drawback: the length of a solution (i.e. the count of genes in a chromosome) is fixed. Genetic programming (GP) takes care of that by evolving solutions in the form of syntax trees.

I started experimenting with GP back in June 2010 but quickly dropped the effort because I attempted to create the entire infrastructure (i.e. GP operations such as crossover, mutation, etc.) from scratch. Even back then, there were freely available GP libraries. So recently, I decided to resurrect the GP exploration effort and compiled some sample C++ code using the library below. From a programmer’s standpoint, I believe studying, testing and manipulating someone else’s code is a much more intellectually demanding activity than designing your own solution (arguably it is also more applicable job-wise, as you are much more likely to be given existing code to fix than to create an entirely new solution). But that is water under the bridge.

http://www0.cs.ucl.ac.uk/staff/ucacbbl/ftp/weinbenner/gp.html (the library seems to have been developed by Thomas Weinbrenner, but I can’t find a recent reference).

Naturally, the specific problem of interest that got my attention was symbolic regression; this is for forex data analysis (or any financial time-series prediction). The idea is that GP will receive pairs of date/time and closing-price combinations and will evolve an approximator function for the price (which will probably not depend on the actual date/time). This, of course, is a simplistic approach; one can devise a sophisticated dataset consisting of ‘features’ (like the direction of moving-average comparisons) and a trend indicator, and GP will then construct an approximator for the trend indicator as a function of those features.

There are many fascinating questions that come up as you go through GP experiments. The one thing you quickly learn is that they are just that: experiments. A purely analytical approach yields very little progress in GP; changing the population size or mutation probability may or may not positively impact the fitness of your final solution. Even the definition of fitness itself is something you will have to create. Some argue this spells doom for all evolutionary computing, since evolved solutions cannot, by themselves, be explained. Others would say the experimental approach opens up fertile ground for research efforts. Without getting into these questions, GP has been around for some time, is likely to stay, and is worth getting your hands dirty with.

So, what was done? The symbolic regression program (symbreg.cc) was modified to include the sin(x) function in its function set and to include a special print-out function for the solution, which allows one to translate an ADF (automatically defined function) into an actual string representation. GP had to approximate the following function: sin(x)^4 + sin(x)^3 + sin(x)^2 + sin(x).

Why was the special print-out function introduced? If you download the original library referenced above, compile it and run symbreg, it will produce a data.dat file which contains the best-fitness individual by generation. The problem, however, is that the best-fitness individual is described like this:

```
Best of generation 20 (Fitness = 0.00182708, Structural Complexity = 25)
GP: ADF0 (sin(x),x%x)
ADF0: x1*x1*x1*x2*(x1+x2)+x1*x1+x2*x1
```

Here, we see that the root node of the best individual is an ADF of two arguments and, on the second line, we have the definition of the ADF. This, however, is not very useful, as I would have to take the definition of the ADF, replace every instance of x1 with sin(x) and x2 with x % x (which, by the way, stands for protected division, not modulo as many of you might think). This can obviously be done by hand for small examples like this one, but more complicated examples get too involved. Why do we need such a representation? If you want to observe the actual solution in action, you would most likely want to put it as a complete formula into R or Excel, right next to the target formula, to compare the differences.

Next steps:

– add terminals with constant values (yes, symbreg from the link above does not include that)

– add lag operators which will read the time series a specified number of steps backwards. Currently the system understands only the most simplistic definition of a terminal, i.e. x means literally the current value of the time series, not the value 1 time point back, etc.

– add the ability to read multiple time series

– add boolean functions, like ‘and’, ‘xor’, etc.

– add conditionals

Here is the ‘doc’ file, which is actually a ‘zip’ with the full content of the project:

Symbolic Regression_ Genetic Programming_ CPP.zip

To compile and run:

```
.\g++ -w -c gpc\*.cc
.\g++ -o symbreg.exe *.o
symbreg.exe
```

Below is a good reference to the GP paradigm, besides, of course, the seminal work by Koza.


So, in a nutshell, I say something like ‘scientists go outside in, while analysts go inside out’. What do I mean by this?

Fundamentally, a scientist doesn’t know how the phenomenon under investigation is supposed to work. If I am looking at aircraft performance data as its speed approaches the sound barrier (a 1940s example), I do not know what to expect: all my formulas and theoretical understanding are just that, my attempt at modeling the world. I do not have a priori principles of nature, and hence I am looking at the data (going from the outside, the observables) to figure out how the phenomenon is structured (the inside).

In business, knowledgeable people typically know an a priori set of agreed-upon rules, and hence whatever data we observe is interpreted using those rules. There is no ‘theory’ as such; there is a body of knowledge that is undisputed. These rules (inside) are used to figure out whether the business is working as expected using the collected data (outside), so we’re basically going inside out (from theory, through corollaries, to observations captured in data).

I suppose scientists and analysts are two sides of the same coin. Suppose I am a scientist with a hypothesis: to verify it, I temporarily turn into an analyst, because I am interpreting the data for consistency with the corollaries of the hypothesis (the deductive step). Similarly, if I am a business analyst, I can temporarily turn into a scientist by going on ‘fishing expeditions’ and approaching the data as if completely unfamiliar with the rules of the business (the inductive step). But you get the idea.

Simple? Helpful? I don’t think so. But thanks for reading (and thank g-d, these are rare).


Why would such clustering be needed? Suppose you’re tasked with identifying all clients living at a particular address. We will take an address to be a tuple of (Street, City, State, Zip Code). You’re given a list of client-address pairs. A simplistic approach would be to count the number of clients by address. But the same address can be spelled differently; the solution is thus to map each address to the id of the cluster it belongs to, and group on that id when doing the count.

Fuzzy matching is done in the following way. Every address is compared with every other address in the file; if we have n addresses in total, we will have n * (n - 1) / 2 comparisons. Only the street-address part of the record is considered, not the city, postal code, etc. (this is easily modified). The address is stripped of punctuation marks and white space. Digits are extracted into one separate part, in the order they appear in the actual address, and the rest (i.e. the letters) are put into their own part. If the numerical parts of two addresses differ, we automatically qualify the addresses as distinct. If the numerical parts are identical but the address (string) parts are different, we compare them using the following algorithm: take the length of the longest common subsequence (LCS) and divide it by the average of l1 and l2 (the lengths of the address strings). The score is thus guaranteed to be between 0 and 1; the higher the score, the more similarity.

The LCS algorithm is pretty demanding in terms of computational resources, and it will consume much of the program’s runtime. The idea is to use the multi-threading capabilities offered by the C++11 standard (the MinGW POSIX compiler for Windows). At a very high level, each address pair can be compared by a designated processor, so the program is embarrassingly parallel. No pair of addresses is compared more than once. The expected speedup is linear in the count of processors utilized; you obviously should not exceed the count of threads your processor can actually support (eight was a typical maximum for desktop systems at the time of writing).

The difficulties arise at the clustering step.

Suppose a1 and a2 are two address records which are determined to be similar (i.e. their fuzzy matching score exceeds the threshold value). If neither a1 nor a2 has a cluster assigned, we generate a new surrogate cluster id and assign a1 and a2 to this new cluster. If a1 has been assigned to a cluster but a2 has not, we assign a2 to a1’s cluster; the reverse case is handled the same way.

The hardest part is when both a1 and a2 already belong to clusters c1 and c2. Then we have cluster fusion, i.e. we pick either c1 or c2 and substitute it for the other. If we pick c1, all of c2 becomes c1.

The solution is straightforward for a sequential program. But if you’re trying to parallelize it, it is pretty difficult to do in the right way; you will have to cheat. Here is why it’s difficult. Suppose a1 and a2 are being worked on by processor p1, and a3 and a4 by processor p2. Further, suppose these are the cluster assignments: (a1, c1); (a2, c2); (a3, c2); (a4, c3). p1 decides to make c2 = c1, so every record assigned to c2 must be modified to be assigned to c1. At the same time, p2 decides to make every record in c2 equal to c3. The problem is obvious.

One solution is to create a map from address record to cluster id and lock the map while working on it. This is the solution in this implementation. The trick here is that, most of the time, the program will actually be doing LCS, and it will be locking the map only when the pair of records under comparison qualifies as a match.

The code can be easily compiled with a MinGW build supporting the C++11 standard (the POSIX version for Windows). Here is the compilation command:

```
g++ EntityFuzzyComparator.cpp -lpthread -std=c++11
```

By default, the program looks for the file input_address.txt, which should contain two tab-delimited columns: the first column is an integer serving as the id of the address, and the second column is the address itself. The output is recorded into the out.txt file and is composed of two columns: the first is the id of the address (taken from the input file), the second is the id of the cluster it belongs to. Cluster ids serve no purpose other than to group the addresses.

As before, the extension on the file is ‘doc’ only for upload purposes; the actual file is C++ code.


But sometimes that’s not the case, and you may have to use some kind of proxy to derive unique-count values. Generally, unique counts are not additive: if I give you a summary containing a valid count of unique clients by geographic unit for a period of time, you can’t add those counts to derive the correct total count of unique clients, because one client could have been counted in at least two distinct geographic units (they moved at some point). But by adding unique counts obtained that way, you can get relatively close to the true count of unique values.

Why bother going through all the hoops to ‘measure’ the true unique count? Sometimes deriving the true unique count is not an option because the state of the databases has changed and the historical data is not reproducible. Suppose some division of the business uses the breakout of unique client counts by geographic unit I described above, they don’t provide the total true count of distinct clients, and you’re given a report 3 months old. There is no way for you to go back in time and derive the accurate count of unique clients, because all the databases have changed and the historical data is not available.

Suppose, though, that there are two dimensions, each with 10 possible attribute values. You’re given the count of true unique members by value in each dimension, but not overall. Your task is to derive a close estimate of the total unique count. The idea then is to add up the unique counts derived from the dimension with a higher amount of duplication, to lower the error. Why ‘anisotropic’? In science, anisotropic implies dependence on the direction of measurement. Loosely speaking, we can call the addition of unique counts to derive a total unique count an estimation, or a measurement, and the dimension of the unique-count values a direction. Then, depending on which dimension I choose for unique counts, I will have an anisotropic measurement, because my result will depend on direction (dimension).

The R simulation below illustrates just what happens. There is a total of 10 unique objects, each having dimensions B and C (A is object id). Each object has at least 3 distinct values in dimension B and no more than 3 in dimension C. Adding up unique counts obtained by grouping on dimension B will yield a closer estimate to true value of 10 than adding up the unique counts obtained by grouping on C.

```
r = NULL;
for (i in 1:10) {
  # how many attribute values in dimension B will this object have
  j = 3 + sample(1:3, 1);
  js = sample(1:10, j, replace=FALSE);
  for (k in 1:j) {
    # how many attribute values in dimension C (at most 3)
    m = sample(1:3, 1);
    ms = sample(1:10, m, replace=FALSE);
    for (l in 1:m) {
      r = rbind(r, c(i, js[k], ms[l]));
    }
  }
}
r = as.data.frame(r);
names(r) = c('A', 'B', 'C');
write.table(r, 'clipboard', sep='\t');
write.table(aggregate(data=r, A ~ B + C, function(x) length(unique(x))), 'clipboard', sep='\t');
write.table(aggregate(data=r, A ~ B, function(x) length(unique(x))), 'clipboard', sep='\t');
write.table(aggregate(data=r, A ~ C, function(x) length(unique(x))), 'clipboard', sep='\t');
print(length(unique(r$A)));
```


So I actually scanned the first billion digits of pi (as before) and constructed their overall distribution. Then I took all the completion runs > 100 in length and constructed the digit distribution over them as well. Then I compared the two.

First impressions: nothing unusual. Basically, as predicted, each digit is roughly 10% in both the overall distribution and the high-length completion-run distribution. But then I noticed a curious percentage change, which was definitely unexpected for me: within abnormally long completion runs, digit 0 has increased proportion-wise by the same amount as digit 9 has declined. The same is true for digits 1 and 8, respectively, but not for the other digits. Wow.

This is not something I expected, especially because random controls show no such redistribution when I compare completion-run lengths. Generally, there is no need for one digit to decline in proportion in perfect correspondence with another, especially when there are 9 other digits to choose from. It’s a redistribution: if there was a certain count of 0s to begin with, some of them could be converted to 1s, 2s, etc., but not necessarily 9s. Why would 0s and 9s be connected? Same for 1s and 8s. I obviously don’t have an explanation for this observation, so it gets curiouser and curiouser.

By the way, while I’ve been focusing on ‘empirical’ analysis of the decimal expansion of pi in terms of what I dubbed ‘completion runs’, I found out these runs are connected to a well-known mathematical problem called the ‘coupon collector’s problem’, which asks for the expected length of such a run. The solution is rather simple, and it does predict that if one is given the problem of collecting all 10 coupons (in our case, digits), it will take, on average, about 29 attempts before all 10 have been seen at least once.

Here is the link:

http://en.wikipedia.org/wiki/Coupon_collector%27s_problem


Last month’s post focused on the decimal expansion of pi, the all-important mathematical constant which has fascinated mathematicians for centuries and caused many a rhapsody on the beauty of nature (fill in the rest). The number has been computed through the 1 billionth digit of precision (and probably beyond) and saved in a text file here:

http://web.mit.edu/afs/sipb/contrib/pi/

Question: is pi random? As you can probably figure out from last month’s post, it didn’t seem so. The approach was: take the pi sequence 3, 1, 4, 1, etc. and create a sequence of completion runs. Each completion run ends once you’ve read all digits from 0 to 9 at least once. For every sequence of digits from 0 to 9, there is a corresponding sequence of completion runs. Last month’s study focused on the first million digits of pi; this month it’s one billion. In the previous study, I structured the problem as if there were two sequences running in parallel: one is the sequence of pi digits, the other is a sequence of completely random digits. Both sequences are read in parallel, and completion-run lengths are recorded as we progress through them. The question is: can we discern any anomalous, non-random behavior in the decimal expansion of pi through data on completion runs? We already saw that for the first 1,000,000 digits of pi, there seemed to be a few more ‘abnormally’ long completion runs compared with a corresponding control ‘random’ sequence. The ‘random’ property is of course questionable due to pseudo-random number generation. This, in fact, puts the whole approach into question, as the ‘randomness’ is established using pseudo-random numbers, but that is a limitation we will have to accept.

So, this month, I show the completion-run length distribution for the first billion digits of pi and compare it with the same distribution for billion-length sequences obtained using the rand() function in C, which is supposed to be ‘fairly’ random. In fact, I ran the random control twice, using the same random number generator with a different seed. The results are in the table. Completion-run lengths are stratified into intervals of 10, and the count of runs in each interval is compared for pi and one of the random controls. Each frequency for pi is compared with the corresponding frequency for the control.

There seems to be something interesting with respect to the sign of the difference. In particular, when comparing the counts of completion runs grouped into buckets of 10, the pi counts are higher than those coming from the random controls. The two random controls (both obtained using the rand() function in the C standard library with different seed values) are predictably in agreement with each other. This is easy to spot just by looking at the count of completion runs for pi’s billion digits, 34,140,378, and comparing it with the first control, 34,135,390. If the random generator is indeed reliable, this would mean that pi has 4,988 more completion runs than a corresponding purely random sequence. Now, let’s generate another random sequence (control 2) and compare control 1 with control 2 to see the difference in completion runs. Well, it’s only 1,559. And, when comparing the distributions, there is a lot more uniformity between length buckets: control 1 is higher than control 2 in 8 buckets and lower than control 2 in 8 buckets, for a net of zero bucket-wise differences. But for pi the situation is quite different when compared with both control 1 and control 2: pi is higher than control 1 in 6 buckets (net) and higher than control 2 in 5 buckets. So not only is the overall count of completion runs higher for pi, but the distribution of the run lengths is different.

Clearly something is going on, either with pi or with the random number generator. Richard Feynman (whose name has already shown up on this blog before) once spotted a sequence of six 9s surprisingly early in the decimal expansion of pi (at the 762nd position), and it came to be called the Feynman point.

http://en.wikipedia.org/wiki/Feynman_point


```
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <ctime>
using namespace std;

int main() {
    ifstream inFile("pi.txt");
    ofstream outFile("out.txt");
    char c;
    int i = 0;
    int a[10];   // how many times each digit was seen in the current run
    int g = 0;   // length of the current completion run so far
    srand(time(NULL));
    for (i = 0; i < 10; i++) a[i] = 0;
    // the read itself is the loop condition, so a failed read at the
    // end of file is not processed as an extra character
    while (inFile.read(&c, 1)) {
        int j = c - '0';
        // skip non-digit characters such as the decimal point in "3.14..."
        if (j < 0 || j > 9) continue;
        // j = rand() % 10; // uncomment to simulate a truly random sequence
        a[j]++;
        int k = 0;
        for (i = 0; i < 10; i++) if (a[i] == 0) k = 1;
        if (k) {
            g++;
        } else {
            // all 10 digits seen at least once: record the run length
            // and start a new run
            outFile << g + 1 << endl;
            g = 0;
            for (i = 0; i < 10; i++) a[i] = 0;
        }
    }
    return 0;
}
```

What does the data show? Well, it’s interesting. Assuming the random number generator employed in the code above produces a truly random sequence, it seems that the maximal length of completion runs in the first million digits of pi is larger than that in the control random sequence. The table below shows counts of completion runs for pi and the control random sequence. Every completion-run length from pi has a corresponding value from the control. For example, in my data, 33 (the length of the first completion run in pi) is coupled with 15 (the length of the first completion run from the control sequence). Reading the table below, it follows that the first million digits of pi contain a completion run of over 130 digits (157, to be exact), and there is no such completion run observed for the control sequence. In fact, there is also a completion run of length 126 (between 120 and 130), and there is no such run observed in the first million ‘truly’ random digits.

Do you feel thrilled? Do we see a glimpse of ‘empirical’ evidence that the decimal expansion of pi is not truly random? Or do we see a manifestation of imperfections in random number generators?


There could be several ways. In the end, the findings of your research will have to be accessible to a non-technical, decision-oriented audience. This matters because many data-mining techniques involve esoteric methodology requiring considerable exposure to mathematics, statistics and computer science. Decision trees come to the rescue: they are easy to understand, and there are freely available implementations. I’m going to cover decision tree construction in R using the rpart package, together with some primitive tree plotting.

The data comes from a three-column dataset: column C is the customer id, column I is the name of an item the customer shopped for, and column L is a loyalty index (one value for every customer, although it repeats across that customer’s rows). The task is to use the rpart package to create a regression tree predicting the loyalty index from the items the customer shopped for. First, we need to transform the dataset into usable form, i.e. cross-tabulate with respect to items, so that every column corresponds to an item and the values are either 0 or 1, indicating whether the customer shopped for that item. The last column is the loyalty index.

```
require(rpart);      # package which will perform the regression tree analysis
require(rpart.plot); # fancy plotting for trees

d = read.table('clipboard', header=T, colClasses="character");
names(d) = c('C', 'I', 'L');
d$L = as.numeric(d$L);

# cross-tabulate customers against items: one row per customer,
# one 0/1 column per item
t = as.data.frame.matrix(table(d$C, d$I));
t$C = rownames(t); # t lacks an explicit customer column, so add one

# making a separate dataset with one line per customer and their
# loyalty index
rr = aggregate(d$L, by=list(d$C), max);
colnames(rr) = c('C', 'L');

tm = merge(t, rr, by=c('C')); # inner join

# most convoluted piece of code: building the formula for the
# regression tree programmatically, necessary because the count of
# items could be very large. The item columns sit between 'C'
# (first) and 'L' (last).
p = paste0("L ~ `", paste0(colnames(tm)[2:(length(colnames(tm)) - 1)], collapse="` + `"), "`");

# since the loyalty index is floating point, have to use the
# "anova" method
tr = rpart(as.formula(p), method="anova", data=tm);

prp(tr); # fancy tree plotting
```


The arules package in R has the apriori algorithm, which allows you to construct implication rules for itemsets. Suppose you have some kind of itemset data where items can show up in various groupings. You’d like to know if there is any co-occurrence between items or sets of items. For example, if many groupings that contain item A also contain item B, we can take that as an implication rule (association rule). So, suppose your data consists of two columns in Excel: one column contains groupings and the other contains items observed in those groupings. Here is an example:

```
Groupings  Items
A          A
A          E
A          H
A          I
A          J
A          K
A          M
A          Q
A          U
A          V
A          X
B          B
B          E
B          F
B          I
B          K
B          L
B          N
B          O
B          Q
B          R
B          V
B          Y
```

So we have two groupings, A and B, and a number of items. For R’s apriori algorithm to work, it must see the data as a set of “transactions”: each transaction corresponds to a grouping above, and each column corresponds to an item from the dataset. Whenever an item is present within a grouping / transaction, we specify 1, else 0 (a so-called binary matrix). In our example, we have:

```
Grouping  A B E F H I J K L M N O Q R U V X Y
A         1 0 1 0 1 1 1 1 0 1 0 0 1 0 1 1 1 0
B         0 1 1 1 0 1 0 1 1 0 1 1 1 1 0 1 0 1
```

Again, the columns are items (that’s why there are so many of them in this example), the rows are groupings (only two, just as in the data above), and the values are counts or binary indicators, e.g. how many of item Y are in grouping B? Now, the question is: given the initial columnar data format, how do we get it to look like the one above? Copy the data in its original form to the clipboard. On the R console, type:

```
require(arules); # provides the transactions class, apriori() and inspect()
d = read.table('clipboard', header=T); # [1]
t = table(d$Groupings, d$Items); # [2]
d = as.data.frame.matrix(t); # [3]
tr = as(as.matrix(d), "transactions"); # [4]
rules = apriori(tr, parameter=list(supp=0.95, conf=0.95, target="rules")); # [5]
inspect(rules); # [6] example output: {I, K, V} => {Q}
```

The instructions above will cross-tabulate the initial input (line [2]), construct the matrix (line [3]), convert it to transactions (an input which the apriori function will understand, line [4]) and generate the rules (line [5]). As you can see, the parameters used are a support (i.e. frequency within the input dataset) of 0.95 or higher and a confidence of 0.95 or higher; I am not specifying the lift parameter. Next time I will discuss the conceptual background behind association rule mining in the context of R.
