Principal Component Analysis :: Step by Step Guide in R

This month I decided to fully dedicate myself to principal component analysis (PCA). As many of you know, Internet is full of tutorials on PCA/SVD, except there is one problem — the moment when a novice has to translate the abstract language of linear algebra into implementation, things seem to go south. One may not be sure about his abilities, there could be computational errors etc. Without a step-by-step guide, it’s difficult to teach yourself PCA.

‘A User’s Guide to Principal Component Analysis’ by Edward Jackson comes to rescue. It has detailed exposition on how to perform PCA, along with intermediate stage results so that learner can ensure they’re on the right track. Invaluable work for a student of such an important technique.

Perhaps I should make a brief note on why PCA is important — it’s one of the most popular methods of multivariate data analysis. Suppose you have a dataset which describes closing price of a selection of stocks by day. You can create a correlation matrix of stocks and perform PCA (since different stocks have different average price, it’s a good idea to use correlation matrix vs covariance). The principal components will show most important dimensions — each dimension will have negative and positive contributions from each stock. Create a scree plot to see how much variance is accounted for by each eigenvector. Select a cutoff point and approximate your daily data using just a few eigenvectors. Then see which of your data points (days, in this example) are outliers, i.e. cannot be explained by the selection of eigenvectors.

I would recommend reading chapter 5 — it requires an audiometry dataset, which I include below. In addition, I include my R code which should enable you to produce y scores, x^ (i.e. original data approximated using selection of PCs), residuals and the scatter plot on page 120.

http://www.amazon.com/A-Users-Guide-Principal-Components/dp/0471471348/ref=sr_1_1?ie=UTF8&qid=1406473604&sr=8-1&keywords=user+guide+to+principal+component+analysis

###############################################
# Author:     Monsi Terdex
# Date:    07/27/2014
#
# Descr:
#
# This is R implementation of PCA
# using audiometry dataset in Chapter 5 in
# User’s Guide to Principal Component Analysis
# by Edward Jackson (arguably the most thorough
# practical introduction to PCA)
#
# Warning: this code is not optimized,
# and is, in fact, deliberately verbose
# and non-efficient to show how to perform
# these computations step by step in R
#
# The results match those reported by Jackson
###############################################

# dSource is the audiometry data object,
# load it using RData file

d = dSource;

for (i in 1:9) {
d[, i] = as.numeric(d[, i]);
}

# tb0 is the ‘final’ data source used below, i.e. you can
# in theory, substitue your own data source into tb0 and the
# downstream logic should work just fine

tb0 = d[,2:9];

# creating a correlation matrix
# This is due to unequal variance amongst the variables
# otherwise may want to use covariance matrix
# Note that logic below IS specific to correlation
# matrix and will not work for covariance matrix

cr = cor(tb0);

# computing eigen vectors and eigen values of the correlation matrix

evv.cr = eigen(cr)$vectors;
evr.cr = eigen(cr)$values;

# computing covariance matrix just in case + the corresponding eigen
# vectors and values

cv = cov(tb0);
evv.cv = eigen(cv)$vectors;
evr.cv = eigen(cv)$values;

# scree plot matches Figure 5.1 on page 112:

plot (evr.cr,pch=15, cex=1, xlab=’Root number’, ylab = ‘Characteristic root’);

# computing D^-1 matrix as defined in the text
# this is diagonal matrix of standard deviations of the
# original data. For D^-1, the reciprocal of each sd is used

stDev = diag(1.0 / apply(tb0, 2, sd));

# computing D matrix
stDevInv = diag(apply(tb0, 2, sd));

# will be adjusting the original eigen vectors of the correlation
# matrix, so saving them separately.

evv.cr.adj = evv.cr;

# taking count of columns in the original dataset tb0
# this is 8 for the audiometry example

nn = ncol(tb0);

# the count of eigen vectors to be used to approximate the original
# data. This count is 4 for the example in the text

nns = 4;

# dividing each eigen vector of the correlation matrix by
# square root of its eigen value
for (i in 1:nn) {
evv.cr.adj[,i] = evv.cr.adj[, i] / sqrt(evr.cr[i] );
}

# doing the same for the inverse case (this is to compute V* matrix
# in the text

evv.cr.adj.inv = evv.cr;

for (i in 1:nn) {
evv.cr.adj.inv[,i] = evv.cr.adj.inv[, i] * sqrt(evr.cr[i] );
}

# erasing the eigenvectors we do not need
# For audiometry example, only first 4 vectors are needed
# the rest are set to 0

evv.cr.adj[, (nns+1):nn] = 0;

# computing the W and V matrices as per text

wr = stDev %*% evv.cr.adj;
vr = stDevInv %*% evv.cr.adj.inv;

# this is adjustment of the original data for mean values
# there is a more elegant way to do this in R using ‘scale’
# function but just want to demonstrate so that it’s very clear

d.adj = tb0;

# going column by column
for (i in 1:nn) {
d.adj[, i] = d.adj[, i] – mean(d.adj[, i] );
}

# y scores computed using W matrix and match those on page 117 (Table 5.13)
# except sign but the author cautioned not to panic since
# this is a result of algorithm employed

yScores = t(wr) %*% t(d.adj);

# xhat will have the same shape as y scores, so just copying
# them here as placeholder

xhat = yScores;

for (i in 1:nrow(tb0) ){
xhat[, i] = ( vr %*% yScores[, i] ) +  as.matrix( apply( tb0, 2, mean) ) ;
}

# transposing the matrix
xhat = t(xhat);

# adjusting for standard deviation: as before, there is
# a more elegant method in R to do this but
# for now, just need to show the mechanics

# just copying original data as place holder
rr1 = tb0;
rr2 = tb0;

# rr1 is the residuals in standard units, i.e. divided by
# standard deviations of original data columns

for (i in 1:nrow(tb0)) {
rr1[i, ] = (tb0[i,] –  xhat[i,])/ as.matrix (apply(tb0, 2, sd ) ) ;
}

# predicated (i.e. xhat) values in standard units
for (i in 1:nrow(tb0)) {
rr2[i, ] = xhat[i, ] / as.matrix (apply(tb0, 2, sd ) ) ;
}

# scatter plot below matches one on page 120 of the text
# Figure 5.2
plot(rr1[, 8] ~ rr2[, 8], cex=1.0, pch=19, xlab= ‘x8^ / sd(x8)’, ylab= ‘(x8 – x8^) /sd(x8)’);

# Q values computation
# residuals are in rr1

q = numeric(nrow(tb0));

for (i in 1:nrow(tb0) ) {
rr5 = as.matrix(rr1[i, ]);
q[i] = rr5 %*% t(rr5);
}

# now computing critical Q, page 36

theta1 = sum(evr.cr[ (nns+1):nn]);
theta2 = sum(evr.cr[ (nns+1):nn]^2);
theta3 = sum(evr.cr[ (nns+1):nn]^3);
h0 = 1 – 2 * theta1  * theta3 / (3 * theta2 * theta2);

cAlpha = qnorm(0.95, mean=0, sd=1);
q.critical = (cAlpha * sqrt(2.0 * theta2 * h0 * h0) / theta1 + theta2 * h0 * (h0 -1 ) / (theta1^2)  + 1 )
q.critical = q.critical ^ (1.0 / h0);
q.critical = q.critical * theta1;

# q.critical is 2.47 matches that on page 116 for the audiometry example

# computing the Hotelling T^2 values using the
# number a predetermined number of eigenvectors to apprximate
# the data (i.e. nns) which is 4 for the audiometry example

tsq = diag( t(yScores[1:nns,  ]) %*% yScores[1:nns, ] )

# computing critical T^2 value — if an observation exceeds
# this critical T^2 value, it’s an outlier
# page 116 of the text indicates the value should be
# 10.23 but the calculation below gives 10.17 — this could be due
# to a data entry error on my part or perhaps the author used data with higher
# precision than what is presented in the text (he seems to indicate that)
# on page 105 (‘for simplicity in display, fewer digits will be presented than
# would normally be required’)

df1 = nns;
df2 = nrow(tb0);

tsq.critical = qf(0.95, df1, df2 – df1) * df1 * (df2 – 1) / ( df2 – df1);

# creating a binary vector of observations which exceed critical T^2 value
tsq.compare = ifelse(tsq > tsq.critical, 1, 0);

# creating final dataset comparable to that on page 117
final = cbind (t(yScores), tsq, tsq.critical = rep(tsq.critical, df2), tsq.compare, Q = q, Q.critical = q.critical );

# selecting outlier observations based on T^2
f = final[ final[, which(colnames(final) == ‘tsq.compare’ )] == 1, ]

# use opinionitis encryptor post for the following

 

Tobk qk i dgmz imbj qr oox xnaw, ujpphrelz tmlo KL4 cjpphrevd (bma awruxwn yzeb) wlsaesdm ‘Grciiwz’. Gox bddt ek znjnmsz xs ‘llo’ ctwljfqgz sjd dvwdmd qv fox xkds.

PCA

 

Audiometry RData (Chapter 5)

Audiometry_01.RData

Advertisements

~ by Monsi.Terdex on August 1, 2014.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Normal Boy

Nothing out of the ordinary

Data Engineering Blog

Compare different philosophies, approaches and tools for Analytics.

%d bloggers like this: