## Singular Value Decomposition: Main Course

This time I will continue the exposition of R’s methods for singular value decomposition. Be warned, though: the reader should have a fairly robust command of R to feel comfortable with these manipulations. While the reading itself is fairly straightforward, I found a strong need to brush up on my R skills throughout this research.

As it is, I found the Internet to be rather short of concise, take-the-bull-by-the-horns explanations of how to carry out singular value decomposition in R from start to finish in order to produce a data analysis deliverable (hopefully a useful one as well). There is plenty of discussion about the underlying theory: you will easily find elegantly rendered slides of linear algebra equations in crafty PowerPoints strewn across the vast expanse of the ‘googleable’ Internet. But unless you are mathematically trained or gifted, you’ll find these materials to be of little use.

There is some progress: take a look at the book by Skillicorn called ‘Understanding Complex Datasets’. It seems more approachable than many traditional books on data mining that I have seen.

I will be using the source dataset referenced below throughout this post. The first thing is to load the source data. I prefer to use the clipboard as much as possible: depending on where you store the file on your drive, there could be problems with the path, so just open the attached file in any plain-text (ASCII) editor, copy its contents to the clipboard and issue the following command:

r = read.table('clipboard', header=TRUE); # I am obviously assuming it's self-explanatory

The data looks like this:

It should be pretty clear that something doesn’t make sense here, but believe me, you will find datasets like this in a real business setting. As in the prior post, the idea is that every entity (in this case a client) has a single attribute that can take a few values. An example may be customers who have been characterized by call center representatives: there could be multiple categories, each customer is allowed eight values at most, but there are many potential values. So, in the case above, each number could stand for a distinct category.

From the relational-model standpoint, there is clearly a problem here: why don’t we have a single column for the client id and a single column for attribute values, and then list them all for every client? In such a dataset, there would be one record per client-category combination. Regardless of whether you’re OK with the data above, R is not, and you’ll have to transform the data before R can do singular value decomposition. Thus, as per the prior preamble post, let’s issue a few commands to transform the dataset above into relational form and explode it into a full crosstab boolean matrix:

require(reshape);

m = melt(r, id=c('Client')); # unpivot: one row per client-value pair

t = table(m$Client, m$value); # crosstab attribute indicator matrix for singular value decomposition (SVD); cells are counts, so they are 0/1 as long as no client repeats a value
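To make the reshaping step easier to picture, here is a minimal sketch on made-up data. The three clients, the column names `V1` to `V3` and the category codes are all hypothetical, and the unpivot is done in base R as a stand-in for `melt()` so that the snippet runs without any extra package. It also forces the counts to 0/1, which is an assumption about what the "boolean" matrix should look like if a client happens to repeat a value.

```r
# Made-up wide data: one row per client, up to three category codes each.
wide <- read.table(text = "
Client V1 V2 V3
A 10 20 30
B 20 30 NA
C 10 NA NA
", header = TRUE)

# Unpivot to one row per client-value pair (base R stand-in for melt()).
value_cols <- setdiff(names(wide), "Client")
long <- data.frame(
  Client = rep(wide$Client, times = length(value_cols)),
  value  = unlist(wide[value_cols], use.names = FALSE)
)
long <- long[!is.na(long$value), ]  # drop the empty slots

# Cross-tabulate clients against values, then turn counts into 0/1 flags.
counts    <- table(long$Client, long$value)
indicator <- 1 * (unclass(counts) > 0)
print(indicator)
```

Each row of `indicator` is a client and each column is a category code, which is the shape the decomposition expects.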

require(svd); # optional: svd() itself ships with base R, the svd package only adds further routines

s = svd(t); # s will contain the list returned by the svd function

Issue the following command:

names(s);

to see which components are available in s: they are d, u and v, which are ordinarily returned and worked on in the process of decomposition. The singular values are in d, which R returns as a plain vector (the diagonal of the D matrix) sorted in decreasing order; u and v hold the left and right singular vectors as matrix columns. We will have to choose a few of the largest singular values in order to produce a compressed version of the source dataset. This will allow extraction of features otherwise unseen in the data (at least that’s the hope). To whet your appetite for dessert, take a look at the attached study, which looked into Facebook’s user-like matrices using singular value decomposition. Generally, the idea was to reduce the number of dimensions from around 50k to just 100 and carry out regression analysis on the resulting data.
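As a rough illustration of what "choosing a few singular values" means in practice, the sketch below decomposes a small made-up matrix and rebuilds it from its two largest components. The matrix `x`, the choice `k = 2` and the names `full`, `approx_k` and `explained` are assumptions for the example only; on the real crosstab you would pick `k` by looking at how quickly the singular values fall off.

```r
# Made-up 4 x 3 indicator matrix standing in for the client-by-category crosstab.
x <- matrix(c(1, 1, 0,
              1, 1, 0,
              0, 1, 1,
              0, 0, 1), nrow = 4, byrow = TRUE)

s <- svd(x)        # base R; returns a list with components d, u and v
print(names(s))

# Using every component reproduces x: x = u %*% diag(d) %*% t(v)
full <- s$u %*% diag(s$d) %*% t(s$v)

# Keep only the k largest singular values (k = 2 is an arbitrary choice here).
k <- 2
approx_k <- s$u[, 1:k, drop = FALSE] %*%
  diag(s$d[1:k], nrow = k) %*%
  t(s$v[, 1:k, drop = FALSE])

# Share of the total squared singular values retained by the first k components.
explained <- sum(s$d[1:k]^2) / sum(s$d^2)
print(round(approx_k, 2))
print(explained)
```

The first `k` columns of `s$u`, scaled by the matching singular values, can then serve as a compressed set of `k` features per client, which is the kind of reduced data the study mentioned above fed into regression.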