Singular Value Decomposition: Prelude

As promised, today’s post is about R, specifically the not-so-straightforward data transformations needed to carry out singular value decomposition when the input data arrives in a less-than-friendly layout. Modern dialects of SQL do allow many feats of data acrobatics, but you still get the feeling that a database query language was not originally designed to wrangle data at this level. Let’s get specific.

Suppose you’re given a table where the first column identifies a client and the next 6 columns (2:7) hold up to 6 values (some of which can be blank) of the same attribute. For example, the first column may be the id of a call received by a contact center, where each representative may select no more than 6 tags (categories) that best describe the call. There is no difference between columns 2 and 7 as far as prioritization goes, so to work with this data we have to ‘unpivot’ it. The resulting dataset will contain two columns: call id and category. This essentially normalizes the data (which normal form?), but the actual purpose is to prepare the dataset for the construction of a very wide crosstab matrix, suitable for singular value decomposition.
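To make the target shape concrete, here is a minimal base R sketch of unpivoting done by hand. The data frame `wide`, its column names `tag1` and `tag2`, and the tag values are hypothetical stand-ins (two tag columns instead of six), and blank cells are assumed to be stored as NA.

```r
# Hypothetical toy data: one row per call, two tag columns (the post uses six).
wide <- data.frame(A    = c(101, 102),
                   tag1 = c("billing", "outage"),
                   tag2 = c("refund", NA),
                   stringsAsFactors = FALSE)

# Stack each tag column under the id column, one block per tag column.
tag_cols <- setdiff(names(wide), "A")
long <- do.call(rbind, lapply(tag_cols, function(col) {
  data.frame(A = wide$A, category = wide[[col]], stringsAsFactors = FALSE)
}))

# Drop the blank slots: which column a tag came from carries no meaning.
long <- long[!is.na(long$category), ]
print(long)
```

Each call now appears once per tag it received, which is the two-column layout described above.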

#Open the dataset linked below and copy it to the clipboard. In the R console, import the dataset as follows
#(base R's read.table accepts 'clipboard' as a file name on Windows):

r = read.table('clipboard', header = TRUE);

#You must load the 'reshape' package:

require(reshape);

#You then have to use the 'melt' command of the package to unpivot the dataset.

#The first argument is the data frame and the second argument is a vector containing the names of
#the columns which will serve as the primary key of the resulting dataset

m = melt(r, id = c('A'));
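As an illustration of what the call returns, the sketch below runs melt on a small made-up data frame. The name `calls`, the columns `tag1` to `tag3`, and the values are assumptions for the example only, as is the choice to store blank cells as NA so that `na.rm = TRUE` can drop them.

```r
library(reshape)

# Hypothetical stand-in for the imported dataset: id column A plus three tag columns.
calls <- data.frame(A    = c(101, 102, 103),
                    tag1 = c("billing", "outage", "billing"),
                    tag2 = c("refund", NA, "upgrade"),
                    tag3 = c(NA, NA, "refund"),
                    stringsAsFactors = FALSE)

# id names the key column; every other column is unpivoted.
# na.rm = TRUE discards the empty tag slots instead of keeping NA rows.
m <- melt(calls, id = c("A"), na.rm = TRUE)

# The result has the key column plus 'variable' (source column) and 'value' (the tag).
print(m)
```

If blanks were imported as empty strings rather than NA, they would survive the melt and show up later as a category of their own, so it may be worth converting them first.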

We’re almost there: the only thing left is to crosstab this dataset. Use the ‘table’ command as below:

t = table(m$A, m$value)

As I am sure you infer, m$value is the column where R stored the unpivoted values; the other column, m$variable, contains the names of the columns that have been unpivoted. Take a good look at t: it stores a matrix of indicators showing whether a particular client had a particular tag (strictly speaking, table returns counts, which are 0 or 1 as long as no tag repeats within a row). We have just effectively created a gigantic matrix whose column names are all the tags observed in the entire dataset.
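The crosstab step, and the decomposition it prepares for, can be sketched in base R on a tiny already-unpivoted dataset. The data frame `m` below and its values are made up for illustration, the explicit 0/1 conversion is an added precaution rather than something the post does, and the result is named `tab` here to avoid shadowing R's transpose function `t()`.

```r
# Hypothetical unpivoted data: one row per (call id, tag) pair.
m <- data.frame(A     = c(101, 101, 102, 103, 103, 103),
                value = c("billing", "refund", "outage",
                          "billing", "upgrade", "refund"),
                stringsAsFactors = FALSE)

# Cross-tabulate: rows are call ids, columns are every tag seen in the data.
tab <- table(m$A, m$value)

# Force a plain numeric 0/1 indicator matrix, in case a tag repeats within a call.
mat <- matrix(as.numeric(tab > 0), nrow = nrow(tab), dimnames = dimnames(tab))
print(mat)

# Base R's svd() returns singular values d and singular vectors u and v.
s <- svd(mat)

# Multiplying the factors back together should recover the indicator matrix.
recon <- s$u %*% diag(s$d) %*% t(s$v)
print(s$d)
```

On real data the matrix will be far wider and mostly zeros, so a dense svd() call like this is only meant to show the shape of the pipeline.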

 

unpivot_dataset.txt

 


~ by Monsi.Terdex on July 5, 2013.
