Recursive Partitioning in R: Example of Customer Loyalty Data Mining

Suppose you have data on customers, the items they shop for, and a loyalty index for every customer. Your task is to make sense of the data: can anything meaningful be concluded about customer loyalty from the items customers shop for?

There could be several ways. In the end, the findings of your research will have to be accessible to a non-technical, decision-oriented audience. This matters because many data-mining techniques involve esoteric methodology that requires considerable exposure to mathematics, statistics and computer science. Decision trees come to the rescue: they are easy to understand and there are freely available implementations. I'm going to cover decision tree construction in R using the rpart package, together with some primitive tree plotting.

The data comes from a three-column dataset: column C is the customer id, column I is the name of an item the customer shopped for, and column L is the loyalty index (one value per customer, although it repeats on every row for that customer). The task is to use the rpart package to create a regression tree predicting the loyalty index from the items a customer shopped for. First, we need to transform the dataset into a usable form, i.e. crosstabulate it with respect to items, so that every column corresponds to an item and the values are either 0 or 1, indicating whether the customer shopped for that item. The last column is the loyalty index.
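To make the target layout concrete, here is a minimal sketch of the transformation on a made-up dataset. The customer ids, item names and loyalty values are hypothetical and exist only for illustration; the column names C, I and L follow the description above.

```r
# Hypothetical toy data in the three-column layout (C, I, L)
d <- data.frame(
  C = c("c1", "c1", "c2", "c3", "c3", "c3"),
  I = c("bread", "milk", "milk", "bread", "eggs", "milk"),
  L = c(0.9, 0.9, 0.4, 0.7, 0.7, 0.7),
  stringsAsFactors = FALSE
)

# One row per customer, one column per item
wide <- as.data.frame.matrix(table(d$C, d$I))
wide[wide > 1] <- 1        # repeated purchases still count as 1
wide$C <- rownames(wide)   # customer ids are only row names so far

# One loyalty value per customer (it repeats across that customer's rows)
loyalty <- aggregate(L ~ C, data = d, FUN = max)

# Customer id first, item indicators next, loyalty index last
wide <- merge(wide, loyalty, by = "C")
print(wide)
```

Each customer ends up on a single row, which is the shape a regression tree needs: one observation per customer, one 0/1 predictor per item, and one response.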

require(rpart); # package which will perform regression tree analysis

require(rpart.plot); # fancy plotting for trees

d = read.table('clipboard', header=TRUE, colClasses="character"); # read the copied table from the clipboard (works on Windows)

names(d) = c('C', 'I', 'L');

d$L = as.numeric(d$L);

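t = as.data.frame.matrix(table(d$C, d$I)); # crosstabulate customers (rows) by items (columns)

t[t > 1] = 1; # cap counts at 1 so that every value is a 0/1 indicator

t$C = rownames(t); # the crosstabulation keeps customer ids only as row names; add them as column C for the merge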

rr=aggregate(d$L, by=list(d$C), max); #making a separate dataset with one line per customer

colnames(rr) = c('C', 'L');

tm = merge(t, rr, by=c('C')); #inner join

p = paste0("tm$L ~ tm$'", paste0(colnames(tm)[2:(ncol(tm)-1)], collapse="' + tm$'"), "'"); # most convoluted piece of code; builds the regression tree formula from the item columns (every column except the first, C, and the last, L); necessary because the count of items could be very large

tr=rpart(as.formula(p),method="anova",data=tm); # since loyalty index is floating point, have to use "anova" method

prp(tr); #fancy tree plotting
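As a possible alternative to pasting the formula together by hand, R's formula shorthand `L ~ .` uses every remaining column as a predictor once the customer id is dropped. The sketch below is not the method used above; it runs on simulated data (the item names, sample size and the effect of milk on loyalty are all assumptions made up for the example) so that it is self-contained.

```r
library(rpart)

# Simulated stand-in for the merged table tm: id, 0/1 item columns, loyalty
set.seed(1)
n <- 200
tm <- data.frame(
  C = sprintf("c%03d", 1:n),
  bread = rbinom(n, 1, 0.5),
  milk = rbinom(n, 1, 0.5),
  eggs = rbinom(n, 1, 0.5)
)
# Assumption for the simulation: only milk shifts the loyalty index
tm$L <- 0.3 + 0.4 * tm$milk + rnorm(n, sd = 0.05)

# Drop the id column so that "." expands to the item columns only
fit <- rpart(L ~ ., data = tm[, names(tm) != "C"], method = "anova")
print(fit)
```

With this form the variable labels in the tree are the plain item names rather than expressions starting with `tm$`, and the fitted tree can be passed to `prp` in the same way.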

~ by Monsi.Terdex on December 1, 2013.
