ID3: C++ Implementation, Forex (Foreign Exchange) Forecasting, First Experiments (Part 2)

(… continued)

In the last post on this, I introduced the main ideas behind Forex (foreign exchange) currency rate forecasting using the ID3 algorithm and stopped right before diving into the C++ implementation. Essentially, the feature matrix consists of columns (attributes) and rows (examples). Since ID3 operates only on categorical variables, I made all values strings. Each row is a vector of strings (vector<string>), while the collection of rows forms the feature matrix (vector< vector<string> >). Obviously, the number of columns in each row has to be the same. My implementation constructs the set of attributes (class Attribute) and their values (class Value) from the first row of the feature dataset, which must therefore be a header row containing the column names. You start off by creating your example matrix, then instantiate a Dataset object and assign your feature matrix to the ‘records’ member of the dataset as below:

Dataset * d = new Dataset;
d->records = features; // features is the vector< vector<string> > built above
d->init(); // let the Dataset extract attributes and values
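
For completeness, here is one way the ‘features’ matrix itself might be built. This is just an illustrative sketch: the column names and values are made up, not taken from the actual forex dataset (assuming, as in the rest of the post, <vector> and using namespace std):

vector< vector<string> > features;

vector<string> header; // first row: column names
header.push_back("dayOfWeek");
header.push_back("prevDirection");
header.push_back("direction"); // classification column
features.push_back(header);

vector<string> row; // one example row
row.push_back("1");
row.push_back("0");
row.push_back("1");
features.push_back(row);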

Remember that the training data is not used in its entirety: part of it is used to derive high-accuracy classifier predictions, but another part must be held back to verify the statistical credence of those hypotheses. For example, I may split my dataset 50% training, 50% validation. Suppose I discover that when columns 4, 7 and 12 have values of 0, 1, 1 respectively, the classifier variable has a 90% chance of being 1. I’m ecstatic. Next I check my enthusiasm and verify the prediction accuracy on the validation dataset, only to discover that it’s 10%. And finally I check the underlying numbers behind the percentages: 9 out of 10 and 1 out of 10, when the full dataset contains 100k rows? The elation was unwarranted; the discovered combination of factors was a fluke. Back to the drawing board.

I recommend that, prior to calling init(), you randomly shuffle the data rows of your feature matrix:

std::random_shuffle(features.begin() + 1, features.end()); // begin()+1 keeps the header row in place

Then assign the start of the validation dataset, here with a 50%/50% split:

d->validationStart = 1 + (features.size() - 1) / 2;

The expression takes the header row into account: (features.size() - 1) is the number of data rows, and the leading 1 shifts the index past the header.
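
Putting the setup steps together in the order they should actually run (shuffle first, as just noted; <algorithm> is needed for random_shuffle):

std::random_shuffle(features.begin() + 1, features.end()); // shuffle data rows, keep header
Dataset * d = new Dataset;
d->records = features;
d->init(); // extract attributes and values
d->validationStart = 1 + (features.size() - 1) / 2; // 50%/50% split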

The main program ends with these instructions:

vector<Pair> vp; // empty filter: start with the entire dataset
entropyTree(d, NULL, vp); // build the decision tree

outFile << endl << endl << endl;
traverseTree(d, vp, d->root, 0); // write the finished tree to outFile

The ‘Pair’ class has the following definition:

class Pair {
public:
    int attributeIndex;
    string actualValue;
};

I use a collection (vector) of Pair objects throughout my code to specify a filter (in Excel’s parlance) on the dataset. For example, suppose I need to specify a slice of data where attribute 1 has value 10, attribute 2 has value 3 and attribute 7 has value 6. Programmatically, in my C++ implementation, this filter is represented as a vector of pairs which looks like

1, “10”

2, “3”

7, “6”
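
For concreteness, building that filter in code is just a matter of pushing Pair objects onto a vector (using the Pair class defined above):

vector<Pair> filter;
Pair p;
p.attributeIndex = 1; p.actualValue = "10"; filter.push_back(p);
p.attributeIndex = 2; p.actualValue = "3";  filter.push_back(p);
p.attributeIndex = 7; p.actualValue = "6";  filter.push_back(p);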

Before examining a row, the algorithm checks whether the attribute-value combinations on the row satisfy the filter. This is not an efficient way of doing it: I don’t use hash tables or binary search trees to speed things up, and for a good reason, since they demand additional coding and obfuscate the implementation details. For the purposes of playing around with moderately sized datasets, inefficient but readable code works much better. Anyway, before the algorithm starts, the filter should be empty so that it considers the dataset in its entirety.
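
The check itself is a straightforward linear scan. A hypothetical helper (the name satisfiesFilter is mine, not from the post’s source) might look like this:

bool satisfiesFilter(const vector<string> & row, const vector<Pair> & filter)
{
    for (size_t i = 0; i < filter.size(); i++)
        if (row[ filter[i].attributeIndex ] != filter[i].actualValue)
            return false;
    return true; // an empty filter accepts every row
}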

The call that does the main work is

entropyTree(d, NULL, vp);

Here, you’re calling a recursive subroutine on the dataset. The subroutine first computes the overall entropy using the classification column and identifies the attribute which yields the most significant entropy reduction (this attribute becomes attached to the root node of the tree). For every value of this attribute, it calls itself once more (hence recursive), this time with that attribute-value pair added to the filter. The process goes on until it reaches a slice with no variance or there are no more attributes to split on. At the end, you have a full tree with frequency (class Frequency) objects which describe the most frequent value in the slice specified by each tree leaf and the corresponding frequency in the validation dataset.
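
As a reference point, the entropy of a slice is the standard Shannon entropy of the classification column over the rows that pass the current filter. Here is a minimal sketch, not the actual routine from my implementation; it reuses the hypothetical satisfiesFilter helper from above and assumes <map> and <cmath>:

double sliceEntropy(const vector< vector<string> > & records,
                    const vector<Pair> & filter, int classColumn)
{
    map<string, int> counts; // class value -> number of occurrences
    int total = 0;
    for (size_t r = 1; r < records.size(); r++) { // row 0 is the header
        if (!satisfiesFilter(records[r], filter)) continue;
        counts[records[r][classColumn]]++;
        total++;
    }
    double h = 0.0; // H = -sum p * log2(p)
    for (map<string, int>::iterator it = counts.begin(); it != counts.end(); ++it) {
        double p = (double) it->second / total;
        h -= p * log(p) / log(2.0);
    }
    return h;
}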

This is a very programmer-friendly, transparent implementation of ID3 which you can just stick into your C++ program to gain full programmatic control. As mentioned above, it’s inefficient but readable. In the next post, I will discuss the actual results of forecasting forex data using this algorithm.


~ by Monsi.Terdex on March 15, 2013.
