## ID3: C++ Implementation, Forex (Foreign Exchange) Forecasting, First Experiments (Part 1)

When you hear about ‘data mining’, there is often a speak of classification, which there are many ways to do. One popular and relatively simple algorithm that got me interested is classification tree, and one of the most common versions of it – ID3 (iterative dichotomizer 3) developed by Ross Quinlan. Basically you have a huge number of training examples (rows), huge number of categorical (only) columns and one classifying column (also categorical) and your job is to identify combination of attribute values (slice of data) for which the value of classifying column has some very high proportion (classification accuracy.) Say I have 300 rows and 20 columns, all cells have a value of 0 or 1. Then I find that for columns 4, 7 and 12, for values 0, 1, 1, the classification column is 1 with a 90% accuracy and such accuracy is not observed for any other combination of columns. That is a significant discovery in the data (hmm… statistical significance?).

The question is, of course, where do the categorical variables come from? This is where, dear reader, you’ll have to fantasize (the imagination piece.) But no despair – the ideas for categorical columns are not hard to generate (it’s the good ideas that are hard to find). Let’s focus on the problem of foreign exchange (forex) prediction (forecasting). Given a pair of currencies (e.g. EUR/USD) and the time series which shows how the rate has been changing for some period of time, your task is to predict if we’re about to experience a ‘trend’, i.e. in the next few time moments, the value of the currency exchange rate will change ‘significantly’ in one or the other direction. The question is, what is the time interval and what is to be considered significant rate change? And the answer is that there are no analytic (precise) solutions. Here, science merges with art and becomes imprecise. I say, let’s make each row correspond to a time point. Suppose we take a number of different moving averages at this particular time point, each moving average having a specific fixed period. Then will compare each moving average to every other moving average and make a categorical variable with values -1, 0 and 1 corresponding to direction of change. Thus for n moving averages, we have n (n-1) /2 combinations. So let’s say I have 9 moving averages, then I have 36 categorical variables. For classification variable, will look forward 10 data points and if out of next 10, 8 are points are higher than the current point, will call it an upward trend, symmetrically opposite for downward trend.

All of this is very, very good, but the problem is – how to actually do it? The outline above is comparable to an academic computer science paper or a textbook chapter, where the principal investigator gives a high overview describing everything in terms of neat equations and ideas while graduate students do all the nitty-gritty. I’m sure if you’re reading this, you already have read a few chapters on basics of ID3 (informational entropy, etc) and such – I recommend reading Tom Mitchell’s Machine Learning or other data mining textbook with extensive coverage of ID3 and C4.5. The reason I’m so dismissive of such in-depth theoretical expositions is simple: textbooks lack portable implementations and real life examples, which are (in my opinion) much harder to produce than colorful presentation of core ideas and deliver much higher value to student community. Adding real value to data science is the point of this blog.

Where is the ID3 implementation? For that matter, how to even construct the set of feature examples?

There are commonly available implementations of ID3 (like R, MATLAB) but to do work on a data miner level, you’ll need a programmatic control. I tried to locate a reasonably well done ID3 implementation in either C or C++ and wasn’t able to find it (perhaps you’ll point me to it?) There are Java solutions, but I’m against using them since Java is not as fast as compiled C/C++ program, most of the scientific programs are written in Fortran or C/C++ and the language is simply not as flexible as C. So attached Word document contains code (it’s one cpp file) of my ID3 C++ implementation – you’re free to use/modify. I’m working on updates, so it’s not the last version. I’m using only standard library functionalities (nothing fancy) and am oriented to GNU compiler. Let me know if you find you’re unable to compile this implementation. My goal is for anyone with minimal C++ to take this code and use it, so you should be able to port to any platform with minimal effort. Now, I’m going to describe the class structure and use (to be continued…)

there are a few errors

s said this on August 13, 2013 at 5:23 pm |

So I gather you found a problem with the algorithm – it’s ok, I am not claiming it to be perfect. Care to elaborate what you found? E-mail is ok if you want.

Monsi.Terdex said this on August 15, 2013 at 11:25 pm |