## Genetic Programming (GP) Experiments :: Part I

Long time ago I got fascinated with artificial intelligence technologies. In the 90s, neural networks and genetic algorithms were all the rage. Popular science magazines would sometimes describe, at a very high level, how artificial neural networks were secretly used by trading firms to predict stock price movements (read: more research grants for labs working in neural network paradigm). In one of the excised fragments of Terminator 2, Joe Morton (Mike Dyson) describes Terminator’s processor as based on artificial neural network (sweet dreams). Another panacea of the decade was genetic algorithms – supposedly by picking the best solutions and recombining them in biological fashion, you can ‘breed’ a population of solutions which outperform a randomly generated population.

This latter technique got my attention because output of a neural network is generally scaled to some finite interval (like -1.0 to 1.0) whereas stock price can vary outside of those boundaries and does not have any theoretical limit. One could say that only direction of change is important, so the -1 would mean the price going down and +1 that price goes up, but what if the goal is to predict an actual value (for whatever reason) and not the direction of change?

Genetic algorithms, while providing for arbitrarily scaled output suffer from another drawback :: the length of solution (i.e. count of genes in a chromosome) is fixed. Genetic programming (GP) takes care of that by evolving solutions in the form of syntax trees.

I started experimenting with GP back in June 2010 but quickly dropped the effort because I attempted to create the entire infrastructure (i.e. GP operations such as crossover, mutation, etc) from scratch. Even back then, there were freely available GP libraries. So recently, I decided to resurrect GP exploration efforts, and compiled some sample C++ code using library below. From programmer standpoint, I believe studying, testing and manipulating someone’s code is a much more intellectually demanding activity than designing your own solution (arguably it is also more applicable job-wise as you are much more likely to be given some already existing code to fix as opposed to creating an entirely new solution). But that is water under the bridge.

http://www0.cs.ucl.ac.uk/staff/ucacbbl/ftp/weinbenner/gp.html (the library seems to have been developed by Thomas Weinbrenner, but I can’t find a recent reference).

Naturally, the specific problem of interest that got my attention was symbolic regression — this is for forex data analysis (or any financial time-series prediction). The idea is, GP will receive pairs of date/time and closing price combinations and will evolve approximator function for the price (which will probably not depend on actual date/time). This, of course, is a simplistic approach — one can devise a sophisticated dataset consisting of ‘features’ (like direction of moving average comparisons) and trend indicator — GP will then construct an approximator for trend indicator as function of those features.

There are many fascinating questions that come up as you go through GP experiments. The one thing you quickly learn is that, they are just that — experiments. Pure analytical approach yields very little progress in GP — changing population size or mutation probability may or may not positively impact the fitness of your final solution. Even the definition of fitness itself is something you will have to create. Some can argue this spells doom for all evolutionary computing since evolved solutions cannot, by themselves, be explained. Others would say experimental approach opens up fertile ground for research efforts. Without getting into these questions, GP has been around for some time and is likely to stay and is worth getting dirty with.

So, what was done? The symbolic regression program (symbreg.cc) was modified to include sin(x) function in its function set and to include special print out function for the solution which allows to translate ADF (automatically defined function) into actual string representation. GP had to approximate the following function: sin(x) ^ 4 + sin(x) ^3 + sin(x) ^ 2 + sin(x).

Why was special print out function introduced? If you download the original library referenced above, compile and run symbreg, it will produce data.dat file which contains best fitness individual by generation. The problem is, however, that best fitness individual would be described like below:

Best of generation 20 (Fitness = 0.00182708, Structural Complexity = 25)

GP: ADF0 (sin(x),x%x)

ADF0: x1*x1*x1*x2*(x1+x2)+x1*x1+x2*x1

Here, we’re seeing that root node of the best individual is ADF of two arguments and, on the second line, we have definition of the ADF. This, however, is not very useful, as I would have to take definition of ADF, replace every instance of x1 with sin(x) and x2 with x % x (which, by the way, stands for protected division and not modulo as many of you would think). This obviously can be done easily for small examples like this, but more complicated examples get too involved. Why do we need such representation? If you want to observe actual solution in action, you would most likely want to put it as a complete formula in R or Excel right next to the actual formula to compare the diffences.

Next steps:

– adding terminals with constant values (yes, symreg from the link above does not include that)

– add lag operators which will read time series a specified number of steps backwards. Currently the system understands only the most simplistic definition of the terminal, i.e. x means literally current value of time series, not the value 1 time point away, etc.

– add ability to read multiple time series

– add boolean functions, like ‘and’ or ‘xor’ etc

– add conditionals

Here is the ‘doc’ file which actually ‘zip’ with the full content of the project:

Symbolic Regression_ Genetic Programming_ CPP.zip

To compile and run:

.\g++ -w -c gpc\*.cc

.\g++ -o symbreg.exe *.o

symbreg.exe

Below is good reference to GP paradigm — besides, of course, the seminal work by Koza.