Puzzle: Analysis of Algorithms: Solution

The explanation for the complexity curve you saw in the last post in not trivial. I’ve granted you that it was a sorting algorithm and that something may be going on with the data. At first, you’d need to think through which kind of a sorting algorithm is in operation here. There are many types and they differ in complexity – some are n^2, some others are n-lg-n, some are even linear. Examining the data in an Excel chart, first thing that strikes is an apparent improvement in performance as the input size exceeds 1650. This, obviously, is against common sense – if your problem size is larger, how can you solve problem faster? Unless some key feature of the data is also a function of data size… And what is data size? Suppose you have 20 strings, each of which is 10 characters long, then total size is 200 bytes. But that is not element count, which is still 20. Typically computer science textbooks refer to sorting problem size in terms of the count of elements.

Another obervation is arrived at by playing around with Excel’s trend fitting mechanisms – seems like polynomial curve of degree 3 is a nice fit for the data. Now, degree 3 is only due to the anomaly noted above, but if you limit the data to the interval from 0 to 1,210, you get a pretty good fit with degree 2. This would suggest the algorithm has quadratic complexity. What is the most famous quadratic algorithm, often presented as first algorithm for undergraduate students? Right, it’s bubble sort…

But what about the plateau and subsequent improvement in performance? The key is to assume that we’re dealing with string comparison and that strings may not be equal in length to each other. String comparison terminates as soon as a non-matching character is found, so even if you have very large strings, if the distribution of characters within them changes with input size – favoring strings where character mismatch occurs relatively early-on in the sequence, then the comparison will terminate rather quickly, thereby creating an artificial improvement in performance. It may also be that the actual length of each string declines as the count of strings increases.

I guess the key take-away of this problem is non-linear thinking. We’re often taught to arrive at one particular, right answer, but in real life, there is typically a multitude of potential answers. The code below generated the time series for this problem. Length of string decreases linearly with problem size, but not fast enough to counteract deterioration in performance in the early stages. The distribution of characters doesn’t change, but could very well have been the culprit.

#include iostream
#include fstream
#include vector
#include ctime
#include stdlib.h
#include math.h


using namespace std;
unsigned int nodeCount;
ofstream outFile("out.txt");
int * generateInversions(int n, int ic){
      int * ar = new int[n];
      for (int i = 0; i n; i++){
      ar[i] = i;
      }
      int i = 0;
      while (i ic){
            int a = rand() % n;
            int b = rand() % n;
            if (a != b && ar[a] < ar[b]){
                  int t = ar[a];
                  ar[a]=ar[b];
                  ar[b]=t;
                  i++;
            }
      }
      return(ar);
}


int comp(int a, int b){
      if (a b) return (-1);
      if (a b) return (1);
      return(0);
}


int compStr(string a, string b){
      for (int i = 0; i a.length(); i++){
            if (a[i] b[i]) return (-1);
            if (a[i] b[i]) return (1);
      }
      return(0);
}


void bubbleSort(string * ar, int n){
      clock_t pc = clock();
      bool b = true;
      for (int j = n-1; j 0 && b; j--){
            b = false;
            for (int i = 0; i j; i++){
                  if (compStr(ar[i],ar[i + 1])0){
                        string t = ar[i];
                        ar[i]=ar[i+1];
                        ar[i+1]=t;
                        b = true;
                  }
            }
      }
      outFile clock() - pc endl;
      outFile.flush();
}
string getRandomStr(int n){
      string s;
      for (int i = 0; i n; i++){
            double r = (double) rand() / (double) RAND_MAX;
            s += (char) ((int) ( r * 26) + 65);
      }
      return (s);
}
void testSort(int n){
      string * a = new string[n];
      int l = 20000 - n * 10 ;
      for (int i = 0; i n; i++){
            a[i]= getRandomStr(l);
      }
      bubbleSort(a, n);
      delete [] a;
}


int main (int argc, char ** argv){
      int n = 2000;
      for (int i = 10; i n; i+=10){
            testSort(i);
            cout '\r' (double) i / (double) n;
      }
      return (0);
}

Advertisements

~ by Monsi.Terdex on June 21, 2013.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Normal Boy

Nothing out of the ordinary

Data Engineering Blog

Compare different philosophies, approaches and tools for Analytics.

%d bloggers like this: