20111107

Computational Complexity

Given two correct algorithms that both solve the same problem, how should we choose between them?

One reasonable answer:  choose the one that requires the fewest resources (space and time).  In our world of ever-increasing memory capacity, space is not much of an issue - it matters more if you are writing software for something like a pacemaker, where physical space is a real constraint.  But for our purposes we will focus on the time requirement of an algorithm.

Actual time measurements are almost useless for comparing algorithms due to the "noise" - the computer is almost certainly doing other things (downloading updates, receiving email, etc) at the same time as running our algorithms - these "silent" operations going on in the background will distort any time measurement.

So we make a simplifying assumption: all fundamental operations (+,-,/,*, and, or, not) take the same amount of time, and it is fixed (constant).

Now to determine the time requirement of an algorithm we simply count the number of fundamental operations ... but this depends on the input.  Usually the amount of work the algorithm does increases as the size of the input increases.  Our goal in complexity analysis is to determine a relationship between the size of the input and the number of steps the algorithm uses.

n = integerbox("Enter an integer")   # integerbox is just an input method (e.g. easygui's integerbox)
for i in range(n):
    print(2*i)


The first line creates an integerbox, gets an integer, and stores it in the variable n - let's call that 3 operations.
The loop executes n times.  Each time through, it assigns a value to i, checks to see if that value is still in the required range, multiplies i by 2, then prints the result - let's call that 4 operations.

So the total number of operations = 3 + 4*n.

This gives a very precise relationship between the size of the input and the number of steps - in fact, far too precise.  For one thing, it would take forever to calculate the exact number of steps for a long and complicated algorithm.  For another, the exact number is often hard to determine.  For the first line, I said we should call it 3 steps - but creating an integerbox might take 10 steps, and getting the user's input might take 3 ... if so, the total number of operations would be 14 + 4*n

Furthermore, the number of operations to keep the for loop running properly might be 6*n.  All we can really be sure of is that the number of steps for this bit of code is c + d*n, where c and d are constants and n is the size of the input.

Because we can't really be sure what those constants are, we simplify the time function (i.e. our measure of the relationship between size of input and number of steps) by ignoring all the constant terms, constant multipliers, and anything that looks like it isn't important.  We call this finding the order of the algorithm - note that this has nothing to do with putting parts of the algorithm in order - it is an "order of magnitude" sort of idea.

This may seem like we are throwing away so much information that there can't be anything useful left.  To see why this simplification is justified, consider some possible time functions for algorithms:

T1(n) = 20*n + 5
T2(n) = 4*n + 50
T3(n) = n^2 + n
T4(n) = 2*n^2 + 50

n      T1      T2      T3      T4
5      105     70      30      100
10     205     90      110     250
20     405     130     420     850
40     805     210     1640    3250
80     1605    370     6480    12850
160    3205    690
320    6405    1330
640    12805   2610


You can fill in the rest of the table if you want to, but from what is shown above a definite pattern is emerging.  Let's focus on the growth of each function.

When n doubles, T1  and  T2  both approximately double.  The larger n gets, the more pronounced this becomes.

Also, when n doubles T3 and T4 both increase by about a factor of 4.  Again, this becomes more apparent as n gets larger.

Similarly, when n goes up by a factor of 4 (say from 20 to 80), T1 and T2 go up by a factor of 4, but T3 and T4 go up by a factor of 16.

We can see that in general, if we multiply the size of n by some value k, T1 and T2 will also be multiplied by k, but T3 and T4 will both be (approximately) multiplied by k^2.
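
To see these growth factors concretely, here is a small sketch (the names T1 through T4 just mirror the formulas above) that prints each function's value and the factor it grows by when n doubles:

def T1(n): return 20*n + 5
def T2(n): return 4*n + 50
def T3(n): return n**2 + n
def T4(n): return 2*n**2 + 50

for f in (T1, T2, T3, T4):
    print(f.__name__)
    for n in (5, 10, 20, 40, 80, 160, 320, 640):
        # the ratio f(2n)/f(n) settles near 2 for T1 and T2, and near 4 for T3 and T4
        print("  n =", n, " f(n) =", f(n), " f(2n)/f(n) =", round(f(2*n)/f(n), 2))

Running it also fills in the missing rows of the table above.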
 

As n gets large, the "+5" in T1 and the "+50" in T2 and T4 become irrelevant - so we get rid of them.

As n gets large, the "+n" in T3 also becomes irrelevant - so we get rid of that too.  In fact, when our time function is a polynomial we immediately get rid of everything except the term with the highest exponent.

So T1 simplifies to 20*n
      T2 simplifies to 4*n
      T3 simplifies to n^2
      T4 simplifies to 2*n^2

Well that is certainly an improvement ... but we aren't done yet.  Notice that T1 and T2 have the same growth rate, even though one has a coefficient of 20 and the other has a coefficient of 4.  The same observation can be made for T3 and T4: different coefficients but same growth rate.  Since those coefficients aren't affecting the growth rate, we might as well ignore them too.

Now T1 has been simplified to n
         T2 has been simplified to n
         T3 has been simplified to n^2
         T4 has been simplified to n^2

We have to stop simplifying now - if we get rid of any more information, there really would be nothing left.

Of course we cannot say T1 = n ... but we can say that T1 grows at the same rate as n.  Ditto for T2.  We define a set of functions, called O(n)  - this is the famous "big O" notation - as the set of all functions that grow no faster than n grows.  We write and say "T1 is in O(n)" and "T2 is in O(n)"

Similarly, T3 and T4 both grow at the same rate as n^2.  We write and say "T3 is in O(n^2)" and "T4 is in O(n^2)"



Formal Definition of "Big O"

Let f(n) and g(n) be functions.  If there exist positive constants c and n*
        such that for all n >= n*
                f(n) <= c*g(n)

then we say and write " f(n) is in O(g(n)) "


If we deconstruct this, we see that it is saying exactly what we said informally:

            for all n >= n*    ........ means that we are only interested in what happens when n becomes large

            f(n) <= c*g(n)    ........ means that f(n) grows at a rate no faster than the growth rate of  g(n)
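
For example, to see that T1(n) = 20*n + 5 is in O(n), one possible choice of constants is c = 21 and n* = 5, because 20*n + 5 <= 21*n whenever n >= 5.  A tiny sketch to check that claim numerically (the particular constants and the upper limit of the check are just illustrative choices):

def T1(n): return 20*n + 5

c, n_star = 21, 5                                        # one possible choice; many others work
print(all(T1(n) <= c*n for n in range(n_star, 10001)))   # prints True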


Analysing Programs

For a sequence of statements, the complexity of the sequence is the complexity of the most complex thing in the sequence
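
For example (a hypothetical snippet, just to illustrate the rule): the assignments below are O(1) and the loop is O(n), so the sequence as a whole is O(n).

data = list(range(100))    # stand-in input of size n = 100
total = 0                  # O(1): a single assignment
for x in data:             # O(n): the body runs once per item
    total = total + x
print(total)               # O(1)
# complexity of the sequence: O(1) + O(n) + O(1)  -->  O(n)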

For a loop, the complexity is the complexity of the body of the loop multiplied by the number of times the loop executes
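
For instance, in the sketch below the body of the inner loop is O(1), the inner loop runs n times, and the outer loop repeats all of that n times, giving n * n = O(n^2):

n = 50                        # stand-in input size
count = 0
for i in range(n):            # outer loop: n iterations
    for j in range(n):        # inner loop: n iterations for each outer iteration
        count = count + 1     # O(1) body
print(count)                  # prints n*n, matching the O(n^2) operation count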

For an "if", the complexity is the complexity of whichever of the two alternatives has the higher complexity




Bottom Line

To determine the computational complexity of a piece of code,
        - identify the fundamental step that is executed the most often
        - write down a function that relates the number of times that step executes to the size of the input
        - simplify that function as much as possible by discarding smaller terms and constant coefficients
        - what remains is the "big O" complexity of the code
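
As a worked illustration of these steps (a hypothetical snippet): the most frequently executed step below is the comparison inside the inner loop, it runs n*(n-1)/2 times for a list of size n, and discarding the smaller term and the constant coefficient leaves O(n^2).

# count how many pairs in a list are out of order
data = [5, 3, 8, 1, 9, 2]
n = len(data)
out_of_order = 0
for i in range(n):
    for j in range(i + 1, n):
        if data[i] > data[j]:               # the fundamental step: runs n*(n-1)/2 times
            out_of_order = out_of_order + 1
print(out_of_order)
# n*(n-1)/2 = 0.5*n^2 - 0.5*n  -->  drop the smaller term and the coefficient: O(n^2)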


But How Do We Deal With Recursive Functions ... ?

To Be Continued