# macopt-a nippy wee optimizer

### What is wrong with the Numerical Recipes Conjugate Gradient optimizers

I used the conjugate gradient code in the Numerical Recipes book for several years to train successful neural networks. But this code is actually very poorly written. It's not just a question of their use of global variables and other ugliness. The optimizers can be very clumsy when applied to real functions.
• Step sizes in line minimization
Bear in mind that the step size is a number that multiplies the gradient vector to give a movement in parameter space. Bear in mind the possibility that the gradient may be `big' (eg proportional to a number of data points, N, with N=1000), whereas the typical desired movement in parameter space may be `small', of order 1, say.
In the NR algorithm they use a general purpose bracketing procedure to do the line minimization. The initial step size is always `1', even though this may be a very inappropriate step size. What is worse, if this initial step size is too big then the routine mnbrak makes the next step equal to `-1.6': NB, this is in the wrong direction! We know that we are expecting the minimum to lie in the positive direction. And it is an even bigger step than the first one, so it is likely to give even worse numerical problems. So this is a waste of computer time, and creates an initial bracketing interval that is far too big and lies 3/5 in the negative direction. Imagine, for the sake of discussion, that the true minimum is at a step size of about 0.000001 on each line search. Then the Numerical Recipes algorithm will waste a lot of time creating silly guesses -- about log_2 1000000 of them, in fact. So about 10-20 unnecessary function evaluations are done on every line search, whereas if the initial step size were 0.001, then only a few function evaluations would be needed to locate the minimum.

In macopt I rectify this in two ways. First the initial step size of any line minimization is inversely proportional to the gradient. And second, the constant of proportionality is kept as a static variable held in a special structure and adapted during the optimization.

• Another criticism of the NR code is that they do not use gradient information in the line search. They mention that it is possible to use gradient information, but their example code does not use it. In fact, with gradient information, it is easier to find the line minimum, because you can bracket the minimum with only two gradient evaluations.

In macopt I go the whole hog and make no use of the function value at all. This makes for a simpler program and means that one can minimize functions whose gradient is easier to calculate than the real thing. (There are examples of such functions in my work.)

• Finally, it is not necessary to locate the line minimum as accurately as linmin does. Their general purpose line minimizer gives huge precision, whereas for practical purposes a rough minimization is adequate.

In macopt I just bracket the minimum and then guess where it is by linear interpolation. (To put it another way, the line search looks for a zero crossing of the inner product of grad(f) and x where x is the line search direction.

### Summary of the macopt optimizer

I have written my own conjugate gradient algorithm that attempts to improve on the NR code. My algorithm is called macopt. It has the following properties:
• The optimized parameters, as in the numerical recipes code, are all contained in one double precision vector, e.g. x[1]..x[N]. (Note the offset of 1.)

• an adaptive step size is used for the initial step of the line search. The initial step is inversely proportional to the gradient, so that the actual movement in parameter space is of a roughly constant size (which is adaptive :-)

• gradients are used in the line search

• once the minimum has been bracketed, the line search terminates immediately, and a new line search commences using interpolation to estimate the location of the minimum and the value of the gradient there. Alternatively an extra gradient computation can be made here to be safe.

• typically, when the routine has adapted its step size, two gradient evaluations per line minimization are performed.

• macopt makes sanity checks to confirm that the line search direction and the gradient are consistent. If they are not then it resets its g and h vectors to the gradient.

• macopt has a structure containing arguments and pointers that it uses to keep track of vectors and gradients. The user can control some of these which define:
• the maximum number of iterations, and the max number of steps in a line minimization.
• the tolerance for the termination condition. The tolerance can be expressed in various ways, one of which is `covariant' (I think that is the right jargon).
• `verbose' can set a diagnostic level from 0 to 4. at level 0 the program is silent except when reporting errors.

• macopt is quite greedy with memory. If your parameter vector has n dimensions, then macopt needs about 8n doubles. i.e. it creates 8 vectors for storing gradients and stuff in. But hey, that's not as bad as variable metrics!
I find that this algorithm sometimes is ten times faster than the NR code. The Xerion group at University of Toronto have also written their own conjugate gradient optimizers, and they have adaptive conjugate gradient optimizers that only need 1.6 gradient evaluations per line search.

### Tar files: | New C version | C++ version - tar file |

| original C version | | older C version |

### Other versions

Matlab and Octave: (Mon Jun 28 2004) Iain Murray has made matlab and octave wrappers for macopt. | matlab wrapper | octave wrapper |
Notes from Iain: 1. For instructions, cd macopt_oct; cat README. The command make will download macopt.tar.gz. 2. Then run octave-2.1.57 and help macoptII. 3. I haven't wrapped the covariant version, and I haven't exposed every option to the end user.

Java: Macopt has been put into JAVA in the Phylogenetic Analysis Library | PAL mirror site | local copy of tar file, pal-1.4.tar.gz |

### Instructions for the New C version

Most of the original instructions, given below, are accurate. The only difference is that I organize my directories differently (for ease of maintaining executables on multiple platforms). When you tar xvf, you will get a directory called newansi. This directory must contain a directory called bin\$ARC, for example if \$ARC is i386, it should be called bini386. This directory should inherit its makefile from newansi/_Makefile thus:
``` newansi/bini386/makefile -> ../_Makefile
```
The above example link is created for you by the tar file. All the .o files and executables are put into bini386. To execute an executable, give the path to it (eg bini386/test_macII) or modify your path to include ./bini386.

Apologies for this minor complication, hope it works OK for you!

### Instructions for original C version

There is a demonstration program called test_mac which uses macopt to minimize a quadratic function. When you get this tar file, uncompress it, tar xvf it, (Note this creates a directory called ansi), and cd ansi, then type
``` make test_macII
```
Hopefully you will get an executable test_macII which, when executed, minimizes the function 1/2 xAx - bx, where A and b are given by the user interactively. [NB, you must give a positive definite symmetric matrix, e.g.
```2 1
1 2        .]
```

The program also makes use of the maccheckgrad function, to check (visually) that the function that claims to return a gradient really does so. In one column, the gradient is given, and in the other the numerical first difference of the objective function.

The anonymous ftp route for the tar file, if you need it, is: ftp www.inference.phy.cam.ac.uk; cd pub/www/mackay/c; binary.

### Messing with the defaults

You may wish to change the following variables.
1. In the makefile:
1. Optimization at compile time. Cut out the -O2 -funroll-loops flags to disable optimization. This will make things run slower but will make them easier to debug with many debuggers.
2. In the macopt_arg structure (default values set by macopt_defaults). [See macopt.h for further information, and see macopt_defaults in macopt.c.]
end_if_small_step
Defines whether the end of the optimization occurs when a small step is made, or when the gradient is small. Some users find things work well if you make the termination condition based on the gradient rather than the step size.
tol
Defines `small' in the above senses. I recommend using end_if_small_step=1, and setting tol using your knowledge of your parameter space. What is a `really small' change in a typical parameter?

Note that tol only defines when the optimization ends. It does not have any effect on what happens during the line searches.

lastx and lastx_default
defines typical distance in parameter space at which the line minimum is expected; both these should be set. the default is consulted if something goes badly wrong and a reset is demanded. It is not essential to give these good values, as the algorithm is adaptive, but the idea is that you should specify a typical expected step size here. Err on the small side if unsure.
itmax
Maximum number of line minimizations performed.
rich
Whether a gradient is evaluated at the beginning of every line minimization. If you can get away with this being 0, then the optimizer runs faster.
verbose
if verbose = 1 then there is one report for each line minimization. if verbose = 2 then there is an additional report for each step of the line minimization. if verbose = 3 then extra debugging routines kick in.

### Error messages

Failure to compile.
If it complains about srand being redefined, don't worry. We don't need srand.
If it complains about the return type of main(), tell it to be less pedantic.
Warning! maclinmin overran- inner product at 0 = 0.09674
This sort of error message is a problem. The inner product at the beginning of a line search should be negative, otherwise the whole method doesn't work at all.

If you get this error message a lot, either there is a bug in your gradient computing software (always make use of the check_grad routine to see if all is well with your gradient) or macopt is being too ambitious and needs to be a bit more conservative: set `rich' to 1.

### Questions from users of macopt:

We are using your macopt package on an application that we would like to release under the GNU Lesser General Pulic License (LGPL). Do you offer a license that will permit us to distribute your code under the LGPL?

Yes, as of Thu 15/8/02, the macopt releases in both C and C++ on my website are distributed under the LGPL; the associated example test files are distributed under the GPL.
Note however that three files in the tar file belong to other people:
rand.h is copyright by Radford Neal. (It is not actually used by macopt, so this doesn't matter.)
nrutil.c and nrutil.h are derived from the Numerical Recipes in C software library. These minor routines are used by macopt to handle memory allocation and deallocation.

David MacKay's work is largely supported by donations. If you find macopt useful, please feel free to make a donation.

is there now any need for mynr.h in the release - nothing in it seems to get used anywhere ?
I didn't initially realise how inherently macopt depends on gradients being easily computed. Is this really that often the case ?
Yes, macopt is built on the assumption that a gradient evaluation costs only about twice as much as a function evaluation. This is the case in all the statistical models I work with (neural nets, etc.)
If in fact function evaluations are relatively cheap and gradients are very expensive, then you might want to change to a different line minimizer. [A friend told me that lnsrch in the new edition of numerical recipes is a good line minimizer.]
However, Barak Pearlmutter points out that Gradients can always be calculated with at worst 5 times more effort, and typically just 2 times (even if there is an iterate-to-fixedpoint in the function evaluation itself, which complicates things a little.) [references] Thus it makes sense to always make a gradient-finding routine.

References concerning optimization and the ease of computing derivatives:
```@PHDTHESIS{Speelpenning1980a,
author = "Bert Speelpenning",
month = "January",
year = 1980,
title = "Compiling Fast Partial Derivatives of Functions Given by Algorithms",
school = "Department of Computer Science, University of Illinois at
Urbana-Champaign",
key = "Speelpenning1980a",
abstract = "This is the author's doctoral thesis. It starts by comparing
previous work in the area of symbolic differentiation of
algorithms. Specifically it considers the work of Warner in 1975,
Joss in 1976 and Kedem in 1977. The conclusions reached in this
discussion are that the work by Joss is the best in terms of
improvement. The author proceeds to describe how Joss' work can
be improved in terms of speed, accuracy and space. A package,
Jake, is described which is a compiler that takes a Fortran 66
input definition of a function. This input is limited in that
only one subroutine can be specified, and that certain Fortran 66
statements are disallowed. Jake is instructed on how to perform
its task by directives within the subroutine. Timing results are
provided on the performance of the code produced by Jake over
those where finite differencing is used.",
keywords = "point algorithm; precompiler; numerical results."
}

@BOOK{Griewank2000a,
author = "Andreas Griewank",
year = 2000,
title = "Evaluating Derivatives: Principles and Techniques of Algorithmic
Differentiation",
series = "Frontiers in Appl. Math.",
number = 19,
publisher = "SIAM",