A simple neuron trained as a classifier

(c) David J.C. MacKay

Overview

This is a sequence of demonstrations which I use to explain what a neuron is; how it can be trained as a classifier by steepest descents; how regularization can improve the performance; and how Bayesian methods can be applied to such a supervised neuron, so as to further improve predictive performance.

This demonstration has four parts. These are fully described in chapters 13-14 (approximately) of my draft book "Information Theory, Probability and Neural networks".
The demonstrations were made with octave files found here but you can see the results just using gnuplot.

  1. Learning by steepest descents optimization without regularization.
  2. Learning with Regularization (weight decay).
  3. Learning and making predictions by the Langevin Monte Carlo method.
  4. Making predictions with the Laplace method (Gaussian approximations).

Requirements

This demonstration should work if you are running X-windows on a unix machine which has gnuplot.
If you have got gnuplot on some other system, it may work.
If you have not got gnuplot, I recommend you take a PC and put linux on it.

To run the demonstration:

  1. get the demonstration files (449K) by clicking here http / ftp or by anonymous ftp to wol.ra.phy.cam.ac.uk, cd pub/www/mackay/itprnn/code/neuron, binary, get demo.tar.gz.
  2. unpack thus:
           gunzip demo.tar.gz
           tar xvf demo.tar
    This unpacks a load of files into a directory called neuron.
  3.        cd neuron
  4. modify your X windows defaults so that the gnuplot colours and fonts come out right.
           xrdb -load Xdefaults.gnu
           
  5.        gnuplot
           
  6. To run the first two parts: (Steepest descents optimization)
           load 'DEMO'
           
    During this sequence, hit return when you are ready for the next picture.
  7. To run the third part: (Langevin Monte Carlo method)
           load 'DEMOLANG'
           
    During this sequence, DO NOT hit return once the `30 samples' sequence starts. All 30 samples are shown automatically at one second intervals.
  8. To run the fourth part: (Predictions by Laplace method)
           load 'DEMOLAP'
           
    During this sequence, hit return when you are ready for the next picture.

Commentary


    load 'DEMO'  

  1. Learning by steepest descents optimization without regularization.

    In the first part, a single neuron is trained using a steepest descents learning algorithm to solve a classification problem with 10 data points. The data points are in a two dimensional space (x1,x2) and each has a label t = 0/1. The neuron performs the function
      y(x1,x2;w0,w1,w2) = 1/(1+exp(-(w0 + w1 x1 + w2 x2))).
      
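    The objective function for the learning process is initially set to the error function
      G(w) = - sum_n [ t_n log y_n + (1-t_n) log (1-y_n) ],
    where t_n is the label of the nth data point and y_n is the neuron's output for that point.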
      
    As we descend this error function, the weights evolve and, because the data can be separated perfectly, they end up growing without limit.

    The result is an ever-steepening sigmoid function that perfectly separates the data.

    We might view this outcome as undesirable overfitting. This motivates the use of regularization.
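
    As a rough illustration of part 1, here is a minimal Python sketch (not the octave code used for the demonstration). The ten data points, the step size eta and the step counts are hypothetical choices made only for this example; the toy data are linearly separable, so the weights should keep growing as the descent is run for longer.

```python
import math

# Hypothetical, linearly separable toy data; NOT the 10 points used in the demo.
X = [(-2.0, -1.0), (-1.0, -2.0), (-1.5, -0.5), (-0.5, -1.5), (-2.0, -2.0),
     (2.0, 1.0), (1.0, 2.0), (1.5, 0.5), (0.5, 1.5), (2.0, 2.0)]
T = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def activation(w, x):
    """a = w0 + w1 x1 + w2 x2."""
    return w[0] + w[1] * x[0] + w[2] * x[1]


def neuron(w, x):
    """y = 1/(1+exp(-a)), evaluated in a way that avoids overflow."""
    a = activation(w, x)
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def G(w):
    """Error function; log(1+e^a) - t a equals -[t log y + (1-t) log(1-y)]."""
    total = 0.0
    for x, t in zip(X, T):
        a = activation(w, x)
        total += max(a, 0.0) + math.log1p(math.exp(-abs(a))) - t * a
    return total


def grad_G(w):
    """dG/dw_j = sum_n (y_n - t_n) x_nj, with x_n0 = 1 for the bias w0."""
    g = [0.0, 0.0, 0.0]
    for x, t in zip(X, T):
        err = neuron(w, x) - t
        g[0] += err
        g[1] += err * x[0]
        g[2] += err * x[1]
    return g


def steepest_descent(steps, eta=0.05):
    """Plain steepest descents on G, starting from w = 0."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        g = grad_G(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


# Longer runs give larger weights and a smaller (but never zero) error.
for steps in (100, 1000, 5000):
    w = steepest_descent(steps)
    print(steps, [round(wi, 3) for wi in w], "G =", G(w))
```

    Comparing the printed weights for the three run lengths gives a numerical picture of the ever-steepening sigmoid described above.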


  2. Learning with Regularization (weight decay)

    The objective function is
    M(w) = G(w) + alpha sum_i w_i^2
    
    What has gone before can be viewed as the special case of learning with weight decay rate alpha=0. We now try setting alpha=0.01, and find that this stabilizes the learning: the weights converge to finite values. We can also set alpha to larger values and obtain yet smoother functions that fit the data even less well.
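
    The effect of weight decay can be sketched in the same way. This is a minimal Python illustration, again on hypothetical toy data and with a hypothetical step size eta; it descends M(w) = G(w) + alpha sum_i w_i^2, whose gradient is the gradient of G plus 2 alpha w.

```python
import math

# Hypothetical, linearly separable toy data; NOT the 10 points used in the demo.
X = [(-2.0, -1.0), (-1.0, -2.0), (-1.5, -0.5), (-0.5, -1.5), (-2.0, -2.0),
     (2.0, 1.0), (1.0, 2.0), (1.5, 0.5), (0.5, 1.5), (2.0, 2.0)]
T = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def sigmoid(a):
    """1/(1+exp(-a)), evaluated in a way that avoids overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def grad_M(w, alpha):
    """Gradient of M(w) = G(w) + alpha sum_i w_i^2."""
    g = [2.0 * alpha * wi for wi in w]          # weight decay term
    for (x1, x2), t in zip(X, T):
        err = sigmoid(w[0] + w[1] * x1 + w[2] * x2) - t
        g[0] += err
        g[1] += err * x1
        g[2] += err * x2
    return g


def train(alpha, steps=5000, eta=0.05):
    """Steepest descents on M, starting from w = 0."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        g = grad_M(w, alpha)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


# Larger alpha should give smaller weights, i.e. a smoother function.
for alpha in (0.01, 0.1, 1.0):
    print("alpha =", alpha, [round(wi, 3) for wi in train(alpha)])
```

    With alpha > 0 the weights stop changing once the run is long enough, in contrast to the alpha=0 case.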


    load 'DEMOLANG'

  3. Learning and making predictions by the Langevin Monte Carlo method.

    We now consider the predictions made by the solution found by the optimizer with weight decay. Are they reasonable? Should the predictions at point B be as strong as those at point A?

    To obtain Bayesian predictions we marginalize over the uncertain parameters w. This can be done using a Markov Chain Monte Carlo method. Here we use the Langevin Monte Carlo method, for which octave source code is found in my book, and also in the directory "octave". [Notes on how I used octave to make these demonstrations can be found in "README" but they are not presented neatly.]

    The Langevin method is steepest descent with added noise and occasional Metropolis method rejections.

    The demonstration shows the function performed by 30 samples from this simulation.

    To obtain predictions we average these 30 functions together. The resulting predictions are shown by a contour plot. (Error messages are expected to pop up when the contour plots happen. Please ignore them.) The Monte Carlo predictions are contrasted with the predictions given by the standard "most probable" parameters found by the optimizer.

    I view this demonstration as a compelling argument in favour of using Bayesian methods for neural networks. What other method can give sensible predictions like these, apart from a Bayesian approach which takes into account the uncertainty of the parameters as described by the posterior distribution?
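
    The Langevin procedure of part 3 can be sketched as follows. This is a minimal Python illustration, not the octave code referred to above. It assumes the posterior is proportional to exp(-M(w)) with alpha=0.01, and implements "steepest descent with added noise and occasional Metropolis rejections" as a single leapfrog step with a fresh Gaussian momentum. The toy data, step size eps, burn-in length, thinning interval, seed and query points are all hypothetical choices.

```python
import math
import random

# Hypothetical toy data; NOT the 10 points used in the demo.
X = [(-2.0, -1.0), (-1.0, -2.0), (-1.5, -0.5), (-0.5, -1.5), (-2.0, -2.0),
     (2.0, 1.0), (1.0, 2.0), (1.5, 0.5), (0.5, 1.5), (2.0, 2.0)]
T = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
ALPHA = 0.01  # weight decay rate, as in part 2


def activation(w, x):
    return w[0] + w[1] * x[0] + w[2] * x[1]


def neuron(w, x):
    a = activation(w, x)
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def M(w):
    """M(w) = G(w) + alpha sum_i w_i^2, with G in an overflow-safe form."""
    total = ALPHA * sum(wi * wi for wi in w)
    for x, t in zip(X, T):
        a = activation(w, x)
        total += max(a, 0.0) + math.log1p(math.exp(-abs(a))) - t * a
    return total


def grad_M(w):
    g = [2.0 * ALPHA * wi for wi in w]
    for x, t in zip(X, T):
        err = neuron(w, x) - t
        g[0] += err
        g[1] += err * x[0]
        g[2] += err * x[1]
    return g


def langevin(n_samples=30, burn_in=2000, thin=200, eps=0.2, seed=0):
    """Return (list of weight samples, acceptance rate)."""
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]
    g, E = grad_M(w), M(w)
    samples, accepted = [], 0
    total = burn_in + n_samples * thin
    for step in range(1, total + 1):
        p = [rng.gauss(0.0, 1.0) for _ in w]             # added noise
        H = 0.5 * sum(pi * pi for pi in p) + E
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]
        w_new = [wi + eps * pi for wi, pi in zip(w, p)]  # noisy descent step
        g_new = grad_M(w_new)
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g_new)]
        E_new = M(w_new)
        dH = 0.5 * sum(pi * pi for pi in p) + E_new - H
        if dH < 0 or rng.random() < math.exp(-dH):       # Metropolis test
            w, g, E = w_new, g_new, E_new
            accepted += 1
        if step > burn_in and (step - burn_in) % thin == 0:
            samples.append(list(w))
    return samples, accepted / total


def bayes_predict(samples, x):
    """Average the functions performed by the sampled neurons."""
    return sum(neuron(w, x) for w in samples) / len(samples)


samples, accept_rate = langevin()
print("acceptance rate:", accept_rate)
for x in [(1.0, 1.0), (4.0, -3.0)]:   # hypothetical query points
    print(x, "Monte Carlo prediction:", bayes_predict(samples, x))
```

    Averaging the 30 sampled functions, rather than using one weight vector, is the step that corresponds to marginalizing over w.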


    load 'DEMOLAP'      

  4. Making predictions with the Laplace method (Gaussian approximations).

    An alternative approximate implementation of Bayesian methods for neural networks makes a Gaussian approximation to the posterior distribution. This approximation is represented by a contour plot and contrasted with the Monte Carlo samples. The predictions are qualitatively similar to those found by the Monte Carlo approach.
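
    One way to turn such a Gaussian approximation into predictions is sketched below in Python. This is an illustration under stated assumptions, not necessarily what DEMOLAP does: the most probable weights are found by descending M(w) with alpha=0.01, the Gaussian has inverse covariance A equal to the Hessian of M at that point, and the activation's variance s^2 = x^T A^{-1} x is used to moderate the output through the approximation sigmoid(kappa a), with kappa = 1/sqrt(1 + pi s^2/8). The toy data, step size and query points are hypothetical.

```python
import math

# Hypothetical toy data; NOT the 10 points used in the demo.
X = [(-2.0, -1.0), (-1.0, -2.0), (-1.5, -0.5), (-0.5, -1.5), (-2.0, -2.0),
     (2.0, 1.0), (1.0, 2.0), (1.5, 0.5), (0.5, 1.5), (2.0, 2.0)]
T = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
ALPHA = 0.01  # weight decay rate, as in part 2


def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def features(x):
    """Input vector with a leading 1 for the bias weight w0."""
    return (1.0, x[0], x[1])


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def find_w_mp(steps=5000, eta=0.05):
    """Most probable weights: steepest descents on M(w)."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        g = [2.0 * ALPHA * wi for wi in w]
        for x, t in zip(X, T):
            f = features(x)
            err = sigmoid(dot(w, f)) - t
            for j in range(3):
                g[j] += err * f[j]
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


def hessian(w):
    """A = sum_n y_n (1-y_n) x_n x_n^T + 2 alpha I."""
    A = [[2.0 * ALPHA if i == j else 0.0 for j in range(3)] for i in range(3)]
    for x in X:
        f = features(x)
        y = sigmoid(dot(w, f))
        for i in range(3):
            for j in range(3):
                A[i][j] += y * (1.0 - y) * f[i] * f[j]
    return A


def solve(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    u = [0.0] * n
    for r in reversed(range(n)):
        u[r] = (m[r][n] - sum(m[r][k] * u[k] for k in range(r + 1, n))) / m[r][r]
    return u


def laplace_predict(w_mp, A, x):
    """Return (moderated prediction, most-probable-weights prediction)."""
    f = features(x)
    a_mp = dot(w_mp, f)
    s2 = dot(f, solve(A, f))                      # variance of the activation
    kappa = 1.0 / math.sqrt(1.0 + math.pi * s2 / 8.0)
    return sigmoid(kappa * a_mp), sigmoid(a_mp)


w_mp = find_w_mp()
A = hessian(w_mp)
for x in [(1.0, 1.0), (4.0, -3.0)]:   # hypothetical query points
    print(x, laplace_predict(w_mp, A, x))
```

    Because kappa is never greater than 1, the moderated prediction is never more extreme than the one given by the most probable weights alone.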

David MacKay <mackay@mrao.cam.ac.uk>
Last modified: Fri Aug 29 16:24:05 1997