This demonstration has four parts. These are fully described in
chapters 13-14 (approximately) of my draft book
"Information Theory,
Probability and Neural networks".
The demonstrations were made with octave files found here
but you can see the results just using gnuplot.
gunzip demo.tar
tar xvf demo.tar
This unpacks a load of files into a directory called neuron.
cd neuron
xrdb -load Xdefaults.gnu
gnuplot
load 'DEMO'
During this sequence, hit return when you are ready for the
next picture.
load 'DEMOLANG'
During this sequence, DO NOT hit return once the `30 samples'
sequence starts. All 30 samples are shown automatically
at one second intervals.
load 'DEMOLAP'
During this sequence, hit return when you are ready for the
next picture.
load 'DEMO'
y(x1,x2;w0,w1,w2) = 1/(1+exp(-(w0 + w1 x1 + w2 x2))).
The objective function for the learning process is initially set to the error function
G(w) = - sum_n [ t log y + (1-t) log (1-y) ]
We descend this error function and the weights evolve and end up growing without limit.
The result is an ever-steepening sigmoid function that perfectly separates the data.
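The runaway weights can be reproduced outside the demo with a few lines of code. The following is only a minimal sketch in Python, not the octave code used for the demonstration; the toy data set, the step size eta and the step counts are assumptions chosen for illustration.

```python
import math

def sigmoid(a):
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def neuron(x, w):
    """y(x1,x2;w0,w1,w2) as defined above."""
    return sigmoid(w[0] + w[1] * x[0] + w[2] * x[1])

def grad_G(w, X, T):
    """Gradient of G(w); each example contributes (y - t) * (1, x1, x2)."""
    g = [0.0, 0.0, 0.0]
    for x, t in zip(X, T):
        err = neuron(x, w) - t
        g[0] += err
        g[1] += err * x[0]
        g[2] += err * x[1]
    return g

# Hypothetical, linearly separable toy data (not the demo's data set).
X = [(1.0, 2.0), (2.0, 1.0), (1.5, 2.5),
     (-1.0, -2.0), (-2.0, -1.0), (-1.5, -2.5)]
T = [1, 1, 1, 0, 0, 0]

def descend(steps, eta=0.05):
    """Plain gradient descent on G from w = 0; eta is an assumed step size."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        g = grad_G(w, X, T)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

def norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(wi * wi for wi in w))

w_short = descend(1000)
w_long = descend(20000)
# The longer run ends with the larger weight vector: the sigmoid keeps steepening.
print(norm(w_short), norm(w_long))
```

Because the toy data are separable, every extra step lengthens the weight vector a little more, so there is no finite minimum for the descent to settle into.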
We might view this outcome as undesirable overfitting. This motivates the use of regularization.
M(w) = G(w) + alpha sum_i w_i^2
What has gone before can be viewed as the special case of learning with weight decay rate alpha=0. We now try setting alpha=0.01, and find that this stabilizes the learning. The weights converge to a finite value. We can also set alpha to larger values and obtain yet smoother functions that fit the data even less well.
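The stabilizing effect of weight decay can be checked with a small sketch. This is illustrative Python, not the octave code behind the demo; the toy data, the step size eta, the step counts, and the choice to apply the decay to w0 as well as w1 and w2 are all assumptions.

```python
import math

def sigmoid(a):
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

# Hypothetical, linearly separable toy data (not the demo's data set).
X = [(1.0, 2.0), (2.0, 1.0), (1.5, 2.5),
     (-1.0, -2.0), (-2.0, -1.0), (-1.5, -2.5)]
T = [1, 1, 1, 0, 0, 0]

def grad_M(w, alpha):
    """Gradient of M(w) = G(w) + alpha * sum_i w_i^2."""
    g = [2.0 * alpha * wi for wi in w]  # weight decay term
    for x, t in zip(X, T):
        err = sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) - t
        g[0] += err
        g[1] += err * x[0]
        g[2] += err * x[1]
    return g

def descend(alpha, steps, eta=0.05):
    """Gradient descent on M from w = 0; eta is an assumed step size."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        g = grad_M(w, alpha)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

def norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(wi * wi for wi in w))

w_a = descend(0.01, 20000)
w_b = descend(0.01, 40000)       # twice as long, same alpha
w_smooth = descend(1.0, 20000)   # stronger weight decay

# With alpha = 0.01 the two runs agree: the weights have stopped moving.
print(max(abs(a - b) for a, b in zip(w_a, w_b)))
# Larger alpha gives a shorter weight vector, i.e. a smoother function.
print(norm(w_a), norm(w_smooth))
```

Setting alpha to 0 in `descend` recovers the unregularized case, in which doubling the number of steps keeps changing the weights.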
load 'DEMOLANG'
During this sequence, DO NOT hit return once the `30 samples' sequence starts. All 30 samples are shown automatically in sequence.
To obtain Bayesian predictions we marginalize over the uncertain parameters w. This can be done using a Markov Chain Monte Carlo method. Here we use the Langevin Monte Carlo method, for which octave source code is found in my book, and also in the directory "octave". [Notes on how I used octave to make these demonstrations can be found in "README" but they are not presented neatly.]
The Langevin method is steepest descent with added noise and occasional Metropolis method rejections.
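The one-line description above can be made concrete with a short sketch. This is illustrative Python written for this note, not the octave source referred to above; it treats exp(-M(w)) as the unnormalized posterior, and the toy data, alpha, the step size eps, the chain length and the rule of keeping every 100th state are all assumptions.

```python
import math
import random

# Hypothetical toy data and weight decay rate (not the demo's settings).
X = [(1.0, 2.0), (2.0, 1.0), (1.5, 2.5),
     (-1.0, -2.0), (-2.0, -1.0), (-1.5, -2.5)]
T = [1, 1, 1, 0, 0, 0]
ALPHA = 0.01

def activation(x, w):
    return w[0] + w[1] * x[0] + w[2] * x[1]

def M(w):
    """M(w) = G(w) + alpha * sum_i w_i^2, with G the error function."""
    total = ALPHA * sum(wi * wi for wi in w)
    for x, t in zip(X, T):
        a = activation(x, w)
        # -[t log y + (1-t) log(1-y)] = log(1 + exp(a)) - t*a, computed stably
        total += max(a, 0.0) + math.log1p(math.exp(-abs(a))) - t * a
    return total

def grad_M(w):
    g = [2.0 * ALPHA * wi for wi in w]
    for x, t in zip(X, T):
        a = activation(x, w)
        y = 1.0 / (1.0 + math.exp(-a)) if a >= 0 else math.exp(a) / (1.0 + math.exp(a))
        g[0] += y - t
        g[1] += (y - t) * x[0]
        g[2] += (y - t) * x[1]
    return g

def langevin(n_steps, eps, rng):
    """Gradient step plus Gaussian noise, with a Metropolis accept/reject test."""
    w = [0.0, 0.0, 0.0]
    m, g = M(w), grad_M(w)
    chain, accepted = [], 0
    for _ in range(n_steps):
        p = [rng.gauss(0.0, 1.0) for _ in w]              # fresh noise
        h_old = 0.5 * sum(pi * pi for pi in p) + m
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]  # half step downhill
        w_new = [wi + eps * pi for wi, pi in zip(w, p)]    # descent + noise
        g_new = grad_M(w_new)
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g_new)]
        m_new = M(w_new)
        dh = 0.5 * sum(pi * pi for pi in p) + m_new - h_old
        if dh < 0 or rng.random() < math.exp(-dh):         # Metropolis test
            w, m, g = w_new, m_new, g_new
            accepted += 1
        chain.append(list(w))                              # rejected moves repeat w
    return chain, accepted / n_steps

chain, accept_rate = langevin(3000, 0.1, random.Random(0))
samples = chain[99::100]  # keep every 100th state: 30 weight vectors
print(len(samples), accept_rate)
```

Each proposal is w - (eps^2/2) grad M(w) + eps * noise, which is where the "steepest descent with added noise" description comes from; the accept/reject test corrects for the finite step size.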
The demonstration shows the function performed by 30 samples from this simulation.
To obtain predictions we average these 30 functions together. The resulting predictions are shown by a contour plot. (Error messages are expected to pop up when the contour plots happen. Please ignore them.) The Monte Carlo predictions are contrasted with the predictions given by the standard "most probable" parameters found by the optimizer.
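The averaging step itself is simple, and a small sketch may help. The code below is illustrative Python only: the "most probable" weights w_mp and the 30 weight vectors are hypothetical stand-ins (Gaussian perturbations of w_mp) rather than real Langevin samples, and the grid of test points is likewise an assumption.

```python
import math
import random

def sigmoid(a):
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def neuron(x, w):
    """Output of the single neuron for input x = (x1, x2) and weights w."""
    return sigmoid(w[0] + w[1] * x[0] + w[2] * x[1])

def bayes_predict(x, samples):
    """Monte Carlo prediction: the average of the sampled functions at x."""
    return sum(neuron(x, w) for w in samples) / len(samples)

# Hypothetical stand-ins for the optimizer's result and the posterior samples.
rng = random.Random(1)
w_mp = [0.0, 1.0, 1.0]
samples = [[wi + rng.gauss(0.0, 1.0) for wi in w_mp] for _ in range(30)]

# A coarse grid of inputs, as one might use to draw a contour plot.
grid = [(x1, x2) for x1 in (-4.0, 0.0, 4.0) for x2 in (-4.0, 0.0, 4.0)]
for x in grid:
    # columns: input, "most probable" prediction, averaged prediction
    print(x, round(neuron(x, w_mp), 3), round(bayes_predict(x, samples), 3))
```

The single-weight-vector prediction uses one sigmoid, whereas the averaged prediction mixes 30 of them, so its contours need not be straight lines.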
I view this demonstration as a compelling argument in favour of using Bayesian methods for neural networks. What other method can give sensible predictions like these, apart from a Bayesian approach which takes into account the uncertainty of the parameters as described by the posterior distribution?
load 'DEMOLAP'