

`An instant classic, covering
everything from Shannon's fundamental theorems to the postmodern theory
of LDPC codes. You'll want two copies of this astonishing book,
one for the office and one for the fireside at home.'
Bob McEliece, California Institute of Technology




New (2009) from David MacKay:
Sustainable Energy  without the hot air
on sale now!

NEW (2012) Videos 
16 lectures by David MacKay

David J.C. MacKay
Information Theory, Inference, and Learning Algorithms



Cambridge University Press, 2003


 ISBN13: 9780521642989  ISBN10: 0521642981
How does it compare with Harry Potter?
for teachers: all the figures available for download
(as well as the whole book).

Information Theory, Inference, and Learning Algorithms
Textbook by David J.C. MacKay
Published by Cambridge University Press, 2003.
[want to choose a nearer web site?

Cambridge, Europe
 Toronto, North America ]
In 2003 this book will be published by CUP.
It will continue to be available from this website
for onscreen viewing.
Want to give feedback on the book, or report typos?
Great, to ask a question about the book please use this FAQ;
to report a typo, mail me. THANK YOU!
Corrections for draft 3.1415
Want to ask a question?
 please click to submit a query to the FAQ about
the book,
or the software.

Information Theory, Inference, and Learning Algorithms
(Hardback, 640 pages, Published September 2003)
Order your copy
Price: £35.00 / $60.00 from CUP UK/USA amazon.co.uk/.com/.ca/.co.jp  Barnes & Noble USA  booksprice.  fetchbook.info  allbookstores  biggerbooks  blackwells  directtextbook  kalahari.net (South Africa)  Special Paperback edition for South Asia.
Download the book too
You can browse and search the book on Google
books.
You may download
The book in one file
(640 pages):
Notes:
 Version 6.0 was released Thu 26/6/03; the book is finished.
You are welcome to view the book onscreen. Version 6.0 was used for the first printing,
published by C.U.P. September 2003.

Version 6.6 was released Mon 22/12/03; it will be used for the second printing,
to be released January 2004.
In this second printing,
a small number of typographical errors were corrected,
and the design of the book was altered slightly.
Pagenumbering generally remains unchanged,
except in chapters 1, 6, and 28,
where
a few paragraphs, figures, and equations have
moved around.
All equation, section, and exercise numbers are unchanged.
 Version 7.0 is the third printing (November 2004).
Its only differences from the 2nd printing are a number of corrections,
and the renaming of the chapter `Correlated Random Variables' to `Dependent Random Variables'.
Copyright issues: The book is copyright (c) Cambridge University Press.
It has been available in bookstores since September 2003. The cover price in 2003 was 30 pounds (UK)
and $50 (USA); in 2006, 35 pounds and $60 (USA).
Now the book is published, these files will remain viewable on
this website. The same copyright rules will apply to the online copy of the
book as apply to normal books. [e.g., copying the whole book onto paper is
not permitted.]
History:
Draft 1.1.1  March 14 1997.
Draft 1.2.1  April 4 1997.
Draft 1.2.3  April 9 1997.
Draft 1.2.4  April 10 1997. Margins altered so as to print better
on Northamerican paper
Draft 1.3.0  December 23 1997.
Draft 1.9.0  Feb 1 1999.
Draft 2.0.0  Jan 6 2000. New page layout.
Draft 2.2.0  Dec 23 2000. Fresh draft.
Draft 3.1415  Jan 12 2003. Nearly finished.
Draft 4.0  April 15 2003. Chapter sequence finalized.
Draft 4.1  April 18 2003. Adding new frontmatter (Preface etc)to book. Corrected printing of letter version.
Version 6.0  Thu 26 June 2003. Final version.
Version 6.6  Mon 22 December 2003. Second printing.
 Here is my method for converting to twoup under linux:
pstops '4:0L@.67(20cm,1cm)+1L@.67(20cm,15cm),3R@.67(1cm,15.25cm)\
+2R@.67(1cm,29.25cm)' $*.ps $*.dps

The Back Cover
Information theory and inference, often taught separately, are here united in one
entertaining textbook. These topics lie at the heart of many exciting areas of
contemporary science and engineering  communication, signal processing, data mining,
machine learning, pattern recognition, computational neuroscience, bioinformatics, and
cryptography.
This textbook introduces theory in tandem with applications. Information theory is
taught alongside practical communication systems, such as arithmetic coding for data
compression and sparsegraph codes for errorcorrection. A toolbox of inference
techniques, including messagepassing algorithms, Monte Carlo methods, and variational
approximations, are developed alongside applications of these tools to clustering,
convolutional codes, independent component analysis, and neural networks.
The final part of the book describes the state of the art in errorcorrecting codes,
including lowdensity paritycheck codes, turbo codes, and digital fountain codes 
the twentyfirst century standards for satellite communications, disk drives, and data
broadcast.
Richly illustrated, filled with worked examples and over 400 exercises, some with
detailed solutions, David MacKay's groundbreaking book is ideal for selflearning and
for undergraduate or graduate courses. Interludes on crosswords, evolution, and sex
provide entertainment along the way.
In sum, this is a textbook on information, communication, and coding for a new
generation of students, and an unparalleled entry point into these subjects for
professionals in areas as diverse as computational biology, financial engineering, and
machine learning.

Endorsements by famous people
This is an extraordinary and important book, generous with insight
and rich with detail in statistics, information theory, and
probabilistic modeling across a wide swathe of standard, creatively
original, and delightfully quirky topics. David MacKay is an
uncompromisingly lucid thinker, from whom students, faculty and
practitioners all can learn.
Zoubin Ghahramani and Peter Dayan
Gatsby Computational Neuroscience Unit, University College London.


An utterly original book that shows the connections between such disparate
fields as information theory and coding, inference, and statistical physics.
Dave Forney, M.I.T.


An instant classic, covering
everything from Shannon's fundamental theorems to the postmodern theory
of LDPC codes. You'll want two copies of this astonishing book,
one for the office and one for the fireside at home.
Bob McEliece, California Institute of Technology



(Shorter versions of the jacket blurb)

Endorsements by famous people
This is an extraordinary and important book, generous with insight
and rich with detail in statistics, information theory, and
probabilistic modeling across a wide swathe of standard, creatively
original, and delightfully quirky topics. David MacKay is an
uncompromisingly lucid thinker, from whom students, faculty and
practitioners all can learn.
Zoubin Ghahramani and Peter Dayan
Gatsby Computational Neuroscience Unit, University College London.


An utterly original book that shows the connections between such disparate
fields as information theory and coding, inference, and statistical physics.
Dave Forney, M.I.T.


An instant classic, covering
everything from Shannon's fundamental theorems to the postmodern theory
of LDPC codes. You'll want two copies of this astonishing book,
one for the office and one for the fireside at home.
Bob McEliece, California Institute of Technology


Endorsements by users
I am involved
with teaching undergraduate information theory at Royal Holloway. We aim to make
this book a reference text in our related courses.
I have found the
book very enjoyable to read and I think that it is far easier for students
to relate to as the maths is explicit and easy to follow, with excellent
examples.
Congratulations on an inspirational
book! This is by far the best book I have read in years!
David Lindsay, Computer Learning Research Centre (www.clrc.rhul.ac.uk)
I am compelled to state categorically that this is one of the finest text
books I have read on these subjects. I have even pilfered some of the material
for use in my classes.
Samir Chettri, UMBC
One of the best technical books ever written,
period. It's a joy to read and students I have shown it to are attracted to it like bees
to honey.
Alpan Raval, Keck Graduate Institute & Claremont Graduate University
Reviews
A quite remarkable work to have come from
a single author ... the three topic areas of its title
actually fall short of indicating its full breadth of coverage.
...
This magnificent piece of work is valuable in introducing a
new integrated viewpoint, and it is clearly an admirable basis
for taught courses, as well as for selfstudy and reference.
Alex Andrew. Robotica 2004.

`... Without compromising mathematical rigor, MacKay manages to reach through the page to the reader, creating the impression that the student is across the desk from a friendly and supportive tutor.' 
Review by H. Van Dyke Parunak

`an academic gold mine ... The book's layout makes it easy for readers to
access whatever they are concerned with.'
 Review by Sylvain Piechowiak


Alternate blurbs
Jacket blurb (215 words)
Information theory and inference, often taught separately, are
here united in one entertaining textbook. These topics lie at the
heart of many exciting areas of contemporary science and engineering
 communication, signal processing, data mining, machine learning,
pattern recognition, computational neuroscience, bioinformatics, and
cryptography.
This textbook introduces theory in tandem with applications.
Information theory is taught alongside practical communication
systems, such as arithmetic coding for data compression and
sparsegraph codes for errorcorrection. A toolbox of inference
techniques, including messagepassing algorithms, Monte Carlo
methods, and variational approximations, are developed alongside
applications of these tools to clustering, convolutional codes,
independent component analysis, and neural networks.
The final part of the book describes the state of the art in
errorcorrecting codes, including lowdensity paritycheck codes,
turbo codes, and digital fountain codes  the twentyfirstcentury
standards for satellite communications, disk drives, and data
broadcast.
Richly illustrated, filled with worked examples and over 400
exercises, some with detailed solutions, the book is ideal for
selflearning and for undergraduate or graduate courses. Interludes
on crosswords, evolution, and sex provide entertainment along the
way.
The result is a textbook on information, communication, and
coding for a new generation of students, and an unparalleled entry
point into these subjects for professionals in areas as diverse as
computational biology, financial engineering, and machine learning.
old jacket blurb
Information theory, probabilistic reasoning, coding theory and
algorithmics lie at the heart of some of the most exciting areas of
contemporary science and engineering. They are integral to such areas
as communication, signal processing, data mining, machine learning,
pattern recognition, computational neuroscience, bioinformatics, and
cryptography. David MacKay breaks new ground in this
exciting and entertaining textbook by introducing mathematical
technology in tandem with applications, providing at once both
motivation and handson guidance for problemsolving and
modelling. For example, he covers not only the theoretical foundations
of information theory, but practical methods for communication
systems, including arithmetic coding for practical data compression,
and lowdensity paritycheck codes for reliable communication over noisy
channels. The duality between communication and machine learning is
illustrated through data modelling and compression; machine learning
is also represented by clustering, classification, feedforward
networks, Hopfield networks, Boltzmann machines and independent
component analysis. A toolbox of probabilistic techniques
are covered in detail: messagepassing, Monte Carlo, and
variational methods. The final part of the book, on
sparse graph codes, describes the stateoftheart in errorcorrecting
codes, including chapters on lowdensity paritycheck codes,
turbo codes, and digital fountain codes. There are over 390 exercises,
some with full solutions, which, together with worked examples, extend
the text and enhance technique and understanding. A profusion of
illustrations enliven and complement the text. Interludes on
crosswords, evolution, and sex provide entertaining glimpses on unconventional
applications. In sum, this is a textbook for courses in information,
communication and coding for a new generation of students, and an
unparalleled entry point to these subjects for professionals working
in areas as diverse as computational biology, data mining, financial
engineering and machine learning.
150 word blurb
Information theory, probability, coding and algorithmics lie at the
heart of some of the most dynamic areas of contemporary science and
engineering. David MacKay breaks new ground in this exciting and
entertaining textbook by introducing mathematical technology in tandem
with applications, providing simultaneously both motivation and
handson guidance for problemsolving and modelling. For example, he
covers the theoretical foundations of information theory, and
practical methods for communication systems. Communication and machine
learning are linked through data modelling and compression. Over 390
exercises, some with full solutions, and nearly 40 worked examples,
extend the text and enhance technique and understanding. Enlivening
and enlightening illustrations abound. In sum, this is a textbook for
courses in information, communication and coding for a new generation
of students, and an unparalleled entry point to these subjects for
professionals working in areas as diverse as computational biology,
data mining, financial engineering and machine learning.
90 word blurb
Information theory, probabilistic reasoning, coding theory and
algorithmics underpin contemporary science and engineering. David
MacKay breaks new ground in this exciting and entertaining textbook by
introducing mathematics in tandem with applications. Over 390
exercises, some with full solutions, extend the text and enhance
technique and understanding. Enlivening and enlightening illustrations
abound. It's ideal for courses in information, communication and
coding for a new generation of students, and an unparalleled entry
point to these subjects for professionals working in areas as diverse
as computational biology, datamining, financial engineering and
machine learning.
50 word blurb
This exciting and entertaining textbook is
ideal for courses in information, communication and coding for a new
generation of students, and an unparalleled entry point to these
subjects for professionals working in areas as diverse as
computational biology, datamining, financial engineering and machine
learning.
Another old short blurb This textbook offers comprehensive
coverage of Shannon's theory of information as well as the theory of
neural networks and probabilistic data modelling. It includes
explanations of Shannon's important source encoding theorem and noisy
channel theorem as well as descriptions of practical data compression
systems. Many examples and exercises make the book ideal for students
to use as a class textbook, or as a resource for researchers who need
to work with neural networks or stateoftheart errorcorrecting
codes.
Yet another old blurb This textbook offers comprehensive
coverage of Shannon's theory of information as well as the theory of
neural networks and probabilistic data modeling. Shannon's source
coding theorem and noisy channel theorem are explained and
proved. Accompanying these theoretical results are descriptions of
practical data compression systems including the Huffman coding
algorithm and the less well known arithmetic coding algorithm. The
treatment of neural networks is approached from two perspectives. On
the one hand, the informationtheoretic capabilities of some neural
network algorithms are examined, and on the other hand, neural
networks are motivated as statistical models. With many examples and
exercises, this book is ideal for students to use as the text for a
course, or as a resource for researchers who need to work with neural
networks or stateoftheart error correcting codes.

Contents 

 1  Introduction to Information Theory 
  2  Probability, Entropy, and Inference 
  3  More about Inference 
Part I  Data Compression 
  4  The Source Coding Theorem 
  5  Symbol Codes 
  6  Stream Codes 
  7  Codes for Integers 
Part II  NoisyChannel Coding 
  8  Dependent Random Variables 
  9  Communication over a Noisy Channel 
  10  The NoisyChannel Coding Theorem 
  11  ErrorCorrecting Codes and Real Channels 
Part III  Further Topics in Information Theory 
  12  Hash Codes: Codes for Efficient Information Retrieval 
  13  Binary Codes 
  14  Very Good Linear Codes Exist 
  15  Further Exercises on Information Theory 
  16  Message Passing 
  17  Communication over Constrained Noiseless Channels 
  18  Crosswords and Codebreaking 
  19  Why have Sex? Information Acquisition and Evolution 
Part IV  Probabilities and Inference 
  20  An Example Inference Task: Clustering 
  21  Exact Inference by Complete Enumeration 
  22  Maximum Likelihood and Clustering 
  23  Useful Probability Distributions 
  24  Exact Marginalization 
  25  Exact Marginalization in Trellises 
  26  Exact Marginalization in Graphs 
  27  Laplace's Method 
  28  Model Comparison and Occam's Razor 
  29  Monte Carlo Methods 
  30  Efficient Monte Carlo Methods 
  31  Ising Models 
  32  Exact Monte Carlo Sampling 
  33  Variational Methods 
  34  Independent Component Analysis and Latent Variable Modelling 
  35  Random Inference Topics 
  36  Decision Theory 
  37  Bayesian Inference and Sampling Theory 
Part V  Neural networks 
  38  Introduction to Neural Networks 
  39  The Single Neuron as a Classifier 
  40  Capacity of a Single Neuron 
  41  Learning as Inference 
  42  Hopfield Networks 
  43  Boltzmann Machines 
  44  Supervised Learning in Multilayer Networks 
  45  Gaussian Processes 
  46  Deconvolution 
Part VI  Sparse Graph Codes 
  47  LowDensity ParityCheck Codes 
  48  Convolutional Codes and Turbo Codes 
  49  RepeatAccumulate Codes 
  50  Digital Fountain Codes 
Part VII  Appendices 
   Notation; Some Physics; Some Mathematics 
Library of Congress
 Dewey number: 003/.54
 Dewey version: 21
 LC Classification: Q360 .M23 2003
 LC Subject headings:
Library of Congress Record
DOI: 10.2277/0521642981
(ISBN13: 9780521642989  ISBN10: 0521642981)
Available in South Asia Edition in Paperback: ISBN 0521670519 (South Asia only)
Available in Chinese translation from Higher Education Press
ISBN={7040196417}

South Asia Edition 0521670519
Cambridge University Press have authorized
a special paperback edition of the
book for sale in South Asia.
The South Asia Edition, published by Foundation Books, has ISBN 0521670519 and is available only in India,
Pakistan, Bangladesh, Nepal, Bhutan and Maldives.
For the convenience of purchasers in these areas, here is a page with links
to the online booksellers in India of whom I am aware.

David J.C. MacKay
Information Theory, Inference, and Learning Algorithms



Cambridge University Press, 2003



South Asia Edition 0521670519
Cambridge University Press have authorized
a special paperback edition of the
book for sale in South Asia.
The South Asia Edition, published by Foundation Books, has ISBN 0521670519 and is available only in India,
Pakistan, Bangladesh, Nepal, Bhutan and Maldives.
Here are URLs for some Indian booksellers.
INR 550.00 from
www.manoharbooks.com.
INR 650.00 from
www.indianbooks.co.in
Manohar Book Service
4753/23 Ansari Road, Daryaganj, New Delhi (Just opposite Oxford University Press)
New Delhi, India (found through this URL) (and this URL)
Manohar Book Service
2/6 Ansari Road, Daryaganj,
New Delhi 110 002
India
Website http://www.manoharbooks.com
Phone number 91 11 23284848
Oscar Publications
302, Nitika TowerII
Azadpur Commercial Complex
Delhi  110 033
India
Phone: 911127676826
Fax: 911127671921


Draft covers for the book by DJCM.

Errors in the book
I would be
grateful to hear about errors in the book.
A list of corrections is provided.
Please send me new corrections by email. Thank you!
If you have a query about the book that you think
other people might also ask, please use one of the following links
to submit your query through metaFAQ; you may find the answer
is already there:

Query categories
Any other query


Software, demonstrations, etc.
Hint: To force download to file, use shiftclick
The tcl programs and other links marked (*) can be
run directly in netscape if you have got the tcl/tk
plugin.
You can also run the tcl programs in the normal way if you have a
linux machine with X windows and tcl and you download the tcl files
then execute them (this is preferable to the netscape approach, I think).
[The top four lines of each tcl file contain flags which change
the program from working with the netscape plugin to working
alone under X.]
The other programs can be used by people on unix systems that
have the relevant software (perl, gnuplot, octave.)
Please select a software category from the sidebar.

Hint: To force download to file, use shiftclick
The tcl programs and other links marked (*) can be
run directly in netscape if you have got the tcl/tk
plugin.
You can also run the tcl programs in the normal way if you have a
linux machine with X windows and tcl and you download the tcl files
then execute them (this is preferable to the netscape approach, I think).
[The top four lines of each tcl file contain flags which change
the program from working with the netscape plugin to working
alone under X.]
The other programs can be used by people on unix systems that
have the relevant software (perl, gnuplot, octave.)

Inference methods
Hint: To force download to file, use shiftclick
The tcl programs and other links marked (*) can be
run directly in netscape if you have got the tcl/tk
plugin.
You can also run the tcl programs in the normal way if you have a
linux machine with X windows and tcl and you download the tcl files
then execute them (this is preferable to the netscape approach, I think).
[The top four lines of each tcl file contain flags which change
the program from working with the netscape plugin to working
alone under X.]
The other programs can be used by people on unix systems that
have the relevant software (perl, gnuplot, octave.)

Online Hamiltonian Monte Carlo Demo
(formerly known as `Hybrid Monte Carlo')
Task: sample from the posterior distribution of a bivariate Gaussian
distribution given ten data points.
The lefthand image shows the details underlying a sequence of
HMC transitions. The righthand image shows 50 of the successive states
produced by the Markov chain  the endpoints of the trajectories.
fit.tar tar file giving octave source code.
There is also an introductory
octave DEMO
that explains Hamiltonian Monte Carlo
and contrasts it with the Metropolis method,
using the same "stronglycorrelated Gaussian"
example as is used in the
Adler Overrelaxation demo.

Hint: To force download to file, use shiftclick
The tcl programs and other links marked (*) can be
run directly in netscape if you have got the tcl/tk
plugin.
You can also run the tcl programs in the normal way if you have a
linux machine with X windows and tcl and you download the tcl files
then execute them (this is preferable to the netscape approach, I think).
[The top four lines of each tcl file contain flags which change
the program from working with the netscape plugin to working
alone under X.]
The other programs can be used by people on unix systems that
have the relevant software (perl, gnuplot, octave.)

Hint: To force download to file, use shiftclick
The tcl programs and other links marked (*) can be
run directly in netscape if you have got the tcl/tk
plugin.
You can also run the tcl programs in the normal way if you have a
linux machine with X windows and tcl and you download the tcl files
then execute them (this is preferable to the netscape approach, I think).
[The top four lines of each tcl file contain flags which change
the program from working with the netscape plugin to working
alone under X.]
The other programs can be used by people on unix systems that
have the relevant software (perl, gnuplot, octave.)

Sparse Graph Codes

Other files
 Images for Chromatic abberation demo:
 1
 2

Links to other people's software
Haskell (Naive) Bayes library by Frederik Eaton

Dedication
My textbook is dedicated to the
Campaign Against the Arms Trade
My other book
is dedicated "to those who will not have the benefit of two
billion years' accumulated energy reserves".

Extra materials
Additional materials associated with
the book.

Feel free to make links to this website!
To make a link like this: 

use this html:
<table cellpadding=2 cellspacing=2 bgcolor="#47aeba"><tr><td>
<table bgcolor="#ffffee" cellpadding=2><tr><td>
<a href="http://www.inference.phy.cam.ac.uk/mackay/itila/"><img
border=0 align=right
src="http://www.inference.phy.cam.ac.uk/mackay/itila/images/Sept2003Cover25.gif"
alt="Information Theory, Inference, and Learning Algorithms"
width="137" height="180"></a>
</td><td width=180><small><a
href="http://www.inference.phy.cam.ac.uk/mackay/itila/">Information Theory,
Inference, and Learning Algorithms</a><br>
by David MacKay</small>
</td></tr></table>
</td></tr></table>


All the figures
For the use of teachers,
all the figures are provided, one figure per page,
in one 343page file (postscript)  pdf.
You can also select individual figures (in encapsulated postscript files)
from this directory.
FAQ

How do I take one page of your postscript file
and get the bounding box of the figure to be correct?

I have tried to do this for you
in this directory, using the two methods described below.
Alternatively 
Use this perl script by Dick Furnstahl:
bbox_add.pl
 bbox_add.pl
(which google found
here).
You can see example output of this here (applied to figures 7 and 93):
example7.ps
 example93.ps.
How it works
It invokes gs sDEVICE=bbox dNOPAUSE dBATCH file.ps
to print the bounding box information for file.ps (and then exit gs).
The perl script will automatically find the bounding box and
correct the postscript file (or add the info if there is none).
Be sure to make it executable:
chmod +x bbox_add.pl

An alternative answer is
get DJCM to make the page using dvips thus:
dvips E p 7 l 7 ../figures.dvi o ex7.eps
 is this eps output any better for you?
 How do I convert PostScript to EPS?

Use pstoepsi or ps2epsi (this didn't actually work for me).
Or ps2eps
 ps2eps.
DESCRIPTION
ps2epsi uses gs(1) to process a PostScript(tm) file and generate as output a
new file which conforms to Adobe's Encapsulated PostScript Interchange
(EPSI) format. EPSI is a special form of encapsulated PostScript (EPS)
which adds to the beginning of the file in the form of PostScript comments a
bitmapped version of the final displayed page. Programs which understand
EPSI (usually word processors or DTP programs) can use this bitmap to give a
preview version on screen of the PostScript. The displayed quality is often
not very good (e.g., low resolution, no colours), but the final printed version
uses the real PostScript, and thus has the normal PostScript quality.
USAGE
On Unix systems invoke ps2epsi like this:
ps2epsi infile.ps [ outfile.epsi ]


Extra exercises
Information theory
 Solve the weighing problem for N coins of
which one is odd, and known to be lighter.
You may find it interesting to minimize the
average number of weighings required,
(rather than the maximum number) for the
cases N=6 and N=7.
[D. G. Mead, The average number of weighings to locate
a counterfeit coin,
IEEE Trans IT Sept 1979, vol 25, no 5, 616617]
Efficient HIV tests.
$N$ individuals give blood samples
for HIV testing. The prior probability that individual n is HIV
positive is p_n. For simplicity assume all p_n are 1/100
and consider N=700 as an example.
HIV tests are expensive. Instead of testing each of the N samples
separately, can you save tests by mixing subsamples together and testing
them simultaneously? Assume that the result of a merged
test is the simple 'or' of the binary variables x_n that
state whether the individuals are HIV positive.
The task of sorting N objects by comparing them two at a time
is one of the cornerstones
of computer science algorithms.
Finding the correct order from the N! possible orders
requires log_{2}(N!) bits of information. So any method
based on comparing two objects must require log_{2}(N!) or more
comparisons.
There are many sorting algorithms whose expected complexity
(assuming randomly ordered input) is of order N log N
(for example, Heapsort and Quicksort are famous). But in practice, when sorting
say 3000 objects, Heapsort uses more than twice as many comparisons as Quicksort!
And Quicksort is far from perfect: on average it uses 1.39 times as many comparisons
as the ideal number [log_{2}(N!)].
Use information theory to criticise Heapsort and Quicksort.
Design a sorting algorithm that is better in the sense that
it uses fewer comparisons, on average.
Solution by David MacKay
How many binary comparisons are required to sort 5 objects?
[Hint: 5! = 120 < 128]
Can you find a method that guarantees to sort them in 7 comparisons?
Mutual information.
Consider a channel with input $x$ and output $y$.
Let Y = some integer greater than 5.
Let X = Y choose 2.
Let the marginal distribution of X be uniform.
Let the transition matrix from X to Y be a weighttwo matrix like this:
(with every 1 entry replaced by 0.5)
1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0
0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 1 1 0 0 0
0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 0
0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 1
[rows are labelled by y, cols by x.]
Now, assume that one input $x$ is selected randomly and
transmitted twice over the channel.
Let $y_1$ and $y_2$ be the two outputs.
Find the mutual informations $I(X;Y_1)$
and $I(Y_1;Y_2)$.
Solution:
I(Y1:Y2) = logY  ( 1 + 0.5 log(Y1) )
I(X:Y1)  H(Y1X) = logY  2

A correlation puzzle [Hard]

Investigate the two codes used on CDs, and quantify how far each one is
from the capacity of an appropriate channel model.
[CDs use an errorcorrecting code and a runlength limiting
code. The ECC is composed of three interleaved reedsolomon codes;
typical parameters in bytes are (N,K)=(28,24) and (32,28).
The inner runlengthlimiting code is known as "EFM", which stands for
eighttofourteen modulation; this code converts every eight bits into
a string of seventeen bits (yes, seventeen, not fourteen!)
in such a way that the smallest run length is 3 bits, the average run length
is 7 bits, and the maximum runlength is 11 bits.]
 Entropy inequalities
The order$r$ {R\'enyi} entropy is
H^{(r)}(p) = (1/(1r)) log ( Sum_i p_i^r ) , for r ¬eq; 1.
Using Jensen's inequality, prove that, for any $r$ this function of $p$ is a bound on the
Shannon entropy of $p$.
[Hint: I found the case $r=2$ easiest to start with.]
 divergences and distances
Show, by constructing example probability distributions a, b, c,
that the relative entropy D_KL does not satisfy the
triangle inequality that 'distances' satisfy.
(D(a,c) <= D(a,b) + D(b,c).)
What about the symmetrized relative entropy,
D_SRE(a,b) = 0.5(D_KL(a,b) + D_KL(b,a))?
Prove that D_SRE(a,b) is a distance, or prove, by constructing example probability distributions a, b, c,
that the D_SRE does not satisfy the
triangle inequality that 'distances' satisfy.
solution]
 Compression
One simple way to encode a binary file of length N=10,000 in which a
fraction f=0.01 of the bits are 1s is to send the 100 or so
positions of the 1s using 14bit integers.
This code has redundancy because there are many ways (roughly 100! = 100x99x98x...x1)
of encoding one file, since the positions can be conveyed in
any order.
In the bits back coding method, this redundancy is
exploited by piggybacking other data onto the transmission.
In this case, another source of bits can be
used to choose from the 100! possible orders of the bits.
The receiver can then recover the extra bits once the
message has been received.
Implement this bitsback method.
Can you find a way to implement bitsback coding
such that the `extra bits' are actually
the original coded positions? (A bit like a snake swallowing itself!)

Coding for a redundant language.
[ABC] > 2
[DEF] > 3
[GHI] > 4
[JKL] > 5
[MNO] > 6
[PQRS] > 7
[TUV] > 8
[WXYZ] > 9
 The standard mobile phone encoder

When sending text on mobile phones, some people use the iTap/T9/predictive text
system, in which, for example, HELLO is written as 43556,
and, since no other word in the dictionary is encoded as 43556,
the phone deduces that you wanted HELLO.
However, some words collide under this T9 encoding, for example
KISS and LIPS are both 5477.
Investigate how much better the T9 system would work if
a different encoding of the alphabet AZ onto the numbers 2..9
were used.

Capacity of a simple deletion channel.
A channel has 8 input bits and the output is always exactly 7 bits.
One of the 8 bits is deleted, at random. The other 7 are
transmitted without error. (Note the difference from
an erasure channel, which replaces the erased bits by "?".)
Estimate the capacity of this channel.
Is the optimal input distribution uniform?
Find the highest rate code that you can, that communicates reliably over this channel.

Distance properties of codes, related to erasure channels.
Imagine a block code with symbols drawn from an alphabet A has the
parameters (N, K, d), meaning block length N, number of
codewords (2**K), and minimum distance d.
Here, 'distance' means 'the number of symbols in which
two words differ from each other'.
The erasure channel replaces entire symbols by an erasure symbol.
Prove the following theorem:
1. An (N,K,d) code can correct any (d1) erasures in a block.
2. But there exist some erasures of d symbols that the code cannot
correct.
Solution:
1. Imagine that E <= d1 erasures have occurred.
The receiver knows that the true codeword "t" is in the set of
(A**E) candidate strings that can be created by
replacing all erasures in all possible ways.
The question is, could erasure correction fail? It can only
fail if there exists (at least) one other codeword
in that set. We proceed by contradiction.
Assume that there is another such rival codeword, u.
u and t are identical in (NE) positions, and can differ in at most
E positions. But if E <= d1 that would contradict the
given distance property of the code. Hence there cannot
be a rival codeword u.
2. Take two words t, u, that differ in d places. Transmit
t, and Erase those d symbols.
The receiver cannot know which of t or u was transmitted. QED
Advanced Information Theory
 Channel with frozen bits [5].
Imagine a memory device having the following
property:
of the N bits, a fraction $s$
are frozen to particular values, randomly chosen from {0,1}.
It's not known in advance which bits will be stuck.
The encoder can find out which bits are stuck at encoding time.
The decoder can only read out the values of the N bits.
To make life more difficult, we might assume that the readout process
is noisy, with a flip probability $f$ for all bits.
What is the capacity of this channel?
How should the encoder set the nonfrozen bits in order to achieve
reliable communication of $K$ source bits?
Can you make a practical encoding and decoding algorithm?
Inference
 Cumulative distributions (inference of)
You've got a sample $\{ x_n \}_{n=1}^N$ of points
assumed to come from a distribution, and you
want to show what you believe the cumulative distribution looks like,
given these points. We want not only a guess but also error bars.
Perhaps you want to compare two cumulative distributions
with each other and quantify whether they seem to be `significantly different from each other'.
How do you display a guess for the cumulative distribution function, with error
bars?
Solution (one page postscript)
Solution (one page pdf)
 Message passing
Imagine you want to help out a dictation school, where students are
given dictation, then the teacher has to count how many errors each
student made in what they wrote.
Write an inference algorithm that aligns
one piece of text with another, assuming that
the second one differs from the first by random
insertions, deletions and substitutions.
Assume a small probability of insertion per character, p_i,
a similar probability of deletion, p_d, and a substitution probability p_s.
Find both the most probable single alignment (using the minsum algorithm, aka Viterbi),
and the probability distribution of the alignment at any
particular character in the noisy sequence (using the sumproduct algorithm).
Can you make a program that, as the second user writes his text,
computes on the fly where the second user has probably got up to?
Remember, the aim is for efficient computation, so try to make a method whose cost is
at most proportional to N per iteration, where N is the length of the first piece of text.
 Message passing II
Imagine that N coins, with biases p_{n}, are tossed, once each;
they are independent but have different biases. The outcome is a binary vector x.
The outcomes (1/0) are summed, and the sum is c.
What is the probability, given this sum, that coin n had outcome 1?
Use a messagepassing method to answer this question, and to generate
efficiently a sample from the posterior distribution of x.
[Hint: one option is to make a trellis in which the vertical coordinate
is the partial sum after the nth outcome is added to the total;
the horizontal coordinate runs over the n coins. Run a forward pass
to find the probability of each outcome conceivable outcome c;
run a backward pass from the particular value of c
that occurred; then (just like the pathcounting example)
use the product of the forward and backward messages
to pick a path through the trellis from the posterior distribution.]
 Metropolis method
Assume the data set {x} = { 1.1, 0.5, 1.8, 2.1, 0.3 }
come independently from an exponential distribution
P(xlamba) = exp(x/lamba)/Z(lambda).
Use a Metropolis method to draw samples from the posterior
distribution of lambda.
(Pick an appropriate prior on lambda.)
 Reversibility
Show, by constructing a simple example, that
Gibbs sampling (in which the K variables are updated
in a fixed sequence)
is not reversible,
even though it is composed of a sequence of
K transition operators that are reversible.
 Change point inference.
Make some data that are a noisy version of
a simple function with a discontinuity.
eg, y = (t < t0) ? y0 : y1
Add Gaussian noise to y to get N data points {d}.
Now imagine that you do not know the time t0 of the
discontinuity. Infer it.

Hidden Markov Model.
Implement an algorithm that, given parameters pi, A, B of a hidden Markov model,
and given a data sequence x, will

find the posterior probability of each state at each time [using sumproduct algorithm]

find the most probable statesequence [maxproduct algorithm, aka minsum, aka Viterbi]

draw a sample from the posterior distribution
of the hidden sequence [using information from the sumproduct algorithm].
Construct a simple example that illustrates why the sequence found by the
minsum (Viterbi) algorithm may be a poor way of representing the posterior distribution.

Life in high dimensions.
Consider a hypersphere of radius 1 and a hypercube of radius r (i.e.,
width of cube is 2r, and diameter of sphere is 2), both in K dimensions.
Set K to 256.
Use simple Monte Carlo methods to find
 The fraction of the hypersphere that is enclosed inside the hypercube as
a function of r, for r from 0 to 1. How small can r go, and still give 95%
of the hypersphere inside the cube?

The fraction of the hypercube that is enclosed in the hypersphere as
a function of r, for r from 0 to 1. When the cube contains 95% of the hypersphere,
what fraction of the cube is inside the hypersphere?
(Exercise by John Platt; for further reading see his 2004 paper on bitindexing.)
Decision theory
 In financial mathematics, it is often assumed that
a stock price will vary in accordance with
a random walk with a drift parameter mu
and a volatility parameter sigma.
d log S = mu dt + sigma dz
where dz is a Wiener process with mean zero and variance dt.
An option is a financial instrument giving the owner
the right (but not the obligation) to buy or sell something.
For example, a gambler interested in owning
a stock in case its value goes up a lot might buy an option to
buy the stock at some future time $T$ for a fixed price $K$
(known as the strike price).
If the price indeed is above $K$ at the expiration time,
the gambler can
exercise the option and turn an instant profit of
$S_T  K$ by immediately selling the stock.
The gambler has to pay an appropriate price for this option.
What should the price of an option be?
A `European' option can be exercised only at its expiration time $T$;
an `American' option may also be exercised at any time
prior to $T$. Chop time up into $N$ steps of duration
$T/N$ and use a discretetime messagepassing algorithm
to work out the correct price for an option of each type,
as a function of the strike price $K$ and the other parameters.
Assume the gamblers who price options wish to maximize their
expected return (not the log of their return) and for simplicity
assume that the interest rate for cash is zero.
(a) Model the variation in the log of the stock
price by a random walk with
discrete steps up or down by $log f$, with
probabilities $p_{+}$ and $p_{}$ respectively.
[For thoroughness, if you want, show
that $log f$ and $p_{+} 1/2 + epsilon$ are given in terms
of the other parameters and $alpha = mu*mu*T/(N*sigma^2)$
by
epsilon = 0.5 / sqrt( 1.0 + 1.0/alpha )
log f = sigma^2/mu * alpha / (2.0 * epsilon) ]
Work out the expected value of the European option
by propagating backward from the leaves of the
tree of possible prices.
(b) For the American option, your messagepassing algorithm
should keep track, for every node in the tree, of
1) the payoff if the option is exercised at that time;
2) the expected payoff if the option is not exercised
at that time, and at subsequent times the optimal decisions
are made.
3) the value of the option (which is the max of those two quantities).

St Petersburg Envelope paradox
You are
presented with two envelopes A and B,
and told that one contains twice as much money as the other.
You are
given envelope A and allowed to see how much is in it,
and offered the options of keeping envelope A or switching to B. What should you do?
Amos Storkey's statement of the
question explains why this is called a paradox:
You are taking part in a game show. The host introduces you to two envelopes. He explains carefully that you will get to choose one of the envelopes, and keep the money that it contains. He makes sure you understand that each envelope contains a cheque for a different sum of money, and that in fact, one contains twice as much as the other. The only problem is that you don't know which is which.
The host offers both envelopes to you, and you may choose which one you want. There is no way of knowing which has the larger sum in, and so you pick an envelope at random (equiprobably). The host asks you to open the envelope. Nervously you reveal the contents to contain a cheque for 40,000 pounds.
The host then says you have a chance to change your mind. You may choose the other envelope if you would rather. You are an astute person, and so do a quick sum. There are two envelopes, and either could contain the larger amount. As you chose the envelope entirely at random, there is a probability of 0.5 that the larger check is the one you opened. Hence there is a probability 0.5 that the other is larger. Aha, you say. You need to calculate the expected gain due to swapping. Well the other envelope contains either 20,000 pounds or 80,000 pounds equiprobably. Hence the expected gain is 0.5x20000+0.5x8000040000, ie the expected amount in the other envelope minus what you already have. The expected gain is therefore 10,000 pounds. So you swap.
Does that seem reasonable? Well maybe it does. If so consider this. It doesn't matter what the money is, the outcome is the same if you follow the same line of reasoning. Suppose you opened the envelope and found N pounds in the envelope, then you would calculate your expected gain from swapping to be 0.5(N/2)+0.5(2N)N = N/4, and as this is greater than zero, you would swap.
But if it doesn't matter what N actually is, then you don't actually need to open the envelope at all. Whatever is in the envelope you would choose to swap. But if you don't open the envelope then it is no different from choosing the other envelope in the first place. Having swapped envelopes you can do the same calculation again and again, swapping envelopes back and forward adinfinitum. And that is absurd.
That is the paradox. A simple mathematical puzzle. The question is: What is wrong? Where does the fallacy lie, and what is the problem?
Amos's solution.
Another solution,
Further discussion.

Using information content to get insight into algorithm complexity.
Consider an algorithm that sorts a bunch of objects by calling
a subroutine that compares two of them for you, returning
either "+" or "". For simplicity, assume there are no ties..
How many times must an algorithm call the binary comparison
operation to be able to guarantee that it will correctly
sort the list?
[Hint: Think of the initial state as being a permutation,
which we are conveying to someone. How much information is
there in a permutation?]

Confidencebased assessment .
Consider testing students using multiplechoice questions, or
questions with yes/no answers.
In a confidencebased assessment, a students rating of
his confidence in an answer is taken into account in the
marking scheme, so as to reward students for accurately
acknowledging their uncertainty.
In the LAPT schem, for
example, you earn 1, 2, or 3 marks for a correct answer
depending on whether you gave your confidence level as `1',
`2', or `3'; incorrect answers receive 0, 2, and 6 marks
respectively.
Comment on this scheme. How should a student
respond in terms of this `confidence' value, if he thinks there
is a probability p that his answer is correct?
How would you design a confidencebased assessment system?
Deconvolution
(this question also belongs in the chapter about labelling dice with nonstandard
values, in the Data Compression / Probability area.)

Is it possible to label the faces of two sixsided dice in such
a way that the probability distribution of the sum is identical
to the distribution of the sum that holds for two regular dice labelled 1..6?
(Alan Beardon, 2005).
Hint: this is one of the rare puzzles for which probability generating functions
are useful.
Hint2: are there any solutions apart from the trivial solutions
in which for example one die is labelled 0..5 and the other 2..7?
Hint 3: this problem is equally easy for the case of 4sided dice.
Neural networks
 Another derivation of the Hopfield network
Think of associative memory as an inference problem.
Imagine that you know that a set of N memories
{x} (with elements {+/1}
have given rise to `data' in the form of the Hebbian weight matrix
W = sum x x^T.
Imagine furthermore that a noisy version y of one of the memories x0
is provided.
Your task now is to infer x0
given the available information, W, y, and N.
Some approximations are required in order to get a simple
answer out.
Write down the posterior of x0.
The contribution from y to the likelihood of x0 is a simple bias.
The central quantity to think about is the other likelihood factor,
P(Wx0).
Approximate this quantity by assuming that, conditional on x0,
W_ij is Gaussiandistributed with mean
mu=x0_i x0_j
and variance (N1)
(treating the contributions from the (N1) other memories
as independent).
Approximate the likelihood factor P(Wx0) by
prod_{ij} P(W_{ij}x0).
[An approximation since in fact the other patterns {x} introduce
weak dependencies among the W_{ij}.]
Now write down dynamics for each variable x^0_i that greedily maximize
the posterior probability P(x0W,y).
What have you got?
Thanks to Máté Lengyel for suggesting this exercise.
Lyapunov functions

Once you have understood `Southeast', you may be able
to tackle Conway's Soldiers.
See the problem definition at mathworld;
for a nice applet that can (with one click) embody Conway's soldiers go
here.
The exercise is this: find a Lyapunov function that can be used to prove that
it's impossible to hop a solitaire peg a distance greater than 4 from a wall of pegs.
Solution can be found at Plus, though
their image explaining the Lyapunov function is off by one row.

Extra topics I would have liked to include in the book.
 Some notes on compression using runlength codes (one page)
 postscript
 pdf 
 Randomized algorithms
 The SwendsenWang algorithm 
pdf 
 Cycle graphs and random permutations 
pdf 
 Counting trees (coming soon)
 Cuckoo hashes
 Bloom filters
 Nested sampling
 Raptor codes

Randomized Algorithms
Consider the task of sorting N objects (eg numbers).
A possible randomized algorithm works as follows.

Pick one of the objects at random, and pretend that it is
the median of the set.
Compare all (N1) others with it, thus dividing
the others into two sets of size A and B.
Certainly A+B = N1; if we got lucky,
A ~= N/2.
Maybe, typically (in some sense), A ~= N/3 and B ~= 2N/3,
or vice versa.

Repeat step 1 on each of the two sets (until A = 1 and B = 1).
If we get lucky at every step (i.e., we hit the true median every time)
then the number of
comparisons of objects with each other will be
pretty accurately
C_{min} = N log_{2} N.
Failing to hit the median means that our binary tree will be
lopsided, and more comparisons will be needed. How many comparisons?
We still expect something that scales as N ln N,
but the base of the logarithm will be different.
The "2" in the "log_{2}" in
C_{min}(N) = N log_{2} N
is there because the perfect median choice makes a balanced
binary tree, and log_{2} N is the depth of that
tree. The intuition is:
When we make an unbalanced tree, the "2" changes to a number smaller
than 2; as we'll see, it turns out to be roughly 1.6.
C_{randomized algorithm} ~= N log_{1.6} N
There are two rather different ways to prove this result.
Recurrence relation proof
Let the average cost, in comparisons, of sorting N items by
the randomized algorithm be T(N).
Then, considering the first step, with the pseudomedian being
drawn at random with u being uniform distributed between 0 and 1,
T(N) ~= N + < T(uN) > + < T((1u)N) >
where < T((1u)N) > denotes the average of T((1u)N),
and we've neglected the distinction between N and N1.
We need to solve this recurrence relation, with boundary condition
T(1)=T(0)=0.
Let's guess that
T(N) = a N ln N
and solve for a
Then
T(N) ~= N + T(N)  a N < H_2^(e)(u) >
For consistency this gives
1 = a < H_2^(e)(u) >
Carrying out the integral we find the neat result:
a = 2.
So the base of the logarithm  the number formerly known as `1.6' 
is in fact sqrt(e).
`Means add' proof
Another neat derivation of the cost of the randomized sort
algorithm, which gives little insight into the correct answer,
but generates it quite easily,
runs as follows.
Imagine that the numbers are already sorted, and
consider two numbers whose ranks in the list of size $N$ are $m$ and $n$, with $m < n$.
When we apply the algorithm, what is the chance $c_{mn}$ that $m$ will, at some
point along the way, get compared with $n$?
Well, if any of the $nm1$ numbers between $m$ and $n$ gets selected before
both $m$ and $n$, then the subsequent comparisons will separate $m$ and $n$ into
separate sets, so that they never get compared with each other.
On the other hand, if one of $m$ and $n$ is selected before any of those $nm1$ numbers,
then $m$ and $n$ will instantly be compared with each other in the
next comparison step.
Thus
c_{mn} = FRACTION {2}{nm+1} ;
and the expected number of comparisons
is the sum over all pairs,
C(N) = {m,n: m < n} FRACTION {2}{nm+1}

Comparison of Information Theory, Inference, and Learning Algorithms
with Harry Potter
OK, you're tempted to buy MacKay's book, but you're not sure whether
it's the best deal around?
Let's compare it with another textbook with a similar sales rank.

Information Theory, Inference, and Learning Algorithms
David J.C. MacKay

Harry Potter and the Philosopher's Stone
J.K Rowling

Sales rank
amazon.co.uk, Mon 5/1/04 
5,667 
2,486 
List Price (UK) 
£30.00 
£11.99 
Number of pages 
640 
223 
Cost per page 
4.6p 
5.4p 
Has large margins for making notes? 
Yes 
No 
Includes comprehensive index? 
Yes 
No 
Number of tables, figures, and algorithms 
462 
0 
Free software provided for use of teachers and students? 
Yes 
No 
Number of exercises 
More than 400 
None 
Entire book available for free online viewing? 
Yes 
No 
Also available in paperback? 
No 
Yes 
Available in Latin translation? 
No 
Yes 
Available from Barnes and Noble?
(on Mon 5/1/04)

Yes 
Only in Latin and Welsh translations 

Table 1 
The comparisons in Table 1 show that the popular textbook by Rowling
has a lower cover price than MacKay's new work.
However, the prospective buyer should take into account
the significantly greater length of MacKay's text:
both textbooks have a cost per page of about 5 pence.
In terms of userfriendliness, MacKay's text has the
edge when it comes to its thorough index, the rich use
of figures, the provision of free software for the use of
the reader, and the profusion of examples and exercises, many with worked solutions.
A strong selling point for MacKay's book is its free availability for
online viewing. Unlike Rowling, who is notoriously secretive
about her textbooks before they are published, MacKay
made his entire book available online while it was a work in progress.
A possible weakness of MacKay's product is that it is only available in
hardback.
Native speakers of Latin may also prefer Rowling's text, since
a full translation into Latin is now available.
The issue that must clinch the choice, however, is availability.
Our reviewer tried to purchase both texts from Barnes and Noble on Mon 5/1/04,
and found that the Rowling text is not available.^{1}
In conclusion, we can give a cautious recommendation of Harry Potter
only to speakers of Welsh and Latin;
for everyone else, we recommend Information Theory, Inference, and Learning Algorithms.
Footnotes
[1] Rowling's text has
however been translated into American, and released under a different title.

Related books

Probability and Computing: Randomized Algorithms and Probabilistic Analysis
by
Michael Mitzenmacher and Eli Upfal
amazon.co.uk /
.com

Probability Theory: The Logic of Science
by
E T Jaynes (G.Larry Bretthorst, Editor)
amazon.co.uk /
.com

Any Questions?
If you have a query about the book that you think
other people might also ask, please use one of the following links
to submit your query through metaFAQ; you may find the answer
is already there:

Query categories
Any other query


Site last modified Mon Oct 29 12:10:08 GMT 2012

