Solution of a Toy Problem by Reinforcement Learning

David J C MacKay, Peter Latham, and Gatsby Unit Colleagues

Problem statement
You get to toss coins, each of which is either of type H, having probabilities favouring heads $\{(1\!-\!f),f\}$ or of type T, with probabilities $\{f, (1\!-\!f)\}$, where $f$ is a known bias parameter such as $f=0.1$. At each time step, one may either {\em{continue\/}} tossing the current coin, or one may {\em{declare\/}} a guess for whether it is of type H or T. Declarations are rewarded if accurate and penalized if incorrect. The cost of continuing is zero. The cost of declaring incorrectly is a penalty of $R_{-}$; declaring correctly yields a reward of $R_{+}$. To make the problem interesting, we set $R_{+} < R_{-}$. The task is to choose a {\em{strategy}\/} that maximizes the average return per unit time. We include no discounting in the problem. Once you have declared for coin $t$ and been rewarded or penalized, you get to move on to coin $n+1$. There is an infinite supply of coins, all with the same bias $f$. The two types $H$ and $T$ are equiprobable. Declaring takes no time: when you declare, you immediately get to toss the next coin.

postscript (Cambridge UK).

postscript (Canada mirror).

pdf (Cambridge UK).

pdf (Canada mirror).

All postscript files are compressed with gzip - see this page for advice about gzip, if needed.


related publications.
David MacKay's: home page, publications. bibtex file.
Canadian mirrors: home page, publications. bibtex file.