Evidence that the delay-period activity of dopamine neurons corresponds to reward uncertainty rather than backpropagating TD errors
© Fiorillo et al; licensee BioMed Central Ltd. 2005
Received: 28 May 2005
Accepted: 15 June 2005
Published: 15 June 2005
We previously demonstrated the presence of delay-period activity in midbrain dopamine neurons, and provided evidence that this activity corresponds to uncertainty about reward. An alternative interpretation of our observations was recently put forth in which it was suggested that the delay-period activity corresponds not to uncertainty but to backpropagating TD prediction errors. Here we present evidence that supports our original proposal but appears inconsistent with the alternative interpretation involving backpropagating errors.
Because the activity of dopamine neurons appears to code reward prediction error, it has been suggested that dopamine neurons may provide a teaching signal in analogy to the prediction error found in temporal difference (TD) models of reinforcement learning. Taking the analogy a step further, it has also been proposed that particular TD models may describe the activity of dopamine neurons [1, 2]. More recently, we have reported that dopamine neurons show a gradual increase in activity that occurs between onset of a conditioned stimulus (CS) and reward when the CS is associated with uncertainty about the reward outcome. Niv et al have now suggested how a conventional TD model might account for this observation without reference to uncertainty.
Their explanation relies on the fact that, in certain TD models, prediction errors "backpropagate" in time over consecutive CS presentations. In our experiments, on a particular trial a prediction error occurs immediately after reward onset, which occurs 2 seconds after CS onset. According to the backpropagation model favored by Niv et al, on the next trial in which that same CS is presented, an internally timed "prediction error" would occur at a shorter delay, perhaps at 1.9 seconds after CS onset. On each subsequent trial, the error would occur at a shorter delay until finally it immediately follows the onset of the CS. This model would require that neurons show sudden increases or decreases in activity at long but precisely timed delays after stimulus onset. Although the implementation of such a scheme by real neurons is questionable, it nonetheless might account for the observed delay period activation if one makes the additional assumption that neuronal firing rate has a particular nonlinear relationship to prediction error. For example, Niv et al argue that the difference between 1 and 2 spikes per second has a much greater functional impact in terms of prediction error than the difference between 9 and 10 spikes per second. Thus, adding activity across trials, as we did to generate histograms, would result in the appearance of neuronal activation despite the fact that the average activity at all times (except immediately after CS onset) would correspond to a prediction error of zero. Below we present some of the reasons that we are skeptical of the interpretation of Niv et al.
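The backpropagation scheme described above can be made concrete with a minimal sketch. It assumes a TD(0) learner with one learned value per 100 ms time step after CS onset (20 steps for the 2 second delay), a learning rate of 0.1, no discounting, and a hypothetical gain of 1/6 applied to negative errors to stand in for the assumed nonlinear relationship between firing rate and prediction error. The function names and parameter values are illustrative assumptions and are not taken from the model of Niv et al or from our experiments.

```python
import random


def td_errors_per_trial(outcomes, n_steps=20, alpha=0.1, gamma=1.0):
    """Run TD(0) over consecutive trials of one CS.

    Each of the n_steps time steps after CS onset has its own learned value
    (a tapped delay line). outcomes[i] is the reward delivered at the end of
    trial i. Returns one list of n_steps prediction errors per trial.
    """
    values = [0.0] * (n_steps + 1)  # values[n_steps]: after the trial, fixed at 0
    history = []
    for reward in outcomes:
        errors = []
        for t in range(n_steps):
            r = reward if t == n_steps - 1 else 0.0
            delta = r + gamma * values[t + 1] - values[t]
            values[t] += alpha * delta
            errors.append(delta)
        history.append(errors)
    return history


def first_error_step(errors):
    """Index of the earliest nonzero error in a trial, or None."""
    for t, delta in enumerate(errors):
        if delta != 0.0:
            return t
    return None


def scaled(delta, negative_gain=1.0 / 6.0):
    """Hypothetical firing-rate change: negative errors are compressed."""
    return delta if delta >= 0.0 else negative_gain * delta


def mean_over_trials(history, step, transform=lambda d: d):
    """Average the (optionally transformed) error at one time step."""
    return sum(transform(errors[step]) for errors in history) / len(history)


# On consecutive rewarded trials the error onset moves one step earlier.
early = td_errors_per_trial([1.0] * 5)
onsets = [first_error_step(errors) for errors in early]

# With p = 0.5, compare raw and asymmetrically scaled errors at the last step.
rng = random.Random(0)
outcomes = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(2000)]
history = td_errors_per_trial(outcomes)
raw_mean = mean_over_trials(history, step=19)
scaled_mean = mean_over_trials(history, step=19, transform=scaled)
```

In this sketch `onsets` is `[19, 18, 17, 16, 15]`: the earliest error moves 100 ms closer to CS onset on each rewarded trial. At p = 0.5 the raw error at the last step averages to nearly zero across trials, whereas the scaled error averages to a positive value, which is the appearance of activation in the averaged histogram that the model relies on.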
First, the nonlinear relationship suggested by Niv et al between the firing rate of dopamine neurons and the functional prediction error is opposite to the experimentally observed nonlinear relationship between firing rate and dopamine concentration in mesolimbic target regions. Chergui et al  found that there is more extracellular dopamine per impulse at higher firing rates than at lower firing rates.
Third, additional analysis of the data (averaged over trials) challenges the interpretation of Niv et al. If the activity during the delay period is due to backpropagating "error" signals that originated in previous trials, then the activity in the last part of the delay period should reflect the reward outcome that followed the last exposure to that same CS. Thus there should be more activity at the end of the delay period if the last trial was rewarded, and less if it was unrewarded. We have analyzed trials in which the CS predicted reward at p = 0.5, and found no dependence of neural activity on the outcome of the preceding trial of the same CS (Fig. 1A,B) (comparing either the last 100 or 500 ms before reward: p > 0.05 in 51 of 54 neurons, Mann-Whitney test; p > 0.4 for the population of 54 neurons, Wilcoxon test). Thus the delay-period activity does not appear to depend on the outcome of the last trial, as suggested by Niv et al.
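The logic of this comparison can be sketched for a single neuron as follows. The sketch assumes that the late delay-period firing rates and outcomes are available as two lists covering consecutive trials of the same CS. The function names are hypothetical, and the p value uses a simplified normal approximation (no tie or continuity correction) rather than the exact procedure we used.

```python
import math


def split_by_previous_outcome(rates, rewarded):
    """Group late-delay firing rates by the outcome of the preceding trial.

    rates[i] is the firing rate late in the delay of trial i and rewarded[i]
    is True if trial i ended in reward. Both lists hold consecutive trials of
    the same CS only. The first trial has no predecessor and is dropped.
    """
    after_reward, after_no_reward = [], []
    for i in range(1, len(rates)):
        (after_reward if rewarded[i - 1] else after_no_reward).append(rates[i])
    return after_reward, after_no_reward


def mann_whitney_u(x, y):
    """U statistic for x and a two-sided p from the normal approximation.

    Simplified: no tie or continuity correction, so p is only approximate.
    """
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n1, n2 = len(x), len(y)
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    return u, math.erfc(abs(z) / math.sqrt(2.0))


# Made-up numbers for illustration only (not recorded data).
rates = [5.0, 6.0, 7.0, 8.0, 9.0]
rewarded = [True, False, True, False, True]
after_reward, after_no_reward = split_by_previous_outcome(rates, rewarded)
u, p = mann_whitney_u(after_reward, after_no_reward)
```

Under the backpropagation account, the rates in `after_reward` should tend to exceed those in `after_no_reward`, giving a small p value. The absence of such a difference is what we observed.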
Fourth, our more recently published results are inconsistent with the model of Niv et al. Each of three conditioned stimuli predicted two potential reward outcomes of equal probability. The discrepancy in liquid volume between the two potential reward outcomes varied according to the CS. The greater the discrepancy, the more pronounced was the sustained, ramp-like increase in neural activity (Fig. 2A). However, the phasic response following reward (or omission of reward) was identical across the three conditions, revealing an adaptation of the prediction error response to the expected discrepancy in reward magnitude (Fig. 2B). If these results are incorporated into the backpropagation TD model of Niv et al, the model predicts that, because the reward prediction error response at the end of each trial is the same across conditions, the delay-period activity representing the backpropagating errors should also be the same. The data contradict this prediction, since the delay-period activity increases with the discrepancy between potential reward magnitudes (Fig. 2A). Our results thus show that although the phasic activity of dopamine neurons corresponds well to a general definition of reward prediction error, the delay-period activity cannot be explained in the manner proposed by Niv et al.
Fifth, it should be noted that the backpropagating prediction error in the model of Niv et al does not reflect an inherent necessity of TD models, but is rather a consequence of the specific temporal stimulus representation chosen. The implementation of different temporal stimulus representations can lead to quite different results. The original TD model and recent versions have used temporal stimulus representations in which the transfer of the neuronal response to the CS is accomplished in a manner that appears more biologically plausible than backpropagation. In TD models utilizing backpropagation, neural signals during the delay period are precisely timed but are without functional consequence, since the positive and negative errors are self-generated (occurring in the absence of any external events) but are presumed to cancel each other out. This strikes us as an odd notion that is neither efficient, nor elegant, nor necessary to the principles of TD learning.
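That stepwise backpropagation is a property of one particular scheme rather than of TD learning in general can be illustrated with a minimal sketch. It keeps one value per time step but adds eligibility traces (TD(λ)). This is a swapped-in technique chosen for brevity and is not the stimulus representation used in the models cited above. The function name, the parameter values, and the assumption that the value before CS onset is zero are all hypothetical.

```python
def second_trial_errors(lam, n_steps=20, alpha=0.1, gamma=1.0):
    """Train on one rewarded trial, then return the errors on the next trial.

    Returns (error_at_cs_onset, errors_during_trial) for the second trial.
    lam is the eligibility-trace decay: lam = 0 gives the one-step scheme,
    lam = 1 credits every time step since CS onset when the reward arrives.
    """
    values = [0.0] * (n_steps + 1)  # values[n_steps]: after the trial, fixed at 0

    def run_trial(reward):
        cs_error = gamma * values[0]  # value before CS onset is assumed to be 0
        traces = [0.0] * n_steps
        errors = []
        for t in range(n_steps):
            traces = [gamma * lam * e for e in traces]
            traces[t] += 1.0
            r = reward if t == n_steps - 1 else 0.0
            delta = r + gamma * values[t + 1] - values[t]
            for s in range(n_steps):
                values[s] += alpha * delta * traces[s]
            errors.append(delta)
        return cs_error, errors

    run_trial(1.0)          # first rewarded trial (learning only)
    return run_trial(1.0)   # errors observed on the second trial


cs_one_step, delay_one_step = second_trial_errors(lam=0.0)
cs_traces, delay_traces = second_trial_errors(lam=1.0)
```

With `lam=0.0` the second trial shows no error at CS onset and a new error one step before the reward, which is the backpropagation pattern. With `lam=1.0` an error appears at CS onset already on the second trial and every error before the reward is zero, so nothing marches backward through the delay.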
This work was supported by the Wellcome Trust (W.S. and P.N.T.) and the Howard Hughes Medical Institute (C.D.F.).
- Montague PR, Dayan P, Sejnowski TJ: A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci. 1996, 16: 1936-1947.
- Schultz W, Dayan P, Montague PR: A neural substrate of prediction and reward. Science. 1997, 275: 1593-1599. 10.1126/science.275.5306.1593.
- Fiorillo CD, Tobler PN, Schultz W: Discrete coding of reward probability and uncertainty by dopamine neurons. Science. 2003, 299: 1898-1902. 10.1126/science.1077349.
- Niv Y, Duff MO, Dayan P: Dopamine, uncertainty and TD learning. Behav Brain Func. 2005, 1: 6. 10.1186/1744-9081-1-6.
- Chergui K, Suaud-Chagny MF, Gonon F: Nonlinear relationship between impulse flow, dopamine release and dopamine elimination in the rat brain in vivo. Neuroscience. 1994, 62: 641-645. 10.1016/0306-4522(94)90465-0.
- Takikawa Y, Kawagoe R, Hikosaka O: Reward-dependent spatial selectivity of anticipatory activity in monkey caudate neurons. J Neurophysiol. 2002, 87: 508-515.
- Janssen P, Shadlen MN: A representation of the hazard rate of elapsed time in macaque area LIP. Nat Neurosci. 2005, 8: 234-241. 10.1038/nn1386.
- Tobler PN, Fiorillo CD, Schultz W: Adaptive coding of reward value by dopamine neurons. Science. 2005, 307: 1642-1645. 10.1126/science.1105370.
- Sutton RS, Barto AG: Toward a modern theory of adaptive networks: expectation and prediction. Psychol Rev. 1981, 88: 135-170. 10.1037//0033-295X.88.2.135.
- Suri RE, Schultz W: A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience. 1999, 91: 871-890. 10.1016/S0306-4522(98)00697-6.
- Hollerman JR, Schultz W: Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neurosci. 1998, 1: 304-309. 10.1038/1124.
- Brown J, Bullock D, Grossberg S: How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. J Neurosci. 1999, 19: 10502-10511.
- Kakade S, Dayan P: Dopamine: generalization and bonuses. Neural Networks. 2002, 15: 549-559. 10.1016/S0893-6080(02)00048-5.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.