- Open Access
Evidence that the delay-period activity of dopamine neurons corresponds to reward uncertainty rather than backpropagating TD errors
Behavioral and Brain Functionsvolume 1, Article number: 7 (2005)
We previously demonstrated the presence of delay-period activity in midbrain dopamine neurons, and provided evidence that this activity corresponds to uncertainty about reward. An alternative interpretation of our observations was recently put forth in which it was suggested that the delay-period activity corresponds not to uncertainty but to backpropagating TD prediction errors. Here we present evidence that supports our original proposal but appears inconsistent with the alternative interpretation involving backpropagating errors.
Because the activity of dopamine neurons appears to code reward prediction error, it has been suggested that dopamine neurons may provide a teaching signal in analogy to the prediction error found in temporal difference (TD) models of reinforcement learning. Taking the analogy a step further, it has also been proposed that particular TD models may describe the activity of dopamine neurons [1, 2]. More recently, we have reported that dopamine neurons show a gradual increase in activity that occurs between onset of a conditioned stimulus (CS) and reward when the CS is associated with uncertainty about the reward outcome . Niv et al  have now suggested how a conventional TD model might account for this observation without reference to uncertainty.
Their explanation relies on the fact that, in certain TD models, prediction errors "backpropagate" in time over consecutive CS presentations. In our experiments, on a particular trial a prediction error occurs immediately after reward onset, which occurs 2 seconds after CS onset. According to the backpropagation model favored by Niv et al, on the next trial in which that same CS is presented, an internally timed "prediction error" would occur at a shorter delay, perhaps at 1.9 seconds after CS onset. On each subsequent trial, the error would occur at a shorter delay until finally it immediately follows the onset of the CS. This model would require that neurons show sudden increases or decreases in activity at long but precisely timed delays after stimulus onset. Although the implementation of such a scheme by real neurons is questionable, it nonetheless might account for the observed delay period activation if one makes the additional assumption that neuronal firing rate has a particular nonlinear relationship to prediction error. For example, Niv et al argue that the difference between 1 and 2 spikes per second has a much greater functional impact in terms of prediction error than the difference between 9 and 10 spikes per second. Thus, adding activity across trials, as we did to generate histograms, would result in the appearance of neuronal activation despite the fact that the average activity at all times (except immediately after CS onset) would correspond to a prediction error of zero. Below we present some of the reasons that we are skeptical of the interpretation of Niv et al.
First, the nonlinear relationship suggested by Niv et al between the firing rate of dopamine neurons and the functional prediction error is opposite to the experimentally observed nonlinear relationship between firing rate and dopamine concentration in mesolimbic target regions. Chergui et al  found that there is more extracellular dopamine per impulse at higher firing rates than at lower firing rates.
Second, inspection of the published data appears inconsistent with the model of Niv et al. They suggest that the delay-period activity is an artifact of averaging over trials to generate histograms, and that the sustained increase in activity does not occur in single trials. Contrary to their proposal, there does appears to be strong and sustained activation within single trials, as shown in figure 2 of our original report  and in data from another neuron shown here in figure 1A. It is difficult to be certain whether or not activity increases on single trials, in part because Niv et al have not specified precisely what a single-trial increase in delay-period activity should look like, and in part because of the general problem within neuroscience of how to interpret spike trains. Indeed, it would seem that any spike could conceivably represent a backpropagating positive error, and any inter-spike interval could correspond to a negative error. However, if we take a more constrained, conventional approach based on firing rates over tens of milliseconds, then firing rate appears to increase during the delay period on single trials. Similarly, gradual changes in neural activity related to reward expectation are observed in many other types of neuron [[6, 7], for example], and are widely believed to represent meaningful increases in activity on single trials rather than artifacts of averaging over trials.
Third, additional analysis of the data (averaged over trials) challenges the interpretation of Niv et al. If the activity during the delay period is due to backpropagating "error" signals that originated in previous trials, then the activity in the last part of the delay period should reflect the reward outcome that followed the last exposure to that same CS. Thus there should be more activity at the end of the delay period if the last trial was rewarded, and less if it was unrewarded. We have analyzed trials in which the CS predicted reward at p = 0.5, and found no dependence of neural activity on the outcome of the preceding trial of the same CS (Fig. 1A,B) (comparing either the last 100 or 500 ms before reward: p > 0.05 in 51 of 54 neurons, Mann-Whitney test; p > 0.4 for the population of 54 neurons, Wilcoxon test). Thus the delay-period activity does not appear to depend on the outcome of the last trial, as suggested by Niv et al.
Fourth, our more recently published results  are inconsistent with the model of Niv et al. Each of three conditioned stimuli predicted two potential reward outcomes of equal probability. The discrepancy in liquid volume between the two potential reward outcomes varied according to the CS. The greater the discrepancy, the more pronounced was the sustained, ramp-like increase in neural activity (Fig 2A) . However, the phasic response following reward (or omission of reward) was identical across the three conditions, revealing an adaptation of the prediction error response to the expected discrepancy in reward magnitude (Fig. 2B) . If one were to incorporate these recently published results  into the backpropagation TD model of Niv et al, then one would find that since the reward prediction error response at the end of each trial in these experiments is the same, the delay-period activity representing the backpropagating errors would also be the same. However, the data are inconsistent with the model, since the delay period activity increases with the discrepancy between potential reward magnitudes (Fig. 2A) . Our results  show that although the phasic activity of dopamine neurons corresponds well to a general definition of reward prediction error, it is inconsistent with the explanation of the delay period activity proposed by Niv et al.
Fifth, it should be noted that the backpropagating prediction error in the model of Niv et al does not reflect an inherent necessity of TD models, but is rather a consequence of the specific temporal stimulus representation chosen. The implementation of different temporal stimulus representations can lead to quite different results. The original TD model  and recent versions  have used temporal stimulus representations in which the transfer of the neuronal response to the CS is accomplished in a manner that appears more biologically plausible than backpropagation. In TD models utilizing backpropagation, neural signals during the delay period are precisely timed but are without functional consequence, since the sequence of positive and negative errors are self-generated (occurring in the absence of any external events) but are presumed to cancel each other out. This strikes us an odd notion that is neither efficient, nor elegant, nor necessary to the principles of TD learning.
When discrepancies between TD models and responses of dopamine neurons have been noted in the past, such as the absence of depression at the usual time of reward on trials in which reward is delivered earlier than usual , TD models have been modified accordingly to better describe the neural activity [10, 12, 13]. Although TD models have proven very useful, one would not necessarily expect to find the formal structure of any current TD model implemented in the brain. Present TD models exhibit a number of characteristics that appear to be motivated by the need for simplification rather than by any empirical or theoretical constraint. For example, the prediction in TD models is typically equated with expected reward value and ignores the uncertainty in prediction. To illustrate this in terms familiar to people (and perhaps also of relevance to dopamine neurons), a 10% chance of gaining $100 is clearly not equivalent to a 100% chance of gaining $10, yet TD models do not discriminate amongst these two scenarios. TD models have evolved over the years to become more useful and realistic. We believe this process will continue and hope that the study of neurons might be helpful in this regard.
Montague PR, Dayan P, Sejnowski TJ: A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci. 1996, 16: 1936-1947.
Schultz W, Dayan P, Montague PR: A neural substrate of prediction and reward. Science. 1997, 275: 1593-1599. 10.1126/science.275.5306.1593.
Fiorillo CD, Tobler PN, Schultz W: Discrete coding of reward probability and uncertainty by dopamine neurons. Science. 2003, 299: 1898-1902. 10.1126/science.1077349.
Niv Y, Duff MO, Dayan P: Dopamine, uncertainty and TD learning. Behav Brain Func. 2005, 1: 6-10.1186/1744-9081-1-6.
Chergui K, Suaud-Chagny MF, Gonon F: Nonlinear relationship between impulse flow, dopamine release and dopamine elimination in the rat brain in vivo. Neuroscience. 1994, 62: 641-645. 10.1016/0306-4522(94)90465-0.
Takikawa Y, Kawagoe R, Hikosaka O: Reward-dependent spatial selectivity of anticipatory activity in monkey caudate neurons. J Neurophysiol. 2002, 87: 508-515.
Janssen P, Shadlen MN: A representation of the hazard rate of elapsed time in macaque area LIP. Nat Neurosci. 2005, 8: 234-241. 10.1038/nn1386.
Tobler PN, Fiorillo CD, Schultz W: Adaptive coding of reward value by dopamine neurons. Science. 2005, 307: 1642-1645. 10.1126/science.1105370.
Sutton RS, Barto AG: Toward a modern theory of adaptive networks: expectation and prediction. Psychol Rev. 1981, 88: 135-170. 10.1037//0033-295X.88.2.135.
Suri RE, Schultz W: A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience. 1999, 91: 871-890. 10.1016/S0306-4522(98)00697-6.
Hollerman JR, Schultz W: Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neurosci. 1998, 1: 304-309. 10.1038/1124.
Brown J, Bullock D, Grossberg S: How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. J Neurosci. 1999, 19: 10502-10511.
Kakade S, Dayan P: Dopamine: generalization and bonuses. Neural Networks. 2002, 15: 549-559. 10.1016/S0893-6080(02)00048-5.
This work was supported by the Wellcome Trust (W.S. and P.N.T.) and the Howard Hughes Medical Institute (C.D.F.).