Dopamine, uncertainty and TD learning
© Niv et al. 2005
Received: 12 February 2005
Accepted: 04 May 2005
Published: 04 May 2005
Substantial evidence suggests that the phasic activities of dopaminergic neurons in the primate midbrain represent a temporal difference (TD) error in predictions of future reward, with increases above and decreases below baseline consequent on positive and negative prediction errors, respectively. However, dopamine cells have very low baseline activity, which implies that the representation of these two sorts of error is asymmetric. We explore the implications of this seemingly innocuous asymmetry for the interpretation of dopaminergic firing patterns in experiments with probabilistic rewards which bring about persistent prediction errors. In particular, we show that when averaging the non-stationary prediction errors across trials, a ramping in the activity of the dopamine neurons should be apparent, whose magnitude depends on the learning rate. This exact phenomenon was observed in a recent experiment, although it was interpreted there in antipodal terms, as a within-trial encoding of uncertainty.
There is an impressively large body of physiological, imaging, and psychopharmacological data regarding the phasic activity of dopaminergic (DA) cells in the midbrains of monkeys, rats and humans in classical and instrumental conditioning tasks involving predictions of future rewards [1–5]. These data have been taken to suggest [6, 7] that the activity of DA neurons represents temporal difference (TD) errors in the predictions of future reward [8, 9]. This TD theory of dopamine provides a precise computational foundation for understanding a host of behavioural and neural data. Furthermore, it suggests that DA provides a signal that is theoretically appropriate for controlling learning of both predictions and reward-optimising actions.
Some of the most compelling evidence in favour of the TD theory comes from studies investigating the phasic activation of dopamine cells in response to arbitrary stimuli (such as fractal patterns on a monitor) that predict the proximate availability of rewards (such as drops of juice). In many variants, these have shown that with training, phasic DA signals transfer from the time of the initially unpredictable reward, to the time of the earliest cue predicting a reward. This is exactly the expected outcome for a temporal-difference based prediction error (e.g. [1, 2, 10–13]). The basic finding is that when a reward is unexpected (which is inevitable in early trials), dopamine cells respond strongly to it. When a reward is predicted, however, the cells respond to the predictor, and not to the now-expected reward.
If a predicted reward is unexpectedly omitted, then the cells are phasically inhibited at the normal time of the reward, an inhibition which reveals the precise timing of the reward prediction, and whose temporal metrics are currently under a forensic spotlight. The shift in activity from the time of reward to the time of the predictor resembles the shift of the animal's appetitive behavioural reaction from the time of the reward (the unconditioned stimulus) to that of the conditioned stimulus in classical conditioning experiments [7, 10].
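The transfer of the prediction error from the reward to the predictor can be reproduced in a few lines of simulation. The following is a minimal sketch rather than the authors' code: it assumes a tapped delay-line representation (one learned value per timestep after stimulus onset), no discounting, a unit reward delivered on every trial, and illustrative values for the number of timesteps, the number of trials and the learning rate. The function name `simulate_td_errors` is hypothetical.

```python
import numpy as np

def simulate_td_errors(p_reward=1.0, n_steps=10, n_trials=500, alpha=0.3, seed=0):
    """Return TD errors of shape (n_trials, n_steps); column k holds delta at time k + 1."""
    rng = np.random.default_rng(seed)
    # V[0] is the pre-stimulus state and V[n_steps] the post-reward state.
    # Both stay at 0: stimulus onset is unpredictable and nothing follows the reward.
    V = np.zeros(n_steps + 1)
    deltas = np.zeros((n_trials, n_steps))
    for trial in range(n_trials):
        r = np.zeros(n_steps)
        r[-1] = float(rng.random() < p_reward)  # reward (if any) at the last timestep
        delta = r + V[1:] - V[:-1]              # delta(t) = r(t) + V(t) - V(t-1)
        deltas[trial] = delta
        V[1:-1] += alpha * delta[1:]            # V(t-1) += alpha * delta(t) for t >= 2
    return deltas

deltas = simulate_td_errors()
first, last = deltas[0], deltas[-1]
print("first trial:", np.round(first, 3))
print("last trial: ", np.round(last, 3))
```

Under these assumptions, the only non-zero error in the first trial is at the time of the reward, whereas after training the error appears at stimulus onset and the error at the time of the now-predicted reward has vanished.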
In a most interesting recent study, Fiorillo et al.  examined the case of partial reinforcement, in which there is persistent, ineluctable, prediction error on every single trial. A straightforward interpretation of the TD prediction error hypothesis would suggest that in this case (a) dopamine activity at the time of the predictive stimuli would scale with the probability of reward, and (b) on average over trials, the dopaminergic response after the stimulus and all the way to the time of the reward, should be zero. Although the first hypothesis was confirmed in the experiments, the second was not. The between-trial averaged responses showed a clear ramping of activity during the delay between stimulus onset and reward that seemed inconsistent with the TD account. Fiorillo et al. hypothesised that this activity represents the uncertainty in reward delivery, rather than a prediction error.
In this paper, we visit the issue of persistent prediction error. We show that a crucial asymmetry in the coding of positive and negative prediction errors leads one to expect the ramping in the between-trial average dopamine signal, and also accounts well for two further features of the DA signal – apparent persistent activity at the time of the (potential) reward, and disappearance (or at least weakening) of the ramping signal, but not the signal at the time of reward, in the face of trace rather than delay conditioning. Both of these phenomena have also been observed in the related instrumental conditioning experiments of Morris et al. Finally, we interpret the ramping signal as the best evidence available at present for the nature of the learning mechanism by which the shift in dopamine activity to the time of the predictive stimuli occurs.
Uncertainty in reward occurrence: DA ramping
Fiorillo et al. associated the presentation of five different visual stimuli to macaques with the delayed, probabilistic (p_r = 0, 0.25, 0.5, 0.75, 1) delivery of juice rewards. They used a delay conditioning paradigm, in which the stimulus persists for a fixed interval of 2 s, with reward being delivered when the stimulus disappears. After training, the monkeys' anticipatory licking behavior indicated that they were aware of the different reward probabilities associated with each stimulus.
By contrast, at the time of potential reward delivery, TD theory predicts that on average there should be no activity, as, on average, there is no prediction error at that time. Of course, in the probabilistic reinforcement design (at least for p_r ≠ 0, 1) there is in fact a prediction error at the time of delivery or non-delivery of reward on every single trial. On trials in which a reward is delivered, the prediction error should be positive (as the reward obtained is larger than the average reward expected). Conversely, on trials with no reward it should be negative (see Figure 1c). Crucially, under TD, the average of these differences, weighted by their probabilities of occurring, should be zero. If it is not zero, then this prediction error should act as a plasticity signal, changing the predictions until there is no prediction error. At variance with this expectation, the data in Figure 1a, which are averaged over both rewarded and unrewarded trials, show that there is in fact positive mean activity at this time. This is also evident in the data of Morris et al. (see Figure 3c). The positive DA responses show no signs of disappearing even with substantial training (over the course of months).
Worse than this for the TD model, and indeed the focus of Fiorillo et al., is the apparent ramping of DA activity towards the expected time of the reward. As the magnitude of the ramp is greatest for p_r = 0.5, Fiorillo et al. suggested that it reports the uncertainty in reward delivery, rather than a prediction error, and speculated that this signal could explain the apparently appetitive properties of uncertainty (as seen in gambling).
Both the ramping activity and the activity at the expected time of reward pose critical challenges to the TD theory. TD learning operates by arranging for DA activity at one time in a trial to be predicted away by cues available earlier in that trial. Thus, it is not clear how any seemingly predictable activity, be it that at the time of the reward or in the ramp before, can persist without being predicted away by the onset of the visual stimulus. After all, the p_r-dependent activity in response to the stimulus confirms its status as a valid predictor. Furthermore, a key aspect of TD is that it couples prediction to action choice by using the value of a state as an indication of the future rewards available from that state, and therefore its attractiveness as a target for action. From this perspective, since the ramping activity is explicitly not predicted by the earlier cue, it cannot influence early actions, such as the decision to gamble. For instance, consider a competition between two actions: one eventually leading to a state with a deterministic reward and therefore no ramp, and the other leading to a state followed by a probabilistic reward with the same mean, and a ramp. Since the ramp does not affect the activity at the time of the conditioned stimulus, it cannot be used to evaluate or favour the second action (gambling) over the first, despite the extra uncertainty.
We suggest the alternative hypothesis that both these anomalous firing patterns result directly from the constraints implied by the low baseline rate of activity of DA neurons (2–4 Hz) on the coding of the signed prediction error. As noted by Fiorillo et al., positive prediction errors are represented by firing rates of ~270% above baseline, while negative errors are represented by a decrease of only ~55% below baseline (see also [14, 18]). This asymmetry is a straightforward consequence of coding a signed quantity with a firing rate that has a low baseline and, obviously, can only be positive. Firing rates above baseline can encode positive prediction errors using a large dynamic range, whereas firing rates below baseline can only go down to zero, which restricts the coding of negative prediction errors.
Consequently, one has to be careful interpreting the sums (or averages) of peri-stimulus-time-histograms (PSTHs) of activity over different trials, as was done in Figure 1a. The asymmetrically coded positive and negative error signals at the time of the receipt or non-receipt of reward should indeed not sum up to zero, even if they represent correct TD prediction errors. When summed, the low firing representing the negative errors in the unrewarded trials will not "cancel out" the rapid firing encoding positive errors in the rewarded trials, and, overall, the average will show a positive response. In the brain, of course, as responses are not averaged over (rewarded and unrewarded) trials, but over neurons within a trial, this need not pose a problem.
This explains the persistent positive activity (on average) at the time of delivery or non-delivery of the reward. But what about the ramp prior to this time? At least in certain neural representations of the time between stimulus and reward, when trials are averaged, this same asymmetry leads TD to result exactly in a ramping of activity toward the time of the reward. The TD learning mechanism has the effect of propagating, on a trial-by-trial basis, prediction errors that arise at one time in a trial (such as at the time of the reward) towards potential predictors (such as the CS) that arise at earlier times within each trial. Under the asymmetric representation of positive and negative prediction errors that we have just discussed, averaging these propagating errors over multiple trials (as in Figure 1a) will lead to positive means for epochs within a trial before a reward. The precise shape of the resulting ramp of activity depends on the way stimuli are represented over time, as well as on the speed of learning, as will be discussed below.
Figures 1b,d show the ramp arising from this combination of asymmetric coding and inter-trial averaging, for comparison with the experimental data. Figure 1b shows the PSTH computed from our simulated data by averaging over the asymmetrically-represented δ(t) signal in ~50 trials for each stimulus type. Figure 1d shows the results for the p_r = 0.5 case, divided into rewarded and unrewarded trials for comparison with Figure 1c. The simulated results resemble the experimental data closely in that they replicate the net positive response to the uncertain rewards, as well as the ramping effect, which is highest in the p_r = 0.5 case.
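The combination of asymmetric coding and inter-trial averaging can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of the simulations behind Figure 1: it assumes a tapped delay-line representation, no discounting, a unit reward, and illustrative values for the learning rate, the number of timesteps and the number of trials. The scaling factor d = 0.2 for negative errors is an assumption loosely motivated by the ratio of the ~55% decrease to the ~270% increase quoted above; the function names are hypothetical.

```python
import numpy as np

def simulate_td_errors(p_reward, n_steps=10, n_trials=5000, alpha=0.3, seed=0):
    """Return TD errors of shape (n_trials, n_steps); column k holds delta at time k + 1."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_steps + 1)   # V[0] (pre-stimulus) and V[-1] (post-reward) stay at 0
    deltas = np.zeros((n_trials, n_steps))
    for trial in range(n_trials):
        r = np.zeros(n_steps)
        r[-1] = float(rng.random() < p_reward)  # probabilistic unit reward at the end
        delta = r + V[1:] - V[:-1]              # delta(t) = r(t) + V(t) - V(t-1)
        deltas[trial] = delta
        V[1:-1] += alpha * delta[1:]            # V(t-1) += alpha * delta(t) for t >= 2
    return deltas

def asymmetric(delta, d):
    """Scale negative errors by d (assumed value), leaving positive errors unchanged."""
    return np.where(delta < 0, d * delta, delta)

deltas = simulate_td_errors(p_reward=0.5)[200:]   # discard early trials as "training"
psth_sym = deltas.mean(axis=0)                    # symmetric coding, averaged over trials
psth_asym = asymmetric(deltas, d=0.2).mean(axis=0)
print("symmetric :", np.round(psth_sym, 3))
print("asymmetric:", np.round(psth_asym, 3))
```

In this sketch the first entry of each average is the response to stimulus onset. With symmetric coding the later entries average to approximately zero, while with asymmetric coding they are positive and grow towards the time of the reward, even though no single trial contains a ramp.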
It is simple to derive the average response at the time of the reward (t = N) in trial T, i.e., the average TD error 〈δ_T(N)〉, from the TD learning rule with the simplified tapped delay-line time representation and a fixed learning rate α. The value at the next to last timestep in a trial, as a function of trial number (with initial values taken to be zero), is
$$V_T(N-1) = \alpha \sum_{t=1}^{T} (1-\alpha)^{T-t}\, r(t)$$
where r(t) is the reward at the end of trial t. The error signal at the last timestep of trial T is simply the difference between the obtained reward r(T) and the value predicting that reward, V_{T-1}(N-1). This error is positive with probability p_r, and negative with probability (1 - p_r). Scaling the negative errors by a factor of d ∈ (0, 1], we thus get
$$\langle \delta_T(N) \rangle = p_r \left(1 - V_{T-1}(N-1)\right) - d\,(1 - p_r)\, V_{T-1}(N-1) = (1-d)\, p_r\, (1 - p_r)$$
where the last equality takes a unit reward and a value that has converged, on average, to p_r.
For symmetric coding of positive and negative errors (d = 1), the average response is 0. For asymmetric coding (0 < d < 1), the average response is indeed proportional to the variance of the rewards, and thus maximal at p_r = 0.5. However, 〈δ_T〉 is positive, and concomitantly, the ramps are positive, and in this particular setting, are related to uncertainty, because of, rather than instead of, the coding of δ(t).
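The proportionality to the reward variance can be checked numerically. For a unit reward delivered with probability p_r, and a value that fluctuates around p_r, the average asymmetrically coded error at the time of reward should approach (1 - d) p_r (1 - p_r), a fraction (1 - d) of the Bernoulli reward variance. The sketch below tracks only the value that immediately precedes the reward; the learning rate, the scaling factor d = 0.2, the number of trials and the function name are all illustrative assumptions.

```python
import numpy as np

def mean_scaled_error(p_reward, d, alpha=0.1, n_trials=100_000, burn_in=1_000, seed=0):
    """Average asymmetrically coded error at the time of reward, after a burn-in."""
    rng = np.random.default_rng(seed)
    rewards = (rng.random(n_trials) < p_reward).astype(float)
    V = 0.0                              # value preceding the reward, initially zero
    scaled = np.empty(n_trials)
    for T, r in enumerate(rewards):
        delta = r - V                    # error at the last timestep of trial T
        scaled[T] = delta if delta >= 0 else d * delta
        V += alpha * delta               # fixed learning rate
    return scaled[burn_in:].mean()

d = 0.2  # assumed scaling of negative errors
results = {}
for p in (0.25, 0.5, 0.75):
    simulated = mean_scaled_error(p, d)
    predicted = (1 - d) * p * (1 - p)
    results[p] = (simulated, predicted)
    print(f"p_r = {p}: simulated {simulated:.3f}, predicted {predicted:.3f}")
```

Setting d = 1 in this sketch recovers the symmetric case, in which the average error is close to zero for every reward probability.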
Indeed, there is a key difference between the uncertainty and TD accounts of the ramping activity. According to the former, the ramping is a within-trial phenomenon, coding uncertainty in reward; by contrast, the latter suggests that ramps arise only through averaging across multiple trials. Within a trial, when averaging over simultaneously recorded neurons rather than trials, the traces should not show a smooth ramp, but intermittent positive and negative activity corresponding to back-propagating prediction errors from the immediately previous trials (as in Figure 2a).
Trace conditioning: a test case
Indeed, compared to delay conditioning, trace conditioning is notoriously slow, suggesting that the learning rate is low, and thus that there should be a lower ramp, in accord with the experimental results. A direct examination of the learning rate in the data of Morris et al. , whose task required excessive training as it was not only a trace conditioning one but also involved an instrumental action, confirmed it indeed to be very low (Genela Morris – personal communication, 2004).
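The dependence of the ramp on the learning rate can be explored with the same kind of simulation. The sketch below is illustrative only: it assumes a tapped delay-line representation, no discounting, a unit reward with p_r = 0.5, negative errors scaled by an assumed factor d = 0.2, and two arbitrary learning rates standing in for "slow" and "fast" learning. It does not model the stimulus-free interval of trace conditioning itself, and the function name is hypothetical.

```python
import numpy as np

def mean_asymmetric_psth(alpha, p_reward=0.5, d=0.2, n_steps=10,
                         n_trials=6000, burn_in=1000, seed=0):
    """Trial-averaged, asymmetrically coded TD error at each timestep after burn-in."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_steps + 1)   # V[0] (pre-stimulus) and V[-1] (post-reward) stay at 0
    total = np.zeros(n_steps)
    for trial in range(n_trials):
        r = np.zeros(n_steps)
        r[-1] = float(rng.random() < p_reward)
        delta = r + V[1:] - V[:-1]                # delta(t) = r(t) + V(t) - V(t-1)
        if trial >= burn_in:
            total += np.where(delta < 0, d * delta, delta)
        V[1:-1] += alpha * delta[1:]              # V(t-1) += alpha * delta(t) for t >= 2
    return total / (n_trials - burn_in)

slow = mean_asymmetric_psth(alpha=0.05)
fast = mean_asymmetric_psth(alpha=0.3)
ramp_slow = slow[1:-1].mean()   # mean activity during the delay (onset and reward excluded)
ramp_fast = fast[1:-1].mean()
print(f"delay-period mean: slow {ramp_slow:.4f}, fast {ramp_fast:.4f}")
print(f"reward-time mean:  slow {slow[-1]:.3f}, fast {fast[-1]:.3f}")
```

In this sketch the averaged activity during the delay is smaller for the lower learning rate, while the averaged response at the time of the reward is similar for both, in line with a weakened ramp but a persistent reward-time signal.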
The differential coding of positive and negative values by DA neurons is evident in all the studies of the phasic DA signal, and can be regarded as an inevitable consequence of the low baseline activity of these neurons. Indeed, the latter has directly inspired suggestions that an opponent neurotransmitter, putatively serotonin, be involved in representing and therefore learning the negative prediction errors, so that they also have full quarter. Here, however, we have confined ourselves to consideration of the effects of asymmetry on the trial-average analysis of the dopamine activity, and have shown that ramping DA activity, as well as an average positive response at the time of reward, result directly from the asymmetric coding of prediction errors.
Apart from a clearer view of the error signal, the most important consequence of the new interpretation is that the ramps can be seen as a signature of a TD phenomenon that has hitherto been extremely elusive. This is the progressive back-propagation of the error signal represented by DA activity, from the time of reward to the time of the predictor (Figure 2a). Most previous studies of dopaminergic activity have used p_r = 1, making this back-propagation at best a transitory phenomenon apparent only at the beginning of training (when, typically, recordings have not yet begun), and potentially hard to discern in slow-firing DA neurons. Further, as mentioned, the back-propagation depends on the way that the time between the predictive stimulus and the reward is represented – it is present for a tapped delay-line representation, but not for representations which span the entire delay. Note that the shape of the ramp also depends on the use of eligibility traces and the so-called TD(λ) learning rule (simulation not shown), which provide an additional mechanism for bridging time between events during learning. Unfortunately, as the forms of the ramps in the data are rather variable (Figure 1) and noisy, they cannot provide strong constraints on the precise TD mechanism used by the brain.
More recent studies involving persistent prediction errors also show activity suggestive of back-propagation, notably Figure 4 of . In this study, prediction errors resulted from periodic changes in the task, and DA recordings were made from the onset of training, thus back-propagation-like activity is directly apparent, although this activity was not quantified.
We expect the ramps to persist throughout training only if the learning rate does not decrease to zero as learning progresses. Pearce & Hall's theory of the control of learning by uncertainty suggests exactly this persistence of learning – and there is evidence from partial reinforcement schedules that the learning rate may be higher when there is more uncertainty associated with the reward. Indeed, from a 'rational' statistical point of view, learning should persist when there is substantial uncertainty about the relationship between predictors and outcomes, as can arise from the ever-present possibility of a change in the predictive relationships. This form of persistent uncertainty, together with uncertainty due to initial ignorance regarding the task, has been used to formalize Pearce & Hall's theory of the way that uncertainty drives learning. Thus, our claim that uncertainty may not be directly represented by the ramps should certainly not be taken to mean that its representation and manipulation are not important. To the contrary, we have suggested that uncertainty influences cortical inference and learning through other neuromodulatory systems, and that it also may determine aspects of the selection of actions.
Various other features of the asymmetry should be noted. Most critical is the effect of the asymmetry on DA-dependent learning, if the below-baseline DA activity is responsible by itself for decreasing predictions which are too high. In order to ensure that the learned predictions remain correct, we would have to assume that the asymmetric representation does not affect learning, i.e., that a mechanism such as different scaling for potentiation and depression of the synaptic strengths compensates for the asymmetric error signal. Of course, this would be rendered moot if an opponent neurotransmitter is involved in learning from negative prediction errors. This issue is complicated by the suggestion of Bayer that DA firing rates are actually similar for all prediction errors below some negative threshold, perhaps due to the floor effect of the low firing rate. Such lossy encoding does not affect the qualitative picture of the effects of inter-trial averaging on the emergence of ramps, but does reinforce the need for an opponent signal for the necessarily symmetric learning.
Finally, the most direct test of our interpretation would be a comparison of intra- and inter-trial averaging of the DA signal. It would be important to do this in a temporally sophisticated manner, to avoid problems of averaging non-stationary signals. In order to overcome the noise in the neural firing, and determine whether indeed there was a gradual ramp within a trial, or, as we would predict, intermittent positive and negative prediction errors, it would be necessary to average over many neurons recorded simultaneously within one trial, and furthermore neurons associated with similar learning rates. Alternatively, single neuron traces could be regressed against the backpropagation response predicted by their preceding trials and TD learning. A comparison of the amount of variability explained by such a model, compared to that from a regression against a monotonic ramp of activity, could point to the most fitting model. A less straightforward, but more testable, prediction is that the shape of the ramp should depend on the learning rate. Learning rates can be assessed from the response to the probabilistic rewards, independent of the shape of the ramp (Nakahara et al. showed in this way that, in their partial reinforcement trace conditioning task, the learning rate was 0.3), and potentially manipulated by varying the amount of training or the frequency with which task contingencies are changed and relearned. Indeed, quantifying the existence and shape of a ramp in Nakahara et al.'s recorded DA activity could well shed light on the current proposal.
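The proposed regression comparison can be sketched on synthetic data. Everything here is an assumption made for illustration: the single-trial "recordings" are generated by a tapped delay-line TD model with asymmetric coding (negative errors scaled by d = 0.2) plus Gaussian noise, and the learning rate, noise level and function names are hypothetical. Because the synthetic traces are generated by the TD model itself, the comparison favours that model by construction; the sketch only shows the form the analysis might take.

```python
import numpy as np

def simulate_td_errors(p_reward=0.5, n_steps=10, n_trials=700, alpha=0.3, seed=0):
    """Return TD errors of shape (n_trials, n_steps); column k holds delta at time k + 1."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_steps + 1)   # V[0] (pre-stimulus) and V[-1] (post-reward) stay at 0
    deltas = np.zeros((n_trials, n_steps))
    for trial in range(n_trials):
        r = np.zeros(n_steps)
        r[-1] = float(rng.random() < p_reward)
        delta = r + V[1:] - V[:-1]
        deltas[trial] = delta
        V[1:-1] += alpha * delta[1:]
    return deltas

def r_squared(y, x):
    """Fraction of variance in y explained by the least-squares line y ~ a + b * x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coef
    return 1.0 - residual.var() / y.var()

d, noise_sd = 0.2, 0.005                        # assumed scaling and noise level
rng = np.random.default_rng(1)
deltas = simulate_td_errors()[200:, 1:-1]       # delay period only, after training
predicted = np.where(deltas < 0, d * deltas, deltas)   # TD-predicted single-trial trace
observed = predicted + rng.normal(0.0, noise_sd, predicted.shape)
ramp = np.arange(predicted.shape[1], dtype=float)      # monotonic ramp regressor

r2_td = np.mean([r_squared(y, x) for y, x in zip(observed, predicted)])
r2_ramp = np.mean([r_squared(y, ramp) for y in observed])
print(f"mean R^2, TD back-propagation regressor: {r2_td:.2f}")
print(f"mean R^2, monotonic ramp regressor:      {r2_ramp:.2f}")
```

With real recordings, the TD regressor for each trial would instead be computed from the reward history of the preceding trials and a fitted learning rate, which is where the analysis becomes informative.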
We are very grateful to H. Bergman, C. Fiorillo, N. Daw, D. Joel, P. Tobler, P. Shizgal and W. Schultz for discussions and comment, in some cases despite varying interpretation of the data. We are particularly grateful to Genela Morris for analyzing her own published and unpublished data in relation to ramping. This work was funded by the EC Thematic Network (YN), the Gatsby Charitable Foundation and the EU BIBA project.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.