584 Matching Annotations
  1. Oct 2020
    1. episodic memory encoding and retrieval

      It 'recruits' the hippocampus?

    2. stimulus-locked P2-N2 complex and the response-locked ERN/CRN

      These are THE components! (ERN is also stimulus locked if you present a feedback screen?)

    3. may fail to communicate common features of these components

      Valid, but different goal?

    4. differentiated literature of action-monitoring ERPs

      valid

    5. may reflect disparate processes in unique cognitive circumstances, these aforementioned processes have all been specifically associated with ACC function during attention orientation and/or action selection

      This paper does not discredit the idea that different types of N2 and the reward positivity are epistemically different processes

    6. the integration of contextual cues with action selection to optimize goal-driven performance

      their conceptualization of ACC function


    1. These differences cannot be attributed to greater engagement of the NE system in the Active Experiment (rather than greater DA system engagement) because greater NE release would have produced a larger negativity to infrequent reward feedback

      This is how DA and NE are dissociated in terms of their predictions!

    2. in the Active Experiment, the raw N2 to infrequent reward feedback was significantly smaller than the raw N2 to infrequent reward feedback in both the Passive and Moderate Experiments, suggesting that greater DA system engagement resulted in a larger DA-associated positivity that attenuated the raw N2.

      So here DA is independent of NE

    3. interaction of reward condition and frequency condition such that the effect of reward was larger in the infrequent condition than in the frequent condition

      This is interesting! NE modulated DA so that it obscures N2 even more?

    4. dN2 in the Moderate Experiment trended toward being significantly larger than the dN2 in the Active Experiment,

      not significant


  2. Sep 2020
    1. identical task stimuli (colored faces) presented with identical task designs

      It is compelling that task set can yield such a difference -> it is definitely somewhat goal-related

    2. variable scalp distribution dependent on relative engagement of the different cortical areas giving rise to the ERP

      So then neither N2 nor P3 has a very easy-to-localize or 'centered' scalp distribution -> but the N2 does have a localizable distribution (ACC?)

    3. LC refractory period coincides with P3 generation

      So this is the devil stirring the pot

    4. increase in amplitude of the N2

      But it has also been theorized to be the source of the P300? They are deflections in opposite directions, how??


    1. By taking into account P3b versus P3a effects and latency information, it may be possible to consider surprise in the context of other mental states contributing to goal-oriented behavior

      If ACC is very goal-heavy in its processing signature this could be something to look into!

    2. cannot determine whether the P300 modulation was purely due to the surprise conveyed by the visual stimuli, or whether it was related to the response selection on each trial

      P300 in passive viewing, or related to giving the response (a motor preparatory process could very well play a role! would not happen in the case of passive viewing)

    3. unexpected changes in the world within the context of a task

      This seems like a very RL-model-esque claim though!

    4. P300 reflects the arrival of a phasic norepinephrine (NE) signal in cortical areas, which serves to increase signal transmission in the cortex

      This could be interesting, especially if it is an LC-NE system link! Pupillometry will add some info in that case, and read the papers about ACC-LC-NE

    5. compared our model to an alternative measure of surprise based on the Kullback–Leibler divergence.

      KL-divergence is what is used in the O'Reilly study right? So there are some differences here. There, there is a clear behaviourally relevant model update (which might be reflected in some other process, although perhaps not even EEG measurable, since fERN seems more concerned with expectations?) Expectations of errors can of course be part of a task-model...

    6. This evidence was used to compare competing models defined in terms of the explanatory variables in Z1.

      Also set up model comparison for different regressors

    7. identity of trials on which participants responded erroneously, trials that were rejected during the preprocessing of the EEG data, and a constant term.

      nuisance modeled at the same time!

    8. regressor models variance related to stimulus probability within a block and does not take into account any learning

      Simply assume the structure is known and use that to predict the P300

    9. quantified in terms of how much they change posterior beliefs.

      So if prediction error is used to update the model, this is also intuitive -> in fact, in an RL setting, the notion of 'surprise' as prediction error, and 'model update', are identical.

    10. Harrison et al. (2006), where the current event depended on the previous

      Look at their model for inspiration?

    11. D_{j-1}, can be used to predict the probability of each event occurring

      This is actually what can make an (ordinal?) prediction about the P300 amplitude!

    12. n_{kj} refers to the number of occurrences of outcome k up until observation j

      Natural way of updating the multinomial is by registering the counts in the Dirichlet
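      A minimal sketch of this bookkeeping (function names and the symmetric prior are my own assumptions, not the paper's code): the same Dirichlet counts give the predictive probability of each event, a predictive surprise, and a KL-based Bayesian surprise like the alternative measure mentioned above.

      ```python
      import numpy as np
      from scipy.special import gammaln, digamma

      def dirichlet_kl(a, b):
          """Closed-form KL( Dir(a) || Dir(b) )."""
          a0, b0 = a.sum(), b.sum()
          return (gammaln(a0) - gammaln(a).sum()
                  - gammaln(b0) + gammaln(b).sum()
                  + np.dot(a - b, digamma(a) - digamma(a0)))

      def surprise_trace(events, n_outcomes, alpha0=1.0):
          """Per event k_j: predictive surprise -log p(k_j | D_{j-1}) and
          Bayesian surprise KL(posterior || prior), updating Dirichlet counts."""
          alpha = np.full(n_outcomes, alpha0, dtype=float)
          out = []
          for k in events:
              p_k = alpha[k] / alpha.sum()       # predictive probability of event k
              new_alpha = alpha.copy()
              new_alpha[k] += 1.0                # register the count
              out.append((-np.log(p_k), dirichlet_kl(new_alpha, alpha)))
              alpha = new_alpha
          return np.array(out)

      # a rare outcome late in the stream scores high on both measures:
      trace = surprise_trace([0, 0, 0, 1, 0, 0, 2], n_outcomes=3)
      ```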

    13. time point at which the averaged P300s were modulated maximally

      Define the single-trial P300 amplitude at the peak latency of the averaged P300, separately for each participant.

    14. All stimuli occurred equally often over the course of the experiment

      Good counterbalancing

    15. not informed

      crucial?

    16. updating of task-relevant information in anticipation of subsequent events

      This seems very much like it could be an SR thing!

    17. P300 has commonly been linked to the revision of a participant’s expectation about the current task context

      Context -> do they mean structure or task-set (for reward outcome?)


    1. The present findings suggest that the variance in fERN amplitude across conditions results more from the effect of unpredicted positive feedback than from unpredicted negative feedback

      So negative prediction error does not yield the same attenuating impact on the N200 measurement as positive prediction error does for enhancing it?

    2. essential difference between conditions is associated with neural activity on correct trials rather than neural activity on error trials

      This is an important distinction to make - the N200 as a component shows mostly responsivity to infrequency. The original definition of the fERN as a difference wave against correct trials was interpreted as distinct from the N200, presumably because of corrective processes taking place after incorrect feedback. However, this study picks the N200 and the fRxx apart and shows that the discrepancy between the difference wave and the N200 is driven by what happens on infrequent correct feedback trials.

    3. error–oddball difference should be smaller than the correct–oddball difference

      Makes sense - if something else is happening on the correct feedbacks that makes them appear to lack the N200

    4. what causes the absence of the fERN/N200 on correct trials

      Exactly! The definition should then be the same as an oddball for the correct feedback

    5. infrequent targets in an oddball task and infrequent error feedback in a time estimation task both elicit a frontal or frontal-centrally distributed, negative-going component that reaches maximum amplitude at approximately 280–310 ms.

      Suggestive of coming from an identical source with identical function!

    6. infrequent oddball ERPs from the infrequent correct ERPs.

      N200 - correct hard condition -> also just an oddball in a sense

    7. subtracted the latency-corrected infrequent oddball ERPs from the infrequent error ERPs.

      N200 - fERN

    8. maximum negative value of the ERP recorded at channel FCz within a 150–350-ms window

      Latency correction between conditions
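      The peak-picking step is easy to make concrete; a rough sketch (channel choice and window follow the quote, the function name is mine):

      ```python
      import numpy as np

      def peak_negativity(erp, times, t_min=0.150, t_max=0.350):
          """Amplitude and latency of the most negative point of a 1-D ERP
          (e.g. channel FCz) within the 150-350 ms search window."""
          idx = np.where((times >= t_min) & (times <= t_max))[0]
          i_peak = idx[np.argmin(erp[idx])]      # most negative sample in the window
          return erp[i_peak], times[i_peak]
      ```

      Aligning each condition's ERP to its own peak latency before subtraction is then the latency correction referred to here.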

    9. virtual-ERNs

      Spatial PCA -> select fronto-central electrodes

    10. first by examining the ERPs directly, and second by performing a spatial principal components analysis (PCA)

      This is to get the scalp distribution - that allows comparing possible sources for the N200 and fERN
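      What the spatial PCA amounts to, sketched with sklearn (my choice of tooling, not necessarily the authors'): rows are time points stacked across conditions, columns are channels, and the component loadings are the scalp maps to compare.

      ```python
      import numpy as np
      from sklearn.decomposition import PCA

      def spatial_pca(erp_data, n_components=5):
          """erp_data: (n_timepoints_total, n_channels) voltages.
          Returns virtual-ERP time courses and (n_components, n_channels) scalp maps."""
          pca = PCA(n_components=n_components)
          scores = pca.fit_transform(erp_data)   # "virtual ERP" time courses
          return scores, pca.components_
      ```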

    11. difference between the ERP on correct trials and the N200 should be larger than the difference between the fERN and the N200

      If fERN == N200, subtractive difference between them should be very small. The difference with the correct trial ERP (fCRP) would be larger!

    12. it is equally possible that the difference between the ERPs on correct and incorrect trials arises from a process associated with correct trials rather than with error trials

      So that would mean the fERN is NOT some component that is elicited by errors specifically - actually it is elicited by the correct feedbacks?? => completely invalidates the idea of behavioural adjustment through ACC inhibitory release?

    13. entirely different ERP component as the source of the apparent variance in fERN amplitude.

      What exactly does this mean?

    14. fERN is elicited by unexpected negative feedback stimuli, but not by unexpected positive feedback stimuli

      So that is a clear distinction with the N200 - it is modulated by correctness vs. incorrectness? It was also modulated by expectation of the feedback...

    15. N200 increases in proportion to the unexpectedness of the event

      So that is a good candidate for SR vector error magnitude?

    16. we refer here to the negative deflection that is seen in so-called oddball tasks

      N200 is typical oddball ERP


    1. confound fERN amplitude with the P300

      Also something to watch out for!

    2. resulted from an increase in the amplitude of the fERN, rather than from overlap with a different ERP component (such as the P300).

      Scalp distribution used to identify the source (P300 vs fERN)

    3. interaction of valence with expectancy

      unexpected feedback leads to more modification of RT -> errors in easy biggest, correct in hard smallest

    4. no main effect of expectancy

      Pure expectancy of feedback does not modulate behaviour

    5. main effect of valence

      change in response time correlated with previous valence (correct v incorrect) -> duh because window size change?

    6. Participants made more errors in the hard condition (76%) than in the easy condition (23%)

      This is of course what is kind of ensured by that moving window approach! smart

    7. correct ERP in the easy condition from the error ERP in the hard condition

      Expected conditions subtracted -> again what is left?

    8. correct ERP in the hard condition from the error ERP in the easy condition

      In both cases this is the unexpected condition -> unexpectedness gets removed but what is left?

    9. correct feedback in the hard condition

      The ERN would be bigger for correct responses? -> That's funny, opposite of what the name indicates. But it is because it is modulated by what is expected => EXPECTATION error


    1. sensory events [80]

      OFC sensory prediction

    2. C was associated with X before any association with food

      And never after, so the SR error never propagated

    3. SR represents the association between the stimulus and food, and is also able to update the reward function of the food as a result of devaluation

      But the transitions are initially learned policy-dependently, which means inflexibly! This requires SR-Dyna-style updates, or an SR formed through undirected exploration?

    4. A reduced acquisition of conditioned responding to C and D, compared to F, which was trained in compound with a novel stimulus

      A was already directly associated with X -> AD and AC to X showed 'blocking' compared to EF, which included a novel stimulus

    5. shifts in value (amount of reward) and identity (reward flavour).

      So it seems the common idea here is: Change in the reward stimulus, not the state-state (prior to reward) transitions. You can change both reward value and identity and see if it has a modulatory effect using this model!

    6. although the RPE correlate has famously been evident in single units, representation of these more complex or subtle prediction errors may be an ensemble property.

      Perhaps some pattern analysis with fMRI would be able to say things about this...

    7. First, it naturally captures SPEs, as we will illustrate shortly. Second, it also captures RPEs if reward is one of the features.

      It can incorporate both the SPE of the SR and the RPE into one error signal?? -> would allow for cool modeling and dissociation tricks!

    8. expected TD error is then proportional to the superposition of feature-specific TD errors, Σ_j δM_t(j).

      This is a strong assumption but it is functional - Maybe we should incorporate such an encoding in our model/paper too?

    9. Although prediction errors are useful for updating estimates of the reward and transition functions used in model-based algorithms, these do not require a TD error.

      TD error is not useful for model-based updates, because these updates can be local and complete, at least as long as there are only local/short-term dependencies (e.g. Markov property in graph)

    10. building on the pioneering work of Suri [46], we argue that dopamine transients previously understood to signal RPEs may instead constitute the SPE signal used to update the SR

      Also a somewhat important reference!

    11. dopamine transients are necessary for learning induced by unexpected changes in the sensory features of expected rewards [37]

      Good reference

    12. sensitive to movement-related variables such as action initiation and termination

      What do these have to do with value? Modulation based on course of action could be a signal that options are present...! => option-specific value-function

    13. some dopamine neurons respond to aversive stimuli

      The exact opposite of what you would expect if response indicates appetite - Seems more like an (unsigned?) prediction error then

    14. anatomically segregated projection from midbrain to striatum

      So DA could have the state-state learning signal, but then it would be segregated from the value projections which run into PFC / ACC?

    15. value is affected by novelty [21] or uncertainty [22]

      Could be an actual modulator of value, or at least how much to learn from its encounter


    1. Finally, to test if the differences in these measures are sufficiently robust to allow categorisation, we trained a support vector machine (SVM) classifier to accurately predict trajectories as either model-free, model-based or SR agents (Fig 6A-B). When the decoder was given data from the biological behaviour, the SVM classifier most frequently predicted those trajectories to be an SR agent

      This is a very cool idea! Train classifier on generated RL agent data, and let it then classify a batch of real data -> to what is it most similar?
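      The logic of that decoding step, sketched with sklearn (the feature construction is a placeholder; in the paper these are summary measures of the trajectories):

      ```python
      import numpy as np
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      def classify_behaviour(X_sim, y_sim, X_real):
          """Train on agent-generated trajectory features (labels MF / MB / SR),
          then vote over trajectories from the biological behaviour."""
          clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
          clf.fit(X_sim, y_sim)
          votes = clf.predict(X_real)
          labels, counts = np.unique(votes, return_counts=True)
          return dict(zip(labels, counts))       # which agent class wins most often?
      ```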

    2. model-based algorithm was consistently the most successful

      But it was not most correlated with human/animal behaviour!

    3. attributes a role for dopamine in forming sensory prediction errors (Gardner et al., 2018) - similar to what has been observed experimentally (Menegas et al., 2017; Sharpe et al., 2017; Takahashi et al., 2017).

      This seems like some stuff to check out!


    1. This abs(RPE) signal correlated with a number of brain regions including an IPS locus anterior to where we found SPE correlates at p < 0.001 uncorrected (Figure S3). However, a direct comparison between SPE and abs(RPE) revealed a region of both posterior IPS that was significantly better explained by the SPE than by the abs(RPE) signal at p < 0.05 corrected, as well as a region of latPFC that showed a difference at p < 0.001 uncorrected

      So there is some degree of overlap: could it be dodgy?

    2. the degree to which pIPS encodes an SPE representation

      Behavioural relevance is established!

    3. participants had indeed acquired knowledge about the particular sequence of state transitions during the first session: 99.6% of permutation samples provided a poorer explanation of choices than the original (p = 0.004).

      So earlier latent learning CAN cause a state transition model to emerge -> very useful, no need to instruct

    4. If participants’ beliefs about the transition probabilities were updated by error-driven model-based learning (with a fixed learning rate, as assumed in FORWARD), this may have left a bias toward the most recently experienced transitions

      A 'lingering effect' of recent learning, just like the successive contradictory feedback in Holroyd and Coles 2002

    5. exponential decay from FORWARD to SARSA

      transition from MB to MF could be expected!

    6. volunteers were first exposed to just the state space in the absence of any rewards, much as in a latent learning design. This provides a pure assessment of an SPE

      Is this similar to what Danesh did?

    1. “replay” trajectories that the animal has taken in the past, as well as to follow novel trajectories that the animal has never before taken

      consolidation and integration?

    2. online planning process, perhaps similar to tree search or to online dynamic programming

      It does include forms of planning apparently

    3. Theta sequences typically occur when an animal is moving

      So if theta synchronizes with the ACC, it can keep updating its environmental model according to task progression?

    4. monkeys use a joystick to control a digital predator pursuing digital prey

      This is in a sense also a foraging task because it is food-seeking? But I don't see effects like depletion and opportunity costs

    5. dlPFC activity correlates with a “state prediction error”

      but the unsigned version - so it is not goal-specific (and could it even be used to update the SR in the right direction?)

    6. silencing OFC activity on a particular trial selectively impairs the influence of this expected outcome signal on learning but does not impair choosing.

      So the OFC really does seem to carry a learning signal as well!

    7. “inference”: the ability to combine separately-learned associations in order to form new associations between items that may never have been encountered together before

      This is basically association but for unseen combinations - shows a world model is abstracted from earlier knowledge

    8. associate stimuli or actions with specific expected outcomes

      This is something ACC does very well

    9. Multi-step planning can be thought of as a process that uses this map to guide an extended sequence of behavior towards a distant goal

      The cognitive map is related to the HPC, and this is why it keeps popping up in ACC research as well -> it is very tightly linked to decision making and to building models of distal relationships too!

    10. other lines
      1. Structure learning
      2. Foraging

      Both rely on models that can look at further horizons


    1. Notably, these signals were signed based on the subject’s goal, consistent with a mechanism for determining how much to update beliefs and in which direction: toward confirmation (positive) or reconsideration of one’s rewarding goal (negative)

      This seems like something very important to how ACC does model updating as well!

    2. but may not store the model locally

      This would be in the HPC, right?

    3. extracted the feedback-locked BOLD response in left lOFC (at 6 s post-feedback onset) at trial t and regressed this against the (signed) change to hippocampal CSS (i.e., the change in the difference between LC and HC presentations in [ipsilateral] left hippocampus from the preceding block t − 0.5 to the subsequent block t + 0.5)

      Cool parametric analysis - How did (signed) prediction error influence the difference between LC and HC representations? (do they move further apart or closer together)

    4. the unsigned D_KL term, corresponding to the magnitude of the belief update, independent of its direction, instead recruited a dorsal frontoparietal network, consistent with previous findings related to unsigned state prediction errors during latent learning

      So signing the D_KL term is important for identifying the ACC!? This means we have to include some sort of goal perhaps?

    5. stimulus-outcome update effects in lateral OFC/ventrolateral prefrontal cortex (VLPFC) and also a distributed network including anterior cingulate cortex, inferior temporal cortex, and posterior cingulate cortex

      Relevant: ACC is included in this! Hurray the experiment might be saved :D

    6. reduction in the BOLD response for HC items when compared to LC items

      Because of accurate prediction -> previous activation? Or because in HC the next stimulus was already 'explained away' in a predictive processing sense!

    7. and the other inferred based on the subject’s knowledge of the inverse relationship between stimuli and outcomes dictated by the task structure

      There is counterfactual updating possible in this task! Assumes internal model is accurate (but very likely)

    8. In each CSS block, each stimulus-outcome transition was presented once

      So a suppressed representation is yielded for the outcome variable, both for the common and the rare transition

    9. first select the more desired gift card goal based on the current potential payouts and then reverse-infer the stimulus they believed would most likely lead to that desired outcome

      So the reward is given, pps have to 'calculate' the optimal path -> separate reward function (given, unlearned) applied over learned model (OR SR!)

    10. but not about the reward amount obtained on a gift card

      Isolated from reward prediction error updating (?)

    11. advantageous to learn the transition probabilities

      This is important - emphasizing the structure of the problem

    12. reward-size-independent stimulus-outcome associations

      We have to assess if this mimics successor representations in some way

    13. little is known about how these different signals are used in the brain

      There are many types of learning signals - prediction errors for all kinds of domains of cognition, not just reward or sensory. However, only for striatal dopamine (RPE) is there some knowledge of how it affects behaviour / other neural computation


    1. facilitate the updating of internal models from which future action is generated (55)

      Then could it also perform the function of clustering task sets based on outcomes? We have seen based on contextual cues it doesn't - that seems like a HPC thing

    2. The current findings suggest a specific computational function for the ACC: It is involved in updating internal models to facilitate future information processing.

      A one-off update to guide future information processing -> it could happen through short term plasticity mechanisms?

    3. “reset” signal for internal models

      So following this theory, the model updating should actually have a positive effect on pupil diameter because of increased LC-NE activity?

    4. Participants’ internal models of the targets’ distribution could be expected to differ from the true (generative) distribution

      Exactly: So how do you determine the model used by participants?

    5. dwell time reflects updating

      it was also seen in the behavioural Behrens paper that pps were slower after jumps?

    6. no further behavioral cost on subsequent trials

      instant reprogramming?

    7. the two types could be easily distinguished

      minor confounding, the difference could also draw on some sort of bottom-up attentional process?


    1. mPFC may track changes in activity patterns in regions with community-based representational similarity, providing a signal that could underlie parsing decisions.

      mPFC uses specifically the predictive structure to identify boundaries?

    2. three-layer neural network model (Fig. 6a). The network took input representing the current stimulus and was trained to predict which stimulus would occur next.

      predictive RNN -> should lead to successor rep?
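      A toy version of that idea (my own minimal sketch, not the authors' architecture or hyperparameters): a small network trained to predict the next item; items with similar successor distributions end up with similar hidden-layer codes.

      ```python
      import numpy as np

      def train_next_item_net(sequence, n_items, n_hidden=20, lr=0.1, epochs=50, seed=0):
          """One-hidden-layer net, softmax cross-entropy on the next item.
          Returns each item's hidden representation."""
          rng = np.random.default_rng(seed)
          W1 = rng.normal(0, 0.1, (n_items, n_hidden))
          W2 = rng.normal(0, 0.1, (n_hidden, n_items))
          for _ in range(epochs):
              for s, s_next in zip(sequence[:-1], sequence[1:]):
                  h = np.tanh(W1[s])                             # hidden code of current item
                  logits = h @ W2
                  p = np.exp(logits - logits.max()); p /= p.sum()
                  d_logits = p.copy(); d_logits[s_next] -= 1.0   # cross-entropy gradient
                  W2 -= lr * np.outer(h, d_logits)
                  W1[s] -= lr * (W2 @ d_logits) * (1 - h ** 2)
          return np.tanh(W1)                                     # item embeddings
      ```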

    3. within the set of Hamiltonian paths, the probability of transitioning from one cluster boundary node (one of the pale nodes in Fig. 1a) to the adjacent one, if not yet visited, is always exactly

      otherwise the Hamiltonian walk cannot be made

    4. passage into a new cluster significantly more often than at other times

      behavioural validation of latent learning!

    5. set of possible successor items on each step depends only on the current item, this uniformity in transition probabilities holds whether one takes into account only the most recent item or the n most recent items

      True uniformity! Not some secret non-Markov transition probability influence

    6. items will fall close together in representational space when they are preceded and followed by similar distributions of items in familiar sequence

      i.e. have similar successor representations.

    7. non-uniform transition probabilities.

      In surprise signals, elevation would happen when a rare transition occurs

    8. judgments are quite reliable

      People give very consistent answers to when events change -> pretty interesting?

    9. different account,

      They claim temporal community structure is DIFFERENT from surprise based event segmentation

    10. transient elevations in predictive uncertainty or surprise as the primary signal driving event segmentation.

      Surprise is potentially signaled in ACC -> HER Model?


    1. reduced the total number of stimulus–stimulus transitions and thereby increased statistical power

      Clever experimental protocol

    2. the next day, subjects were presented with object sequences in the scanner

      So only the second test session was in the scanner!

    3. instructed to remember which of two buttons to press for a particular object orientation

      This is a simple behavioural measure

    4. prediction errors in the orbitofrontal cortex during active learning predict later changes in hippocampal representations of the stored model (Boorman et al., 2016)

      So it is OFC and not ACC?

    5. Indeed, we note that neural signals can be recorded in frontal and parietal cortices, reflecting the ‘state-prediction errors’ that ensue when predicted state relationships are breached during behavioural control (Gläscher et al., 2010)

      This seems very important

    6. We hypothesised that implicit knowledge about the graph structure would influence response times, such that subjects would respond faster if a preceding object in the test sequence was closer on the graph structure underlying the train sequence. Indeed, we found that log-transformed response times were longer the further away the preceding object was on the graph (Figure 5C, D)

      Cool behavioural validation!

    7. communicability significantly distorts the graph structure by shortening links that form part of many paths around the graph structure and lengthening links that would be less frequently visited by a random navigator

      So it is like a random exploration SR representation, but no need to fit gamma

    8. it was the symmetrised version alone that predicted the fMRI suppression effect

      It is not the actual number of experienced transitions between graph nodes -> this is a good validation for the representation of an abstract relational structure, and that it's not some sort of experiential or temporal effect

    9. was also present behaviourally

      look at how they check this

    10. fMRI adaptation paradigm

      So it is different from representational similarity analysis!

    11. can be read out directly from functional magnetic resonance imaging (fMRI) data in the entorhinal cortex

      Readable coding from entorhinal cortex -> how to compare with goal as shown by Yoo et al (the neural basis of predictive pursuit)? Should be the same, but entorhinal vs dACC is different structure...?


    1. SR in complementary learning systems, especially in the medial PFC and the hippocampus

      so this encapsulates ACC within it as well

    2. relative balance between ‘state prediction errors’ and ‘reward prediction errors’ may be used for arbitration in an MB–MF hybrid learner

      Talk about state prediction errors in MF-MB literature?

    3. we chose to linearly combine the ratings from MB and SR algorithms

      So SR sort of becomes the replacement for MF -> explains why, after proper learning / caching, increased alternative task demands don't have a detrimental effect on MB-ness?

    4. response times were slower in the transition revaluation condition compared with both the reward revaluation condition (t57 = 2.08, P < 0.05) and the control condition (t57 = 4.04, P < 0.00

      This could be a signature that some model-based planning is grinding the participants' mental gears

    5. reward revaluation

      Now, they should prefer the other starting state

    6. transition revaluation condition

      So in SR this would not work! The long-run state cache is updated slowly

    7. indicate which starting state

      This is the actual performance -> Do they understand how starting state relates to reward at end of trajectory?

    8. indicate their preference

      So they had some mild agency / incentive to pay attention to the transitional structure! Maybe we can have specific states in some community (B) pay out a small reward, and later validate this with choices from community A -> to which would you transition? -> choose bottleneck to community B

      Preference check is attentional / structure learning check

    9. Experiment 1 used a passive learning task, which permitted the simplest possible test of the theory, removing the need to model action selection.

      Passive learning task of the SR

    10. Specifically, they learn and store a one-step internal representation or model of the short-term environmental dynamics: specifically, a state transition function T and a reward function R

      So learning T is substantially different from learning M. Learning T might not be more difficult, but using it for planning will take more time than using M. This is the benefit of SR -> computational time at decision time
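      The decision-time cost difference can be made concrete (names and toy form are mine): under a fixed policy the model-based agent has to iterate T and R at choice time, whereas the SR agent only needs one matrix-vector product with its cached M.

      ```python
      import numpy as np

      def value_from_model(T, R, gamma=0.95, n_sweeps=100):
          """Model-based: repeated one-step backups with T (states x states) and R."""
          V = np.zeros(len(R))
          for _ in range(n_sweeps):
              V = R + gamma * T @ V
          return V

      def value_from_sr(M, R):
          """SR: V = M @ R, a single product with the cached successor matrix."""
          return M @ R
      ```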

    1. posterior dorsomedial striatum (Oh et al., 2014; Hintiryan et al., 2016), a region necessary for learning and expression of goal directed action as assessed by outcome devaluation

      So learning signals could also come from here?

    2. Likewise, neuroimaging in a saccade task in which subjects constructed and updated a model of the location of target appearance observed ACC activation when subjects updated an internal model of where saccade targets were likely to appear (O’Reilly et al., 2013)

      This is the update v no update oddball task

    3. neuroimaging in the Daw two-step task has identified representation of model-based value in the BOLD signal in anterior- and mid-cingulate regions

      This is where ACC relevance in two-step task comes in explicitly!

    4. In humans, extensive training renders apparently model-based behaviour resistant to a cognitive load manipulation (Economides et al., 2015) which normally disrupts model-based control (Otto et al., 2013), suggesting that it is possible to develop automatized strategies which closely resemble planning.

      This seems like a possible SR influence as well!

    5. ACC inhibition on model-based control because the effects would not survive multiple comparison correction for the large number of model parameters.

      So do not conclusively cite this study on anything specific

    6. fixed and are known to be so by the human subjects

      So with fixed transition probabilities people will learn the interaction effect in the logistic regression. However, when participants also have to learn the transitional structure, the logistic regression will show separate effects for transitions and rewards, and no interaction!

    7. The absence of transition-outcome interaction has been used in the context of the traditional Daw two-step task (Daw 2011) to suggest that behaviour is model-free. However, we have previously shown (Akam et al. 2015) that this depends on the subjects not learning the transition probabilities from the transitions they experience.

      So this might be a crucial theoretical problem for typical 2-step tasks?

    8. direct biases of choice

      Some patterns in choices are simple response biases, but can mimic MBRL to some extent

    9. Subjects learned the task in 3 weeks

      This is where the difference between human subjects and animals comes into play: Humans can be instructed to instantly grasp the task. Will that be a major issue in performing a similar setup with humans?

    10. Reversals in which first-step action (high or low) had higher reward probability, could therefore occur either due to the reward probabilities of the second-step states reversing, or due to the transition probabilities linking the first-step actions to the second-step states reversing.

      That does seem like a convincing dissociation that the brain might want to keep separate!

    11. except on transitions to neutral blocks, 50% of which were accompanied by a change in the transition probabilities

      So sometimes here we have a double update required?

    12. At block transitions, either the reward probabilities or the transition probabilities changed

      so you have a dissociation between updates on poke transitions and reward outcomes from these pokes, that's clever

    13. introduction of reversals in the transition probabilities mapping the first-step actions to the second-step states. This step was taken to preclude subjects developing habitual strategies consisting of mappings from second-step states in which rewards had recently been obtained to specific actions at the first step (e.g. rewards in state X → choose action x, where action x is that which commonly leads to state X). Such strategies can, in principle, generate behaviour that looks very similar to model-based control despite not using a forward model which predicts the future state given chosen action (

      Are these the Dezfouli and Balleine-style decisions they are referring to?

    14. block-based reward probability distribution

      Should promote task engagement how?

    15. in each second-step state there was a single action rather than a choice between two actions available, reducing the number of reward probabilities the subject must track from four to two

      This is how the 'burden' on participants is relieved - they can now focus more on learning the state transitions. But doesn't this identify the reached second stage with guaranteed reward, hence allowing direct encoding of reward to be a confound?

    16. model-based mechanisms which learn action-state transition probabilities and use these to guide choice.

      slightly different representation of the SR (in the PLOS paper it would be the H matrix of SR-Dyna?)

    17. optogenetic silencing of ACC neurons on individual trials reduced the influence of the experienced state transition on subsequent choice without affecting the influence of the trial outcome

      HEAVILY IMPLIES ACC AS LEARNING STATE-TO-STATE TRANSITIONS (signalling the error for SR updates?)

    18. in depth computational analysis (Akam et al., 2015)

      Curious

    19. developing a new version in which both the reward probabilities in the leaf states of the decision tree and the action-state transition probabilities change over time.

      This seems like a sensible approach - now also transitions have to be learned. My understanding was no other experimentalists really did this because it would make the setup too involved and difficult for participants to grasp.

    20. (ACC), a region expected to be centrally involved

      ACC is expected to be centrally involved in the two-step decision task?

    21. representing task contingencies beyond model-free cached values

      This is ALSO an SR feature!

    22. Firstly, the ACC provides a massive input to posterior dorsomedial striatum (Oh et al., 2014; Hintiryan et al., 2016), a region critical for model-based control as assessed through outcome-devaluation

      Outcome devaluation is also present in SR however!

    1. failures to flexibly update decision policies that are caused by caching of either the successor representation (as in SR-TD or SR-Dyna with insufficient replay) or a decision policy (as in SR-MB) should be accompanied by neural markers of non-updated future state occupancy predictions

      Good basis for some kind of experiment?

    2. unlike the hippocampus, parts of the PFC appear to be involved in action representation in addition to state representation

      This could be relevant for the SR-Dyna model where there is an H(a,s) matrix

    3. value weights would be learned by neurons connecting the hippocampus to ventral striatum, in the same TD manner discussed in this paper

      Or perhaps a detectable error with ERP produced in ACC?

    4. fMRI measures of the representation of visual stimuli in tasks where such stimuli are presented sequentially

      Paradigms to measure SR with fMRI!

    5. [74] demonstrated that a sophisticated representation that includes reward history can produce model-based-like behavior in the two-step reward revaluation task

      Could we potentially decode the first-stage action at second-stage decision time from ACC? Maybe when trained with an RNN?

    6. more sophisticated state representations

      This is always going to be an ill-defined potential confound in any RL research I believe

    7. prediction-error-related BOLD signals in humans

      How do we find this? :D

    8. instead stored in prefrontal cortex

      Would make sense as the seat of planning etc

    9. successor matrix updated by SR-Dyna might itself exist in the recurrent connections of hippocampal neurons

      There is already a paper on this?

    10. SR-Dyna can support rapid action selection by inspecting its lookup table

      It basically does tree-search beforehand, and caches the values it finds from the tree search, so they can be applied directly at task-time

    11. This simulation demonstrates that SR-Dyna can thus produce behavior identical to “full” model-based value iteration in this task

      And it is still computed using DA-plausible TD techniques. Only the representational space gets more and more complex, and the replay mechanic gets added.

    12. recurrent neural networks offer a simple way to compute Mπ(s,:) based on spreading activation implementing Eq 11.

      This is beautiful information for us!

    13. SR-MB cannot solve the novel “policy” revaluation task

      But it also depends on its exploration strategy if this is continuously active

    14. SR-TD and SR-MB are thus “on-policy” methods – their estimates of Vπ can be compared to the estimates of a traditional model-based approach

      So they do not necessarily generalize to other policies! Though M learned under a uniform, fully random policy should constitute the accurate transition model?

    15. SR-MB learns a one-step transition model, Tπ, and uses it, at decision time, to derive a solution to Eq 9

      With Eq. 9 being the solution for M. So in essence it is a 'double' SR? It holds a model of the long-run expectancies in M, but one composed from the one-step expectancies learned in T
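      If Eq. 9 is read the standard way, the derivation SR-MB performs at decision time is the closed form M = (I − γTπ)⁻¹, which is what the 'double SR' reading above amounts to. A small sketch with a toy Tπ of my own:

      ```python
      import numpy as np

      def sr_from_transition(T_pi, gamma=0.95):
          """Derive the successor matrix from a learned one-step,
          policy-dependent transition matrix: M = (I - gamma * T)^-1."""
          n = T_pi.shape[0]
          return np.linalg.inv(np.eye(n) - gamma * T_pi)

      # toy 3-state ring under a deterministic 'move clockwise' policy:
      T = np.array([[0., 1., 0.],
                    [0., 0., 1.],
                    [1., 0., 0.]])
      M = sr_from_transition(T)   # long-run discounted expected state occupancies
      ```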

    16. combine the SR with function approximation and distributed representations

      Function approximation == Neural Networks
      Distributed Representations == What we are looking into!

    17. Crucially, despite the functional similarity between this rule and the TD update prescribed to dopamine, we do not suggest that dopamine carries this second error signal

      So no error detectable with ERP technique will probably guide us in this direction of SR-TD? Depends on content of paper in ref 55.

    18. standard dopaminergic TD rule

      This speaks for the benefits of SR-TD

    19. agent is allowed to experience this change only locally

      Local exploration of introduced blockade -> correct representation only for entries that were experienced

    20. Because Mπ reflects long-run cumulative state occupancies, rather than the individual one-step transition distribution, P(s’|s,a), SR-TD cannot adjust its valuations to local changes in the transitions without first updating Mπ at different locations.

      A 'smarter' algorithm could evaluate M completely if it learns a local change, which could even be learned through TD (only relevant vector s -> s').

    21. SR-TD can, without further learning, produce a new policy reflecting the shortest path to the rewarded location

      And then, using the M(π) learned from the random policy and the rewarded state learned by direct placement, it can compute the shortest path from any random initialization!

    22. first explores the grid-world randomly, during which it learns the successor matrix Mπ corresponding to a random policy

      It can learn M(π) according to a random policy without any reward entering the system ever!

    23. γMπ(s′,:)

      So it makes M(s,:) a little bit more similar to M(s',:)? But it will become an aggregate of all possible s' since you could transition to any (with a certain probability) -> weighted exactly by the transition structure! The weighting happens slowly over time, hence the source of inflexibility to transition changes.
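      For reference, the TD-style update being described, in standard SR-TD form (my notation): M(s,:) is nudged toward onehot(s) + γM(s′,:), so over time each successor row is blended in with exactly the probability of transitioning there.

      ```python
      import numpy as np

      def sr_td_update(M, s, s_next, alpha=0.1, gamma=0.95):
          """One SR-TD step after observing s -> s_next:
          delta = onehot(s) + gamma * M[s_next, :] - M[s, :]."""
          target = np.eye(M.shape[0])[s] + gamma * M[s_next]
          M[s] += alpha * (target - M[s])
          return M
      ```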

    24. The three models

      So actually, for the learning of M, three different models are proposed in this paper! There is not simply one learning algorithm involved in this SR thing.

    25. Tπ is the one-step state transition matrix that is dependent on π

      Policy dependence = online?

    26. approximation will be correct when the weight w(s′) for each successor state corresponds to its one-step reward, averaged over actions in s

      Why average over the actions and not pick the max?

    27. learned reward values in ventral striatum [53]

      Already a learning signal for successor representations as potentially seen in HPC?

    28. Doya [51] introduced a circuit by which projections via the cerebellum perform one step of forward state prediction, which activates a dopaminergic prediction error for the anticipated state

      State prediction error coming from cerebellum?

    29. and neuroimaging of prediction error signals in human striatum [6]

      Detecting prediction errors with fMRI?

    30. thus perhaps involving analogous (striatal) computations operating over distinct (cortical) input representations [47].

      Would make sense; TD errors and other learning signals could be used by many systems for many different kinds of learning or updating

    31. Typically, model-based methods are off-policy (since having learned a one-step model it is possible to use Eq 2 to directly compute the optimal policy); whereas different TD learning variants can be either on- or off-policy.

      Model-based is considered off-policy, because we have a pretty complete representation of the environment, which is not biased by taken actions. If the relevant mappings of actions make a difference for the estimation, it would be considered on-policy.

    32. but also in the more flexible choice adjustments that seem to reflect model-based learning

      E.g. error updating from counterfactuals, which suggests model-based inference of reward that was not obtained!


    1. In previous versions, subjects at each stage chose between two symbols instead of two fixed actions and the symbols moved from side to side at each trial ensuring there was no consistent mapping between the button presses and the symbols

      This could be a major difference for the paradigm! Now the actions are mapped directly and not through an environmental link

    2. decisions that are insensitive to (i) the values of the outcomes [8] and (ii) the contingency between specific actions and their outcomes

      Clear statement that HRL sequences can become insensitive to the transition structure of the problem at hand

    3. habitual (when action sequences are selected) and goal-directed (when single actions are selected) action

      This is a different 'habitual' than the MF-RL implied habit

    4. a first stage action is the best action

      Here they do analyze the difference in expected reward from different first-stage actions. The Danesh paper dismissed this as intractable for participants

    5. with a small probability (1:7), the rewarding probability of each key changed randomly to either the high or low probability.

      So not the slow drifting as in Daw and in Danesh paper
