
proceedings.neurips.cc

denote dimensions 0 through i − 1 of the state
Very odd/interesting! The dimensions are independent, but we are predicting them in order?

τ<t to denote a trajectory from timesteps 0 through t − 1
τ<t is shorthand for all the previous s_0, a_0, …, s_{t−1}, a_{t−1}, etc.

lower-diagonal attention mask
Why lower-diagonal?

Transformer architectures feature a “causal” attention mask to ensure that predictions only depend on previous tokens in a sequence
Causal is in quotes here for a good reason. It is called a causal attention mask in the LLM literature, but it only has to do with the probability of the next token/word. It isn't attached to the meaning of the words at all.
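A minimal sketch of what such a mask looks like (a toy 4-token NumPy example; pushing masked logits to -inf before the softmax is the standard trick):

```python
import numpy as np

# Lower-triangular ("causal") mask: token i may attend to tokens 0..i only.
T = 4
mask = np.tril(np.ones((T, T), dtype=bool))

# Apply to attention logits: masked-out positions get -inf, so the softmax
# assigns them exactly zero weight.
logits = np.random.randn(T, T)
masked = np.where(mask, logits, -np.inf)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# The first token can only attend to itself; every row still sums to 1.
```

Nothing here knows anything about the meaning of the tokens, which is exactly the point of the note above: "causal" only refers to the left-to-right factorization of the sequence probability.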

We can use this directly as a goal-reaching method by conditioning on a desired final state sT.
Interesting: goal-directed RL cast as a sequence of samples from conditional probabilities.

If we set the predicted sequence length to be the action dimension, our approach corresponds exactly to the simplest form of behavior cloning with an autoregressive policy
Why is that? Because the sample of actions will be a proper sample? Why would the sequence length ever be larger, then?

Pθ(· | x)
Where does the distribution come from initially? Empirical?

Uniform discretization has the advantage that it retains information about Euclidean distance in the original continuous space, which may be more reflective of the structure of a problem than the training data distribution.
Always important to consider whether the relative magnitudes between points matter.
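The difference can be seen in a small sketch (illustrative NumPy; the vocabulary size and the skewed data are made up, not the paper's setup):

```python
import numpy as np

V = 10  # tokens per dimension (illustrative)
# Skewed 1-D data: most mass near 0, a small cluster near 5.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 0.1, 900), rng.normal(5, 0.1, 100)])

# Uniform: equal-width bins over the observed range -- preserves Euclidean
# structure, but most bins sit empty for skewed data.
uniform_edges = np.linspace(x.min(), x.max(), V + 1)[1:-1]

# Quantile: equal-mass bins -- adapts to the training-data distribution,
# but bin width no longer reflects distance in the original space.
quantile_edges = np.quantile(x, np.linspace(0, 1, V + 1))[1:-1]

tokens_uniform = np.digitize(x, uniform_edges)
tokens_quantile = np.digitize(x, quantile_edges)
```

On this data the uniform scheme uses only a few of its 10 tokens, while the quantile scheme uses all of them; which trade-off is right depends on whether Euclidean structure matters, as the note says.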

modeling considerations are concerned less with architecture design and more with how to represent trajectory data – potentially consisting of continuous states and actions – for processing by a discrete-token architecture
They don't care what kind of transformer is being used, they are interested in how to get SASASASA into the right form.
good question: what about continuous states and/or actions?

Concurrently with our work, Chen et al. (2021) also proposed an RL approach centered around sequence prediction, focusing on reward conditioning as opposed to the beam-search-based planning used by the Trajectory Transformer.
This is the Decision Transformer paper we read last week

Modeling the states and actions jointly already provides a bias toward generating in-distribution actions, which avoids the need for explicit pessimism
Pessimism is a popular method to avoid (overfitting?) the learned dynamics to what you saw. Since transformers maintain a huge context, this isn't needed; the predictions will always be tied to the same situations as in the training data.

model-based RL
learn the dynamics, then optimize via RL

estimate conditional distributions over actions
policy as a distribution over actions

While such works demonstrate the importance of such models for representing memory (Oh et al., 2016), they still rely on standard RL algorithmic advances to improve performance
is the sequence modeling for just learning the model or is it deeper?

The Trajectory Transformer is a substantially more reliable long-horizon predictor than conventional dynamics models
So the TT becomes a new type of model based RL

When decoded with a modified beam search procedure that biases trajectory samples according to their cumulative reward,
So beam search is just a decoder of the learned dynamics that optimizes for reward?
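A toy sketch of that idea (pure Python; `step_fn` stands in for the learned sequence model, returning candidate next tokens with per-step rewards — the names and scoring rule are illustrative, not the paper's exact procedure):

```python
def beam_search(step_fn, start, beam_width=3, horizon=4):
    """Expand sequences with the model, but rank beams by cumulative reward."""
    beams = [(list(start), 0.0)]  # (token sequence, cumulative reward)
    for _ in range(horizon):
        candidates = []
        for seq, total in beams:
            for token, reward in step_fn(seq):
                candidates.append((seq + [token], total + reward))
        # Keep the top sequences by reward-to-date, not by likelihood alone.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy dynamics: token 0 always yields reward 1.0, token 1 yields 0.5.
best_seq, best_reward = beam_search(lambda seq: [(0, 1.0), (1, 0.5)], [], 2, 3)
```

So yes: the model proposes continuations, and beam search is just the decoder, repurposed to optimize cumulative reward instead of sequence likelihood.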

 Nov 2023

proceedings.mlr.press

Reading this one on Nov 27, 2023 for the reading group.


proceedings.neurips.cc

Reading this one on Nov 27, 2023 for the reading group.


arxiv.org

K = 50 for Pong, K = 30 for others
**Q:** Where did these numbers come from?

loss = mean((a_preds - a)**2)
supervised learning for RL task

We feed the last K timesteps into Decision Transformer, for a total of 3K tokens (one for each modality: return-to-go, state, or action)
Data: K timesteps with three tokens per timestep:
- return-to-go token
- state token
- action token

Token embedding for each token: a linear (or convolutional) layer to learn, then normalize.
Timestep embedding: an embedding of the time index itself, adjusting for the 3x? Question: added or concatenated? Is the timestep embedding applied to the raw tokens or to the embedding?
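A sketch of one plausible reading of that token layout (NumPy; the dimensions, linear maps, and the ADDED timestep embedding are illustrative assumptions, not the paper's code):

```python
import numpy as np

K, d = 4, 8          # context length and embedding size (illustrative)
rng = np.random.default_rng(0)

rtg = rng.normal(size=(K, 1))     # scalar return-to-go per timestep
state = rng.normal(size=(K, 3))   # toy 3-dim states
action = rng.normal(size=(K, 2))  # toy 2-dim actions

# One learned linear embedding per modality, plus one embedding per time index.
W_rtg = rng.normal(size=(1, d))
W_s = rng.normal(size=(3, d))
W_a = rng.normal(size=(2, d))
time_emb = rng.normal(size=(K, d))

# Interleave (return-to-go, state, action) per timestep: 3K tokens total.
# Here the timestep embedding is ADDED to each of its three tokens, so the
# same time index is shared across the triple.
tokens = np.empty((3 * K, d))
for t in range(K):
    tokens[3 * t + 0] = rtg[t] @ W_rtg + time_emb[t]
    tokens[3 * t + 1] = state[t] @ W_s + time_emb[t]
    tokens[3 * t + 2] = action[t] @ W_a + time_emb[t]
```

This takes the "added, per time index, after the learned token embedding" answer to the questions above; the paper would be the place to confirm that choice.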

This suggests that in scenarios with relatively low amounts of data, Decision Transformer can outperform %BC by using all trajectories in the dataset to improve generalization, even if those trajectories are dissimilar from the return conditioning target. Our results indicate that Decision Transformer can be more effective than simply performing imitation learning on a subset of the dataset. On the tasks we considered, Decision Transformer either outperforms or is competitive to %BC, without the confound of having to select the optimal subset
So it seems like it isn't just behaviour cloning

Does Decision Transformer perform behavior cloning on a subset of the data?
good questions

we use the GPT architecture [9], which modifies the transformer architecture with a causal self-attention mask to enable autoregressive generation, replacing the summation/softmax over the n tokens with only the previous tokens in the sequence (j ∈ [1, i]).
This sentence is working hard.

this allows the layer to assign “credit” by implicitly forming state-return associations via similarity of the query and key vectors (maximizing the dot product)
that's a different way of thinking about what's happening in a transformer.


arxiv.org

We then use a similar QA summarization framework as Wu et al. (2023) which produces QA dialogue on game mechanics
Q: what was the main focus of this paper?
A: "Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals"
Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual

LaTeX source code
Q: why are they using the source code and not the text output?

all prior works require expert or human generated example trajectories
Training the LLMs using generated trajectories.

Wu et al. (2023) proposes a summary (Read) and reasoning (Reward) through a QA prompting framework with an open-source QA LLM Tafjord and Clark (2021). The framework demonstrates the possibility of using real-world human-written manuals to improve RL performance on popular games, despite limiting the interaction types to only “hit”. Our framework handles all 17 kinds of interactions available in the game. Moreover, our framework makes use of information on tech-tree dependencies, and suggestions on desired policies extracted from the academic paper
Main paper they are based on.

Indicate their priority out of 5
Q: Where does "priority" even come from for the LLM for a domain like this? What prior knowledge and biases are built in here?

The visual descriptor takes the last two gameplay screens as input, and outputs their descriptions in language (dt, dt−1)
Q: so does the language it uses internally keep changing?

Answer to the final question qa is mapped to environment action using substring matching.
Q: is this explained in more detail anywhere?
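It probably amounts to something this simple (the action names and fallback are invented for illustration, not the paper's actual list):

```python
# Illustrative Crafter-like action names; matched longest-first so that
# specific actions win over short ones like "do".
ACTIONS = ["make_wood_pickaxe", "place_stone", "move_left", "move_right",
           "sleep", "do"]

def answer_to_action(answer: str) -> str:
    normalized = answer.lower().replace(" ", "_")
    for act in sorted(ACTIONS, key=len, reverse=True):
        if act in normalized:
            return act
    return "do"  # assumed fallback when nothing matches

print(answer_to_action("I should Make Wood Pickaxe now"))  # make_wood_pickaxe
```

The fragility is obvious: any phrasing the substring test misses silently falls through to the fallback, which may be why the paper doesn't dwell on it.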

Experimentally, we find that prompting the LLM with only the direct parents of a question greatly reduces the context length, and helps the LLM to focus on the most relevant contextual information
Interesting: What is being given up here? You need to cut or summarize context at some point for sure. But when?

model-based methods like DreamerV2 (Hafner et al., 2020); DreamerV3 (Hafner et al., 2023)
Summary: how do these methods work?

We add the prompt “DO NOT answer in LaTeX.” to all of Qgame to prevent the LLM from outputting the list in LaTeX format
Does GPT-3.5 understand LaTeX that well?

in an environment where control tasks are less required
Q: what do they mean by this?

zero-shot LLM-based (GPT-4) policy
What does "zero-shot" mean when it involves an LLM?

we promote and regulate in-context chain-of-thought reasoning in LLMs to solve complex games. The reasoning module is a directed acyclic graph (DAG), with questions as nodes and dependencies as edges. For example, the question “For each action, are the requirements met?" depends on the question “What are the top 5 actions?", creating an edge from the latter to the former. For each environment step, we traverse the DAG computing LLM answers for each node in the topological order of the graph. The final node of the DAG is a question about the best action to take and the LLM answer for the question is directly translated to environment action
seems sensible
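The traversal itself is straightforward; a sketch with Python's standard-library `graphlib` (the three question nodes and the `ask_llm` stub are invented for illustration):

```python
from graphlib import TopologicalSorter

# Map each question to the set of questions it depends on (its direct parents).
dag = {
    "top_actions": set(),
    "requirements_met": {"top_actions"},
    "best_action": {"requirements_met"},
}

def ask_llm(question, parent_answers):
    # Stand-in for the real LLM call; the prompt would include only the
    # answers of the direct parents, keeping the context short.
    return f"answer({question})"

answers = {}
for q in TopologicalSorter(dag).static_order():
    answers[q] = ask_llm(q, {p: answers[p] for p in dag[q]})

final_action = answers["best_action"]  # translated to an environment action
```

Topological order guarantees every parent is answered before its children, which is all the "regulated chain-of-thought" structure requires.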

deciding the paragraphs that are relevant for playing the game
this could be very subjective

the environment is OOD to them.
Translation: the Crafter game is too new for GPT to know about

 Oct 2023

arxiv.org

In a nutshell, the CHT seems to disprove the scaling hypothesis. Or does it? In this work, we argue that foundation models might be exploiting a “loop hole” in the CHT. Namely, what happens if the causal assumptions (which are required, by the CHT, for causal inference) are represented in observational data itself?
Are LLMs exploiting a loophole in Pearl's ladder?
It's not really a loophole; it's just that the observational dataset explicitly contains answers to your interventional queries.

Plato. Republic: Allegory of the cave, 375 BC
ok, you win.

Same Implication, Different Representations
Big Question: they cover text and experiment, but what about embodied experience? What is its role? We believe in causality for very visceral (i.e. physical and unavoidable) reasons as human beings.
e.g. we touch a hot stove and then it hurts

we expect P(Y_{X←1} = 1) = P(Y = 1) since intervening on X will not change Y
Q: Is that correct? Wouldn't you need to show the \(X\leftarrow 0\) case to demonstrate this?

the probability of a high number of Nobel laureates if the given chocolate consumption were to be high.
Example of an L2 interventional query.
Q: For this query \(P(Y_{X\leftarrow 1}=1)\), wouldn't the more correct English translation be:
"The probability of having a high number of Nobel laureates if high chocolate consumption was made mandatory."
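To make the contrast concrete, the two rungs can be written side by side (standard Pearl notation; the subscript form and the do-operator form are equivalent):

```latex
% L1 (observational): the probability of many laureates GIVEN THAT we
% observe high chocolate consumption
P(Y = 1 \mid X = 1)

% L2 (interventional): the probability of many laureates IF chocolate
% consumption were SET high, regardless of its usual causes
P(Y_{X \leftarrow 1} = 1) \;=\; P\bigl(Y = 1 \mid do(X = 1)\bigr)
```

The "made mandatory" translation above captures exactly the \(do(\cdot)\) reading: the intervention severs X from its usual causes before we ask about Y.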

We call these concepts ‘meta’ since they are one level above ‘regular’, simple SCM in the sense that they encode information about answering causal questions in another SCM.
keep reading this sentence until it makes sense...or argue why it doesn't make sense

More intriguingly, it does not matter where that L2 fact comes from since the formulation is independent of whether the model learns the fact and simply requires that the model knows about the fact. We state our second key insight as
good point to remember, we don't need to learn everything, some knowledge can be encoded directly, a priori.

Example 1 serves to show how the rather abstract definition of an SCM can be made tangible to communicate what we believe about our observed data and more so the underlying data generating process.
Does everyone agree that it's crystal clear now? (maybe not...)

The Pearl’s Causal Hierarchy
An important theoretical framework to read up on if you aren't familiar with it.

It is clear how the observed correlation in this case corresponds to a direct causation according to
We should draw these models out

These models are castles in the air. They have no foundations whatsoever.” discrediting the models for lacking any identifiable notion of causality.
discussion: Do we really need to just pick one of these options?

Our explanation for this is that they are not only ‘stochastic parrots’ as already suggested by Bender et al. (2021) but sometimes also ‘causal parrots’ since they will also encounter correlations over causal facts during training in their vast oceans of textual data.
Q: what was Bender's argument exactly?

parameterized variants of SCMs such as the neural ones presented in (Xia et al., 2021
to read: this sounds interesting

meta SCM
Q: definition needed

However, this conclusion is arguably nothing new, as most people would agree, and this is partly so because such obtained knowledge has been embedded as textual articles into encyclopedias such as Wikipedia, which are freely accessible
Bit strange: this sounds like they are saying people know this because of wikipedia, rather than from lived experience.

ℙ_E denotes the exogenous distribution
Q: Can we get a definition of this?

to our real-world intuition since there is a bidirected edge X ↔ Y ∈ G(M2) with E3 being the underlying confounder
**Intuition:** whatever explains GDP, which we call E3, also explains X and Y.

The following block paragraph serves as a summary
question: where does this paragraph come from? who wrote it?

we take the former perspective, pro causal AI/ML. We argue that the questions around causality can fuel research also on questions of recent debates such as how much ‘real’ progress towards AGI has been made since the advent of large scale models
I would agree with this stance!

countering opinions start to speak out against causal AI/ML (Bishop, 2021)
Should we read this paper as well? Is there an updated paper or opinion piece from these researchers about why causal AI/ML isn't needed?

Zecevic, Willig, Singh Dhami and Kersting. "Causal Parrots: Large Language Models May Talk Causality But Are Not Causal". In Transactions on Machine Learning Research, Aug, 2023.


arxiv.org

Introduction of RoBERTa, an improved analysis and training approach for BERT NLP models.


arxiv.org

(Chen, NeurIPS, 2021) Chen, Lu, Rajeswaran, Lee, Grover, Laskin, Abbeel, Srinivas, and Mordatch. "Decision Transformer: Reinforcement Learning via Sequence Modeling". arXiv preprint arXiv:2106.01345v2, June, 2021.
Quickly became a very influential paper, with a new idea of how to learn generative models of action prediction using SARSA-style sequence training from demonstration trajectories. No optimization of actions or rewards, but the target reward is an input.


proceedings.mlr.press

Kallus, N. (2020). DeepMatch: Balancing deep covariate representations for causal inference using adversarial training. In I. H. Daumé, & A. Singh (Eds.), Proceedings of the 37th international conference on machine learning. In Proceedings of Machine Learning Research: vol. 119 (pp. 5067–5077). PMLR

Using adversarial deep learning approaches to get a better correction for causal inference from observational data.


arxiv.org

"Causal Deep Learning". Authors: Jeroen Berrevoets, Krzysztof Kacprzyk, Zhaozhi Qian, Mihaela van der Schaar
A very general and ambitious approach for representing the full continuous conceptual spectrum of Pearl's Causal Ladder, with the ability to model and learn parts of it from data.
Discussed by Prof. van der Schaar at the ICML 2023 workshop on Counterfactuals.


www.nature.com

Performing optimization in the latent space can more flexibly model underlying data distributions than mechanistic approaches in the original hypothesis space. However, extrapolative prediction in sparsely explored regions of the hypothesis space can be poor. In many scientific disciplines, hypothesis spaces can be vastly larger than what can be examined through experimentation. For instance, it is estimated that there are approximately 10^60 molecules, whereas even the largest chemical libraries contain fewer than 10^10 molecules [12,159]. Therefore, there is a pressing need for methods to efficiently search through and identify high-quality candidate solutions in these largely unexplored regions.
Question: how does this notion of hypothesis space relate to causal inference and reasoning?

Wang et al. "Scientific discovery in the age of artificial intelligence", Nature, 2023.
A paper about the current state of using AI/ML for scientific discovery, connected with the AI4Science workshops at major conferences.
(NOTE: since Springer/Nature don't allow public PDFs to be linked without a paywall, we can't use Hypothesis directly on the PDF of the paper; this link is to the website version, which is what we'll use to guide discussion during the reading group.)

Petersen, B. K. et al. Deep symbolic regression: recovering mathematical expressions from data via risk-seeking policy gradients. In International Conference on Learning Representations (2020).
Description: Reinforcement learning uses neural networks to generate a mathematical expression sequentially by adding mathematical symbols from a predefined vocabulary and using the learned policy to decide which notation symbol to be added next. The mathematical formula is represented as a parse tree. The learned policy takes the parse tree as input to determine what leaf node to expand and what notation (from the vocabulary) to add.

Reinforcement learning uses neural networks to generate a mathematical expression sequentially by adding mathematical symbols from a predefined vocabulary and using the learned policy to decide which notation symbol to be added next [140]. The mathematical formula is represented as a parse tree. The learned policy takes the parse tree as input to determine what leaf node to expand and what notation (from the vocabulary) to add
very interesting approach

In chemistry, models such as simplified molecular-input line-entry system (SMILES) VAE [155] can transform SMILES strings, which are molecular notations of chemical structures in the form of a discrete series of symbols that computers can easily understand, into a differentiable latent space that can be optimized using Bayesian optimization techniques (Fig. 3c).
This could be useful for chemistry research for robotic labs.

Neural operators are guaranteed to be discretization invariant, meaning that they can work on any discretization of inputs and converge to a limit upon mesh refinement. Once neural operators are trained, they can be evaluated at any resolution without the need for retraining. In contrast, the performance of standard neural networks can degrade when data resolution during deployment changes from model training.
Look this up: anyone familiar with this? Sounds complicated but very promising for domains with a large range of resolutions (medical imaging, wildfire management).

Standard neural network models can be inadequate for scientific applications as they assume a fixed data discretization. This approach is unsuitable for many scientific datasets collected at varying resolutions and grids.
Is discretized resolution of neural networks an issue for science?

generating hypotheses
Are any of the "generated hypotheses" more general than a molecular shape? Are they full hypothetical explanations for a problem? (yes)

Applications of symbolic regression in physics use grammar VAEs [150]. These models represent discrete symbolic expressions as parse trees using context-free grammar and map the trees into a differentiable latent space. Bayesian optimization is then employed to optimize the latent space for symbolic laws while ensuring that the expressions are syntactically valid. In a related study, Brunton and colleagues [151] introduced a method for differentiating symbolic rules by assigning trainable weights to predefined basis functions. Sparse regression was used to select a linear combination of the basis functions that accurately represented the dynamic system while maintaining compactness. Unlike equivariant neural networks, which use a predefined inductive bias to enforce symmetry, symmetry can be discovered as the characteristic behaviour of a domain. For instance, Liu and Tegmark [152] described asymmetry as a smooth loss function and minimized the loss function to extract previously unknown symmetries. This approach was applied to uncover hidden symmetries in black-hole waveform datasets, revealing unexpected space–time structures that were historically challenging to find.
This seems very important, even though I only understand half of it. My question is, can similar approaches be used to apply to planning in complex domains or to meaning and truth in language?

to address the difficulties that scientists care about, the development and evaluation of AI methods must be done in real-world scenarios, such as plausibly realizable synthesis paths in drug design [217,218], and include well-calibrated uncertainty estimators to assess the model's reliability before transitioning it to real-world implementation
It's important to move beyond toy models.

However, current transfer-learning schemes can be ad hoc, lack theoretical guidance [213] and are vulnerable to shifts in underlying distributions [214]. Although preliminary attempts have addressed this challenge [215,216], more exploration is needed to systematically measure transferability across domains and prevent negative transfer.
There is still a lot of work to do to know how to best use human knowledge to guide learning systems and how to reuse models in different domains.

Another approach for using neural networks to solve mathematical problems is transforming a mathematical formula into a binary sequence of symbols. A neural network policy can then probabilistically and sequentially grow the sequence one binary character at a time [6]. By designing a reward that measures the ability to refute the conjecture, this approach can find a refutation to a mathematical conjecture without prior knowledge about the mathematical problem.
A nice idea: learn a formula of symbols which can be evaluated logically for truth. But do they mention more general approaches, such as using SAT solvers for this task? See Vijay Ganesh's work.

foresighted
is "foresighted" a word?

AI methods have become invaluable when hypotheses involve complex objects such as molecules. For instance, in protein folding, AlphaFold2 [10] can predict the 3D atom coordinates of proteins from amino acid sequences with atomic accuracy, even for proteins whose structure is unlike any of the proteins in the training dataset.
This is an important category, but it can't apply to all fields and will have a limit to what it can do to move science forward. It's also very dependent on vast computing resources.

Transformer architectures
Question: what is the inductive bias of Transformers for NLP? Can we define the symmetries that are implicitly leveraged in the architecture?

Such pretrained models [96,97,98] with a broad understanding of a scientific domain are general-purpose predictors that can be adapted for various tasks, thereby improving label efficiency and surpassing purely supervised methods [8].
Pretrained models: these are obviously important and powerful; they almost always work better than training from scratch.
General-purpose predictors: however, we should be suspicious of accepting the claim that they are general-purpose predictors. Why?
- Have all of the scenarios been tested?
- Does the system have a general underlying model?
- Is there some bias in the training and testing data?
Example: you pretrain a model on motion of objects on a plane, such as a pool table, and learn a very good model to predict movement. Now, does it work if the table is curved? Or even has bumps and imperfections? Now train it on 3D Newtonian examples: will it predict relativistic effects? (No)

In the analysis of scientific images, objects do not change when translated in the image, meaning that image segmentation masks are translationally equivariant as they change equivalently when input pixels are translated.
an example of symmetry
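That symmetry can be checked numerically in a few lines (NumPy; a hand-rolled circular convolution stands in for a CNN layer, and the circular boundary is an assumption that makes the equivariance exact):

```python
import numpy as np

# Claim: shifting the input and then convolving gives the same result as
# convolving first and shifting the output (translation equivariance).
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))

def circular_conv2d(x, k):
    # 3x3 correlation with wrap-around boundaries, built from np.roll
    out = np.zeros_like(x)
    for di in range(-1, 2):
        for dj in range(-1, 2):
            out += k[di + 1, dj + 1] * np.roll(x, (-di, -dj), axis=(0, 1))
    return out

shift = (2, 3)
a = circular_conv2d(np.roll(img, shift, axis=(0, 1)), kernel)
b = np.roll(circular_conv2d(img, kernel), shift, axis=(0, 1))
assert np.allclose(a, b)  # shift-then-convolve == convolve-then-shift
```

This is exactly why convolutions are the right inductive bias for segmentation masks: the architecture guarantees the property instead of having to learn it.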

Symmetry is a widely studied concept in geometry69. It can be described in terms of invariance and equivariance (Box 1) to represent the behaviour of a mathematical function, such as a neural feature encoder, under a group of transformations, such as the SE(3) group in rigid body dynamics.
Symmetry is a very broad concept even beyond geometry, although that is the easiest area to think about. If you are interested, it is worth looking into category theory and symmetry more generally. If you can find a type of symmetry that no one else has, for a meaningful categorical/geometric pattern that relates to a real type of data, task or domain, then you might be able to start the next new architecture revolution.

Another strategy for data labelling leverages surrogate models trained on manually labelled data to annotate unlabelled samples and uses these predicted pseudolabels to supervise downstream predictive models.
This kind of bootstrapping of human labelling is what made ChatGPT (v3) break through the level of coherence that caused so much excitement in Nov 2022 and afterwards.
It is also becoming a very common strategy, seemingly replacing an entire industry of full human labelling with a more focused process of label → learn → pseudo-label → refine → repeat.
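A minimal sketch of one iteration of that loop (NumPy; a nearest-neighbour rule stands in for both the surrogate and the downstream model, and the two-blob data is invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_predict(train_x, train_y, x):
    # label each point by its nearest labelled neighbour (toy surrogate)
    d = ((x[:, None, :] - train_x[None, :, :]) ** 2).sum(-1)
    return train_y[d.argmin(axis=1)]

# tiny labelled set, large unlabelled pool (two well-separated blobs)
labeled_x = np.array([[0.0, 0.0], [5.0, 5.0]])
labeled_y = np.array([0, 1])
pool_x = np.vstack([rng.normal(0, 0.3, (50, 2)),
                    rng.normal(5, 0.3, (50, 2))])

# surrogate annotates the pool with pseudo-labels ...
pseudo_y = nn_predict(labeled_x, labeled_y, pool_x)

# ... and the downstream model trains on labelled + pseudo-labelled data
all_x = np.vstack([labeled_x, pool_x])
all_y = np.concatenate([labeled_y, pseudo_y])
```

Two hand labels become 102 training examples; the refine-and-repeat step would re-fit the surrogate on the enlarged set and re-label the next batch.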

To identify rare events for future scientific enquiry, deep-learning methods [18] replace preprogrammed hardware event triggers with algorithms that search for outlying signals to detect unforeseen or rare phenomena
The importance of filtering out irrelevant data.

Recent findings demonstrate the potential for unsupervised language AI models to capture complex scientific concepts [15], such as the periodic table, and predict applications of functional materials years before their discovery, suggesting that latent knowledge regarding future discoveries may be embedded in past publications.
This is one I often point to, and it wasn't even using the latest transformer approach to language modelling.

inductive biases (Box 1), which are assumptions representing structure, symmetry, constraints and prior knowledge as compact mathematical statements. However, applying these laws can lead to equations that are too complex for humans to solve, even with traditional numerical methods [9]. An emerging approach is incorporating scientific knowledge into AI models by including information about fundamental equations, such as the laws of physics or principles of molecular structure and binding in protein folding. Such inductive biases can enhance AI models by reducing the number of training examples needed to achieve the same level of accuracy [10] and scaling analyses to a vast space of unexplored scientific hypotheses [11].
Inductive biases: these are becoming more and more critical to understand, and are a good place for academic researchers to focus for new advances, since they don't generally depend on scale or vast amounts of data. These are fundamental insights into the symmetries and structure of a domain, task or architecture.

Box 1 Glossary
A good set of definitions of various terms.

and coupled with new algorithms
Almost an afterthought here; I would cast it differently: the new algorithms are a major part of it as well.
Listed algorithm types:
- geometric deep learning
- self-supervised learning of foundation models
- generative models
- reinforcement learning

geometric deep learning (Box 1) has proved to be helpful in integrating scientific knowledge, presented as compact mathematical statements of physical relationships, prior distributions, constraints and other complex descriptors, such as the geometry of atoms in molecules
Geometric deep learning: an interesting broad category for graph learning and other methods. Is this a common way to refer to this subfield?


arxiv.org

[Bengio, "The Consciousness Prior", arXiv, 2018]


arxiv.org

"Causal Deep Learning". Authors: Jeroen Berrevoets, Krzysztof Kacprzyk, Zhaozhi Qian, Mihaela van der Schaar
A very general and ambitious approach for representing the full continuous conceptual spectrum of Pearl's Causal Ladder, with the ability to model and learn parts of it from data.
Discussed by Prof. van der Schaar at the ICML 2023 workshop on Counterfactuals.


arxiv.org

(Cousineau, Verter, Murphy and Pineau, 2023) "Estimating causal effects with optimization-based methods: A review and empirical comparison"

Bias-variance tradeoff
The Bias-Variance Tradeoff!


oid.wharton.upenn.edu

To avoid such bias, a fundamental aspect in the research design of studies of causal inference is the identification strategy: a clear definition of the sources of variation in the data that can be used to estimate the causal effect of interest.
To avoid making false conclusions, studies must identify all the sources of variation. Is this even possible in most cases?

Matching: This approach seeks to replicate a balanced experimental design using observational data by finding close matches between pairs or groups of units and separating out the ones that received a specified treatment from those that did not, thus defining the control groups.
Matching approach to dealing with sampling bias. Basically, use some intrinsic (or other) metric about the situations to cluster them so that "similar" situations will be dealt with similarly. Then analysis is carried out on those clusters. The number of clusters has to be defined; some method, like k-means, is often used. Depends a lot on the similarity metric, the clustering approach, and other assumptions.
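A rough sketch of that cluster-then-compare idea (NumPy; the hand-rolled k-means, the covariates, and the simulated treatment effect of 2.0 are all illustrative assumptions):

```python
import numpy as np

# Simulated observational data: outcome depends on covariates X plus a
# true treatment effect of 2.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # covariates
treated = rng.integers(0, 2, 100)        # treatment indicator
y = X.sum(1) + 2.0 * treated + rng.normal(0, 0.1, 100)

# A few Lloyd iterations of k-means (k = 5) to form matched clusters.
centroids = X[rng.choice(100, 5, replace=False)]
for _ in range(10):
    labels = ((X[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    centroids = np.array([X[labels == k].mean(0) if (labels == k).any()
                          else centroids[k] for k in range(5)])

# Within-cluster difference of treated vs. control means, averaged over
# clusters that contain both groups.
effects = [y[(labels == k) & (treated == 1)].mean()
           - y[(labels == k) & (treated == 0)].mean()
           for k in range(5)
           if ((labels == k) & (treated == 1)).any()
           and ((labels == k) & (treated == 0)).any()]
estimate = float(np.mean(effects))  # should land near the true effect, 2.0
```

As the note says, everything hinges on the metric and the clustering: a bad similarity measure groups dissimilar units and the within-cluster comparison stops being apples-to-apples.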

Terwiesch, 2022. "A Review of Empirical Operations Management over the Last Two Decades". Listed as an important review of methods for addressing biases in Operations Management by explicitly addressing causality.


openreview.net

Shayan Shirahmad Gale Bagi, Zahra Gharaee, Oliver Schulte, and Mark Crowley. "Generative Causal Representation Learning for Out-of-Distribution Motion Forecasting". In International Conference on Machine Learning (ICML). Honolulu, Hawaii, USA. Jul, 2023.


arxiv.org

"Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning". Yuejiang Liu, Alexandre Alahi, Chris Russell, Max Horn, Dominik Zietlow, Bernhard Schölkopf, Francesco Locatello.


arxiv.org

Wu, Prabhumoye, Yeon Min, Bisk, Salakhutdinov, Azaria, Mitchell and Li. "SPRING: GPT-4 Outperforms RL Algorithms by Studying Papers and Reasoning". arXiv preprint arXiv:2305.15486v2, May, 2023.

Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training.
Them's fighten' words!
I haven't read it yet, but we're putting it on the list for this fall's reading group. Seriously, a strong result with a very strong implied claim. They are careful to say it's from their empirical results; very worth a look. I suspect the amount of implicit knowledge in the papers, text and DAG is helping to do this.
The Big Question: is their comparison to RL baselines fair? Are they being trained from scratch? What does a fair comparison of any from-scratch model (RL or supervised) mean when compared to an LLM approach (or any approach using a foundation model), when that model is not really from scratch?


link.springer.com

Chapter 21 "Adversarial Autoencoders" from our book "Elements of Dimensionality Reduction and Manifold Learning", Springer 2023.


assets.pubpub.org

Discussion of the paper:
Ghojogh B, Ghodsi A, Karray F, Crowley M. Theoretical Connection between Locally Linear Embedding, Factor Analysis, and Probabilistic PCA. Proceedings of the Canadian Conference on Artificial Intelligence [Internet]. 2022 May 27; Available from: https://caiac.pubpub.org/pub/7eqtuyyc


www.gatesnotes.com

"The Age of AI has begun : Artificial intelligence is as revolutionary as mobile phones and the Internet." Bill Gates, March 21, 2023. GatesNotes


www.inc.com

Minda Zetlin. "Bill Gates Says We're Witnessing a 'Stunning' New Technology Age. 5 Ways You Must Prepare Now". Inc.com, March 2023.



It should not be used as a primary decisionmaking tool, but instead as a complement to other methods of determining the source of a piece of text.
This is actually true of any of these LLM models, for any task.


arxiv.org

Feng, 2022. "Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis"
Shared and found via Gowthami Somepalli (@gowthami@sigmoid.social on Mastodon): StructureDiffusion improves the compositional generation capabilities of text-to-image #diffusion models by modifying the text guidance using a constituency tree or a scene graph.


www.semanticscholar.org

Training language models to follow instructions with human feedback
Original paper for discussion of the Reinforcement Learning from Human Feedback (RLHF) algorithm.


arxiv.org

[Kapturowski, DeepMind, Sep 2022] "Human-level Atari 200x Faster"
Improving the 2020 Agent57 performance to be more efficient.


d4mucfpksywv.cloudfront.net

GPT-2 introduction paper
"Language Models are Unsupervised Multitask Learners" A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019).


arxiv.org

"Attention is All You Need" Foundational paper introducing the Transformer Architecture.



GPT-3 introduction paper


arxiv.org

"Are Pre-trained Convolutions Better than Pre-trained Transformers?"


arxiv.org

LaMDA: Language Models for Dialog Applications
"LaMDA: Language Models for Dialog Applications" Google's introduction of the LaMDA v1 Large Language Model.



Benyamin Ghojogh, Ali Ghodsi. "Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey"

 Sep 2023

arxiv.org

Adaptive Stress Testing with Reward Augmentation for Autonomous Vehicle Validation

 Aug 2023

arxiv.org

Title: "Delays, Detours, and Forks in the Road: Latent State Models of Training Dynamics". Authors: Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho. Note: This paper seems cool: it uses older, interpretable machine learning models (graphical models) to understand what is going on inside a deep neural network.

 Jul 2023

arxiv.org

“Rung 1.5” Pearl’s ladder of causation [1, 10] ranks structures in a similar way as we do, i.e., increasing a model’s causal knowledge will yield a higher place upon his ladder. Like Pearl, we have three different levels in our scale. However, they do not correspond one-to-one.
They rescale Pearl's ladder levels downwards and define a new scale, arguing that the original definition of counterfactual reasoning as a distinct level on its own actually combines multiple types of added reasoning complexity.


proceedings.mlr.press

They think BoN moves reward mass from low-reward samples to high-reward samples.

We find empirically that for best-of-n (BoN) sampling
They found this relationship surprising, but it does seem to fit better than other functions which mimic the general shape.
Question: is there a good reason why?

d
they use sqrt since KL scales quadratically, so it gets rid of the power of 2.
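As a sanity check on these quantities, the closed-form KL between exact best-of-n sampling and the base policy (KL = log n − (n − 1)/n, the formula used in this line of work) and a toy BoN sampler can be sketched in a few lines. Function names here are mine, chosen for illustration:

```python
import math
import random

def bon_kl(n: int) -> float:
    # Closed-form KL between the best-of-n distribution and the base policy:
    # KL = log(n) - (n - 1)/n, so KL grows only logarithmically in n.
    return math.log(n) - (n - 1) / n

def best_of_n(sample, reward, n: int):
    # Draw n candidates from the base policy and keep the highest-reward one.
    return max((sample() for _ in range(n)), key=reward)

# KL is 0 for n = 1 (no selection pressure) and grows slowly with n,
# which is why plots against sqrt(KL) stay in a manageable range.
kl_1, kl_4, kl_16 = bon_kl(1), bon_kl(4), bon_kl(16)
```

With a toy base policy (uniform samples) and `reward = identity`, `best_of_n` visibly shifts mass toward high-reward samples, which is the effect the note above describes.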

RL
"for ... we don't see any overoptimization, we just see the .. monotonically improves"
For RL, though, what I see looks like linear growth that might still bend down later.


arxiv.org

The MuZero paper: model-based learning when the model is not directly available.


www.semanticscholar.org

Llama 2 release paper


arxiv.org

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le. "Towards a Human-like Open-Domain Chatbot". Google Research, Brain Team.
Defined the SSI metric for chatbots used in the LaMDA paper by Google.


arxiv.org

LaMDA pretraining as a language model.
Does this figure really mean anything? There is no 3 in the paper at all.

Safety does not seem to benefit much from model scaling without finetuning.
Safety does not seem to be improved by larger models.

How LaMDA handles groundedness through interactions with an external information retrieval system
Does LaMDA always ask these questions? How far down the chain does it go?

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.
SSI metric definitions.

Using one model for both generation and discrimination enables an efficient combined generate-and-discriminate procedure.
bidirectional model benefits

LaMDA Mount Everest provides facts that could not be attributed to known sources in about 30% of responses
Even with all this work, it will hallucinate about 30% of the time


arxiv.org

linear
W_\theta


arxiv.org

Because DDPG is an off-policy algorithm, the replay buffer can be large, allowing the algorithm to benefit from learning across a set of uncorrelated transitions.
Off-policy algorithms can have a larger replay buffer.

One challenge when using neural networks for reinforcement learning is that most optimization algorithms assume that the samples are independently and identically distributed. Obviously, when the samples are generated from exploring sequentially in an environment this assumption no longer holds. Additionally, to make efficient use of hardware optimizations, it is essential to learn in minibatches, rather than online. As in DQN, we used a replay buffer to address these issues
Motivation for minibatches of training experiences and for the use of replay buffers for Deep RL.
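The mechanism being motivated here can be sketched as a minimal uniform replay buffer (a generic sketch, not the paper's implementation; the class and method names are my own):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of transitions. Sampling uniformly at random
    breaks the temporal correlation between consecutive transitions, so
    minibatches look closer to i.i.d. samples."""

    def __init__(self, capacity: int):
        # deque with maxlen evicts the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # uniform sampling without replacement from the stored transitions
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Because DDPG is off-policy, the capacity can be large; old transitions from earlier policies remain usable for learning.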

The DPG algorithm maintains a parameterized actor function μ(s|θ^μ) which specifies the current policy by deterministically mapping states to a specific action. The critic Q(s, a) is learned using the Bellman equation as in Q-learning. The actor is updated by applying the chain rule to the expected return from the start distribution J with respect to the actor parameters:
∇_{θ^μ} J ≈ E_{s_t∼ρ^β} [ ∇_{θ^μ} Q(s, a|θ^Q) |_{s=s_t, a=μ(s_t|θ^μ)} ] = E_{s_t∼ρ^β} [ ∇_a Q(s, a|θ^Q) |_{s=s_t, a=μ(s_t)} ∇_{θ^μ} μ(s|θ^μ) |_{s=s_t} ]   (6)
Silver et al. (2014) proved that this is the policy gradient, the gradient of the policy’s performance
The original DPG algorithm (non-deep) takes the Actor-Critic idea and makes the Actor deterministic.
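The chain rule in the actor update can be checked numerically on a toy problem. The critic Q(s, a) = −(a − s)² and actor μ_θ(s) = θ·s below are my own illustrative choices, not from the paper:

```python
# Toy check of the deterministic policy gradient chain rule:
# J(theta) = Q(s, mu_theta(s)), so dJ/dtheta = dQ/da * dmu/dtheta.
def q(s, a):
    return -(a - s) ** 2          # toy critic: maximized when a = s

def mu(theta, s):
    return theta * s              # toy deterministic actor

def dpg_grad(theta, s):
    dq_da = -2.0 * (mu(theta, s) - s)   # critic gradient w.r.t. the action
    dmu_dtheta = s                       # actor gradient w.r.t. its parameter
    return dq_da * dmu_dtheta            # chain rule, as in equation (6)

# finite-difference check of the analytic gradient
theta, s, eps = 0.3, 2.0, 1e-6
numeric = (q(s, mu(theta + eps, s)) - q(s, mu(theta - eps, s))) / (2 * eps)
```

The analytic and finite-difference gradients agree, which is exactly the "backpropagate through the critic into the actor" step that DDPG implements with neural networks.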

Interestingly, all of our experiments used substantially fewer steps of experience than was used by DQN learning to find solutions in the Atari domain.
Training with DDPG seems to require fewer steps/examples than DQN.

The original DPG paper evaluated the algorithm with toy problems using tile-coding and linear function approximators. It demonstrated data efficiency advantages for off-policy DPG over both on- and off-policy stochastic actor critic.
(non-deep) DPG used tile-coding and linear VFAs.

It can be challenging to learn accurate value estimates. Q-learning, for example, is prone to overestimating values (Hasselt, 2010). We examined DDPG’s estimates empirically by comparing the values estimated by Q after training with the true returns seen on test episodes. Figure 3 shows that in simple tasks DDPG estimates returns accurately without systematic biases. For harder tasks the Q estimates are worse, but DDPG is still able to learn good policies.
DDPG avoids the overestimation problem that Q-learning has, without using Double Q-learning.

It is not possible to straightforwardly apply Q-learning to continuous action spaces, because in continuous spaces finding the greedy policy requires an optimization of a_t at every timestep; this optimization is too slow to be practical with large, unconstrained function approximators and nontrivial action spaces
Why it is not possible for pure Q-learning to handle continuous action spaces.
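The problem is concrete: every greedy action selection is itself an inner optimization over the action space. A toy sketch (the critic here is my own illustrative choice) makes the cost visible with a brute-force grid search:

```python
import numpy as np

def q(s, a):
    # toy critic with a known maximizer at a = 0.7 * s
    return -(a - 0.7 * s) ** 2

def greedy_action(s, grid=np.linspace(-1.0, 1.0, 1001)):
    # In a continuous action space, argmax_a Q(s, a) has no closed form in
    # general, so each timestep requires an inner search (here: 1001 critic
    # evaluations). With a deep critic this per-step cost is prohibitive,
    # which is why DDPG learns an actor mu(s) to output the action directly.
    return grid[np.argmax(q(s, grid))]
```

DDPG's deterministic actor amortizes this inner optimization: μ(s) is trained (via the chain rule through Q) to output the approximately greedy action in a single forward pass.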

Our contribution here is to provide modifications to DPG, inspired by the success of DQN, which allow it to use neural network function approximators to learn in large state and action spaces online
contribution of this paper.

Directly implementing Q-learning (equation 4) with neural networks proved to be unstable in many environments.

As with Q-learning, introducing nonlinear function approximators means that convergence is no longer guaranteed. However, such approximators appear essential in order to learn and generalize on large state spaces.
Why Q-learning with nonlinear function approximation no longer has guaranteed convergence.

We refer to our algorithm as Deep DPG (DDPG, Algorithm 1).


proceedings.mlr.press

(Espeholt, ICML, 2018) "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures"

We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace.
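My reading of the V-trace targets can be sketched in numpy: δ_t = ρ_t (r_t + γV(x_{t+1}) − V(x_t)) with importance ratios clipped at ρ̄ and c̄, accumulated backward along the trajectory. Function and variable names are mine, not the paper's:

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets for one behaviour-policy trajectory.
    rhos[t] = pi(a_t|s_t) / mu(a_t|s_t): importance ratios between the
    learner policy pi and the behaviour policy mu."""
    rho = np.minimum(rho_bar, rhos)   # clipped rho_t: sets the fixed point
    c = np.minimum(c_bar, rhos)       # clipped c_t: sets contraction speed
    values_tp1 = np.append(values[1:], bootstrap_value)
    deltas = rho * (rewards + gamma * values_tp1 - values)
    # backward recursion: v_s - V(s) = delta_s + gamma*c_s*(v_{s+1} - V(s_{s+1}))
    acc = 0.0
    vs_minus_v = np.zeros_like(values)
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * c[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v

# on-policy check: with rho = c = 1 the targets reduce to n-step returns
vs = vtrace_targets(np.array([1.0, 1.0]), np.array([0.0, 0.0]), 0.0,
                    np.ones(2), gamma=1.0)
```

In the on-policy case (π = μ, so all ratios are 1) this collapses to ordinary n-step bootstrapped returns, which is the correctness check worth keeping in mind.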

we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters

the progress has been primarily in single-task performance

multitask reinforcement learning
Task: Multitask Reinforcement Learning

IMPALA (Figure 1) uses an actor-critic setup to learn a policy π and a baseline function V^π. The process of generating experiences is decoupled from learning the parameters of π and V^π. The architecture consists of a set of actors, repeatedly generating trajectories of experience, and one or more learners that use the experiences sent from actors to learn π off-policy.

an agent is trained on each task

scalability

separately

We are interested in developing new methods capable of mastering a diverse set of tasks simultaneously as well as environments suitable for evaluating such methods.
Task: train agents that can do more than one thing.

IMPALA actors communicate trajectories of experience (sequences of states, actions, and rewards) to a centralised learner

full trajectories of experience

aggressively parallelising all time-independent operations

learning becomes off-policy

IMPALA achieves exceptionally high data throughput rates of 250,000 frames per second, making it over 30 times faster than single-machine A3C

With the introduction of very deep model architectures, the speed of a single GPU is often the limiting factor during training.

IMPALA is also more data efficient than A3C based agents and more robust to hyperparameter values and network architectures

IMPALA uses synchronised parameter update which is vital to maintain data efficiency when scaling to many machines

A3C


proceedings.mlr.press

This paper introduced the DPG Algorithm


openreview.net

Link to page with information about the paper: https://openreview.net/forum?id=rJeXCo0cYX


openreview.net

Yann LeCun released his vision for the future of Artificial Intelligence research in 2022, and it sounds a lot like Reinforcement Learning.


www.cs.toronto.edu

The paper that introduced the DQN algorithm for using Deep Learning with Reinforcement Learning to play Atari games.


arxiv.org

Paper that evaluated the existing Double Q-learning algorithm on the new DQN approach and validated that it is very effective in the Deep RL realm.

Q-learning (Watkins, 1989) is one of the most popular reinforcement learning algorithms, but it is known to sometimes learn unrealistically high action values because it includes a maximization step over estimated action values, which tends to prefer overestimated to underestimated values
Qlearning tends to overestimate the value of an action.
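The bias is easy to reproduce: even when every true action value is zero, the max over noisy estimates is positive in expectation. A small Monte Carlo check (my own toy setup, not from the paper):

```python
import random

# True value of every action is 0; estimates carry standard-normal noise.
# Q-learning's max over these estimates is biased upward, because the max
# picks whichever action happens to have the largest positive error.
random.seed(0)
n_actions, n_trials = 10, 5000
bias = sum(
    max(random.gauss(0.0, 1.0) for _ in range(n_actions))
    for _ in range(n_trials)
) / n_trials
# bias is roughly E[max of 10 standard normals], well above the true value 0
```

Double Q-learning's fix is to decorrelate selection and evaluation: pick the argmax with one set of estimates and read its value from an independent set, so a single action's positive error no longer wins both steps.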

noise

unify these views

we can learn a parameterized value function

insufficiently flexible function approximation

Both the target network and the experience replay dramatically improve the performance of the algorithm

The target used by DQN is then

show overestimations can occur when the action values are inaccurate, irrespective of the source of approximation error
They show overestimations occur when there is approximation error in the value function approximation for Q(s,a).

θt

upward bias

In the original Double Q-learning algorithm, two value functions are learned by assigning each experience randomly to update one of the two value functions, such that there are two sets of weights, θ and θ′

θ′t

while Double Q-learning is unbiased.

The orange bars show the bias in a single Q-learning update when the action values are Q(s, a) = V∗(s) + ε_a and the errors {ε_a}_{a=1}^m are independent standard normal random variables. The second set of action values Q′, used for the blue bars, was generated identically and independently. All bars are the average of 100 repetitions.


arxiv.org

DDPG

multiplying the rewards generated from an environment by some scalar

ELU

This is akin to clipping the rewards to [0, 1]

network structure of
different activation functions tried

Hyperparameters
hyperparameters: alpha (learning rate), dropout prob, number of layers in your network, width of network layers, activation function (ReLU, ELU, tanh, ...), CNN?, RNN?, ..., epsilon (for the ε-greedy policy)
parameters (specific to the problem): parameters of Q(s,a) and policy π (θ, w), gamma (how important is the future?)

PPO


arxiv.org

TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of β that performs well across different problems

gradient estimator

we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse.

not sufficient to simply choose a fixed penalty coefficient β and optimize the penalized objective Equation (5) with SGD

objective function (the “surrogate” objective) is maximized
PPO is a response to the TRPO algorithm, trying to use the core idea but implement a more efficient and simpler algorithm.
TRPO defines the problem as a straight optimization problem; no learning is actually involved.

Generalizing this choice, we can use a truncated version of generalized advantage estimation

Without a constraint, maximization of L^CPI would lead to an excessively large policy update; hence, we now consider how to modify the objective, to penalize changes to the policy that move r_t(θ) away from 1
The policy iteration objective proposes steps which are too large. It uses a likelihood ratio of the current policy against an older version of the policy, multiplied by the Advantage function. So, it uses the change in the policy probability for an action to weight the Advantage function.

our goal of a first-order algorithm that emulates the monotonic improvement of TRPO,

A proximal policy optimization (PPO) algorithm that uses fixed-length trajectory segments is shown below. Each iteration, each of N (parallel) actors collects T timesteps of data. Then we construct the surrogate loss on these NT timesteps of data, and optimize it with minibatch SGD

The first term inside the min is L^CPI. The second term, clip(r_t(θ), 1 − ε, 1 + ε) Â_t, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving r_t outside of the interval [1 − ε, 1 + ε]. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective
The "clip" function cuts off the probability ratio output so that some changes in Advantage are ignored.
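The clipped surrogate is small enough to write out directly; this per-sample sketch follows the min/clip structure from the paper (ε = 0.2 is the paper's default; the function name is mine):

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO per-sample objective: min(r*A, clip(r, 1-eps, 1+eps)*A).
    Taking the min makes the result a pessimistic lower bound on the
    unclipped objective r*A."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# A > 0: gains from pushing the ratio above 1 + eps are clipped away
print(clipped_surrogate(1.5, 1.0))   # prints 1.2, not 1.5
# A < 0: the min still lets the larger penalty through (the pessimistic side)
print(clipped_surrogate(0.5, -1.0))  # prints -0.8, not -0.5
```

The asymmetry is the point the note above makes: improvements from ratio changes are ignored beyond the clip range, while deteriorations are always counted.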

Clipped Surrogate Objective
