Policy learning from demonstration, in its simplest form, can be formulated as the supervised regression task of learning to map observations to actions.
The target problem is Policy Learning from Demonstration for robotics.
Diffusion Policy learns the gradient of the action-distribution score function and iteratively optimizes with respect to this gradient field during inference via a series of stochastic Langevin dynamics steps. We find that the diffusion formulation yields powerful advantages when used for robot policies, including gracefully handling multimodal action distributions, being suitable for high-dimensional action spaces, and exhibiting impressive training stability.
What assumptions does it make about similarity of actions and smoothness of the policy space?
New transitions are inserted into the replay
New transitions don't have any initial evaluation, so their priority is set to the maximum initially. They therefore have a high probability of being selected, but it is not certain.
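A minimal sketch of how this typically looks in a prioritized replay buffer (class and field names are illustrative, not taken from the paper):

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Sketch: new transitions enter with the current maximum priority."""
    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha            # how strongly priority shapes sampling
        self.storage = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.pos = 0

    def add(self, transition):
        # No TD error yet for a new transition, so give it the max priority so far
        max_prio = self.priorities.max() if self.storage else 1.0
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        prios = self.priorities[:len(self.storage)] ** self.alpha
        probs = prios / prios.sum()    # high priority => likely, but not certain
        idxs = np.random.choice(len(self.storage), batch_size, p=probs)
        return [self.storage[i] for i in idxs], idxs
```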
In entropy-regularized reinforcement learning, the agent gets a bonus reward at each time step proportional to the entropy of the policy at that timestep.
The agent gets a benefit from increased entropy (i.e. randomness) in the policy, so SAC biases towards policies with more exploration.
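Written out (standard maximum-entropy RL objective, not quoted from the paper), the bonus enters the return as:

$$ J(\pi) = \mathbb{E}_{\pi}\Big[ \sum_t r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big] $$

where the temperature \(\alpha\) controls how strongly SAC trades reward for exploration.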
Our optimization thus needs to be able to create geometry and also destroy or move geometry if it has been incorrectly positioned.
How does it know if the geometry is incorrect? From the given training images?
The 3D Gaussian Splatting paper, an alternative to NeRFs for 3D rendering of objects from still images.
The CLIP paper, which learned a joint embedding of images and text that can match an arbitrary image to descriptive caption text.
One possible way to directly ground thinking in the external world is to build a world model [37] that predicts the consequences of the agent's actions upon the world, including predicting reward.
Could general AI agents using RL have their own detailed-enough model of the world to ground their reasoning?
An agent trained to imitate human thoughts or even to match human expert answers may inherit fallacious methods of thought deeply embedded within that data, such as flawed assumptions or inherent biases
imitating human thought or speech will not be sufficient for reasoning about all situations.
it is highly unlikely that human language provides the optimal instance of a universal computer
very true
The majority of high-quality data sources - those that can actually improve a strong agent's performance - have either already been, or soon will be consumed
We will soon run out of new data to train new kinds of supervised learning systems.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Reinforcement learning via supervised learning framework (RvS) [Schmidhuber, 2019] is another important paradigm in offline RL, which eliminates Q-learning thus free of extrapolation errors. RvS learns a policy conditioned on the observed returns via supervised learning and then conditions it on a high return to generate desired behaviors [Chen et al., 2021]. Similar to policy constraining, RvS requires fitting the entire dataset. Therefore, the expressiveness of parameterized policies also matters in RvS.
Schmidhuber's approach of doing offline RL by converting it to supervised learning. This is a fair approach, but how to make it generalize? Maybe diffusion will be better.
We explore our hypothesis by considering offline RL, where we will task agents with learning policies from suboptimal data – producing maximally effective behavior from fixed, limited experience.
Offline RL is the domain.
We consider the following shift in paradigm: instead of training a policy through conventional RL algorithms like temporal difference (TD) learning [6], we will train transformer models on collected experience using a sequence modeling objective. This will allow us to bypass the need for bootstrapping for long term credit assignment – thereby avoiding one of the "deadly triad" [6] known to destabilize RL. It also avoids the need for discounting future rewards, as typically done in TD learning, which can induce undesirable short-sighted behaviors. Additionally, we can make use of existing transformer frameworks widely used in language and vision that are easy to scale, utilizing a large body of work studying stable training of transformer models
A "new paradigm", no deadly triad, etc. Will it work?
Some notes on the standard MCTS wikipedia page for class.
the search attempts to prune sequences which are less relevant. In some cases, a play can lead to a very specific line of play which is significant, but which is overlooked when the tree is pruned, and this outcome is therefore "off the search radar"
disadvantage 1 : like any pruning algorithm that reduces the search space, there is a risk of closing doors, of cutting off important paths too early.
it achieves better results than classical algorithms in games with a high branching factor
advantage 2 : deals well with a higher branching factor, it can choose to expand more in one area than another.
pure Monte Carlo tree search does not need an explicit evaluation function
MCTS with UCT converges to the minimax algorithm for certain restricted games, but also has other advantages.
advantage 1: just like RL in general, the evaluation function can be implicit, coming from rollouts of the game and evaluating the outcome, so no explicit function for evaluation is needed for all states.
We trained a 13-layer policy network, which we call the SL policy network, from 30 million positions from the KGS Go Server. The network predicted expert moves on a held out test set with an accuracy of 57.0% using all input features, and 55.7% using only raw board position and move history as inputs, compared to the state-of-the-art from other research groups of 44.4%
The Supervised Learning (SL) policy is a CNN that predicts the next move, trained on expert games. Performance is better than previous work but still under 60% accuracy.
The training dataset of prompt-generation pairs for the RM is generated by sampling a set of prompts from a predefined dataset (Anthropic’s data generated primarily with a chat tool on Amazon Mechanical Turk is available on the Hub, and OpenAI used prompts submitted by users to the GPT API). The prompts are passed through the initial language model to generate new text.
An example of prompt data for training a model to weed out offensive, misleading, or incorrect text outputs. WARNING: the linked dataset has some very offensive examples.
DeepMind used a similar reward setup for Gopher but used synchronous advantage actor-critic (A2C) to optimize the gradients, which is notably different but has not been reproduced externally.
This is old, we should check if Gemini still uses A2C, but the choice of PPO is a bit arbitrary.
the update rule is the parameter update from PPO that maximizes the reward metrics in the current batch of data (PPO is on-policy, which means the parameters are only updated with the current batch of prompt-generation pairs). PPO is a trust region optimization algorithm that uses constraints on the gradient to ensure the update step does not destabilize the learning process.
The update is exactly a policy-gradient RL update of the parameters of the policy, even though there are no time dynamics and no real state in the MDP sense.
Human annotators are used to rank the generated text outputs from the LM. One may initially think that humans should apply a scalar score directly to each piece of text in order to generate a reward model, but this is difficult to do in practice. The differing values of humans cause these scores to be uncalibrated and noisy. Instead, rankings are used to compare the outputs of multiple models and create a much better regularized dataset.
The underlying goal is to get a model or system that takes in a sequence of text, and returns a scalar reward which should numerically represent the human preference.
The main idea of using this method. Note that this is a huge simplification of what a "good" or "bad" text output is, but it is fully general and can be used as a reward in an RL algorithm.
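A common way to turn those pairwise rankings into a scalar reward model is a Bradley-Terry style loss; a hedged sketch (the `reward_model` callable and token arguments are illustrative, not from the post):

```python
import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, chosen_tokens, rejected_tokens):
    """Bradley-Terry style loss: push r(chosen) above r(rejected)."""
    r_chosen = reward_model(chosen_tokens)      # scalar reward per sequence
    r_rejected = reward_model(rejected_tokens)
    # -log sigmoid(r_chosen - r_rejected): minimized when chosen outranks rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```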
John Schulman
John Schulman created the PPO algorithm, building on his PhD research on trust-region policy optimization. He was later a cofounder of OpenAI and became head of RL research and development, where his team developed the modern RLHF method for LLM alignment. He now works on alignment at Anthropic.
Our focus on this specific task is spurred by not only the fact that reasoning about actions and change is a core aspect of human intelligence, but also that it is required for many of the tasks considered as potential applications of LLMs including automatic code generation, moral and even deontological reasoning
Planning relates to other kinds of reasoning that go beyond word prediction. To believe the bold claims of LLM reasoning, we need to have rigorous experiments on other types of reasoning.
in this paper, we want to look at the ability of large language models to do reasoning about actions and change involving common-sense planning tasks.
Their goal
Of particular interest is the thread of efforts that aim to evaluate (and showcase) the LLM's ability to do reasoning tasks. For example, there are many claims centered around the fact that GPT-3 may possess some form of reasoning ability [16]. Such sources generally assume that because the model learned from large amounts of real-world text, it may have acquired some approximation of simple reasoning. This sparked interest in evaluating the large language models on various reasoning tasks including common-sense reasoning [29, 25, 8], logical reasoning [27], and even ethical reasoning [13]. The macro-tenor of the drumbeat of these works has been suggesting that LLMs are indeed capable of doing such kinds of reasoning [15, 33, 5].
The drumbeat of reasoning claims...
@inproceedings{valmeekam2022large, title={Large language models still can't plan (a benchmark for LLMs on planning and reasoning about change)}, author={Valmeekam, Karthik and Olmo, Alberto and Sreedharan, Sarath and Kambhampati, Subbarao}, booktitle={NeurIPS 2022 Foundation Models for Decision Making Workshop}, year={2022} }
these models are very good at a kind of pattern recognition, but often fail when they encounter novelty that forces them beyond the limits of their training
The Claim
neural networks of various kinds can generalise within a distribution of data they are exposed to, but their generalisations tend to break down beyond that distribution
True, as it's always been at a high level. But many neural networks do generalize in ways that feel surprising and impressive; usually these cases turn out to still be within the distribution, thanks to good inductive biases of the network.
Playing Atari with Deep Reinforcement Learning 19 Dec 2013 · Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller
The paper from 2013 that introduced the DQN algorithm for using Deep Learning with Reinforcement Learning to play Atari games.
Really interesting paper pointing out how the standard metrics used in point cloud and lidar analysis are not appropriate for all contexts, especially for machine learning.
We empirically observe that DCD usually provides a more consistent and reliable evaluation, especially when the
Their method does better especially when CD and EMD disagree
x, y, z
question: where did the depth come from initially?
Welcome to the Era of Experience David Silver, Richard S. Sutton
"This is a preprint of a chapter that will appear in the book Designing an Intelligence, published by MIT Press"
could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.
were they right or wrong, in hindsight?
JOURNAL OF IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 14, NO. 8, AUGUST 2023
Zhen Yang. “Does Negative Sampling Matter? A Review with Insights into its Theory and Applications”. JOURNAL OF IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 14, NO. 8, AUGUST 2023
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Scaled Dot-Product Attention Formula
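A direct NumPy transcription of the formula, for intuition only (single-head, no mask):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mix of the values
```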
Examples of mistakes where we can use attention to gain intuition into what the model saw.
Perhaps the best use of this approach is for looking for mistakes or understanding why a model does badly on certain data instances.
Visualization of the attention for each generated word. The rough visualizations obtained by upsampling the attention weights and smoothing. (top) "soft" and (bottom) "hard" attention (note that both models generated the same captions in this example)
In a trained model each word correlates to strong responses from certain parts of the image.
Our model learns a word/image alignment.
The high level view of the attention model.
Raw data and explanation of a bad model's prediction in the "Husky vs Wolf" task
Famous example of how a supervised model can overfit to extraneous data. Attention can help to discover such cases.
Open source dataset for using LLMs to locate root causes in microservices.
we apply Isolation Forest Method [17], an unsupervised machine learning algorithm to filter out anomalies from the sensor data.
They mention iForest, but cite iMondrian Forest. Which do they use?
One problem that came up was that the online iForest solution did produce similar results as the Scikit-learn libraries
which online iforest solution?
The ROC curve in the figures shows us that the solution is performing well, in some cases even better than the solution with the regular iForest
got better results than iforest in some cases.
H. Ma, B. Ghojogh, M. N. Samad, D. Zheng and M. Crowley, "Isolation Mondrian Forest for Batch and Online Anomaly Detection," 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada, 2020, pp. 3051-3058, doi: 10.1109/SMC42975.2020.9283073.
The algorithm fuses two ideas, "isolation" from ensemble trees methods for anomaly detection and "Mondrian forests" which can learn flexible regression models from streaming data.
Interesting position paper about how to have useful discussion about AGI and Foundation models.
Detailed explanation of what the DeepSeek model is doing differently to improve performance and training time over ChatGPT.
"Mapping Social Choice Theory to RLHF". Jessica Dai and Eve Fleisig. ICLR Workshop on Reliable and Responsible Foundation Models, 2024.
Nice overview of how social choice theory has been connected to RLHF and AI alignment ideas.
We sample minibatches of sequence length K from the dataset. The prediction head corresponding to the input token s_t is trained to predict a_t – either with cross-entropy loss for discrete actions or mean-squared error for continuous actions – and the losses for each timestep are averaged.
How the training loss is computed.
indicator notation,
in class our indicator notation looks different, we'd write:
$$h_m(x) = \sum_{j=1}^{J_m} b_{jm}\,\mathbb{I}_{R_{jm}}(x)$$
The tree partitions the input space into \(J_m\) disjoint regions \(R_{1m}, \ldots, R_{J_m m}\) and predicts a constant value in each region.
developers can use similarity searches to “remember” relevant information to the current prompt:
Similarity Search: similarity to the search text? To the situation plus the text? This is an interesting task.
One of the best sources of information in the game world is the game itself. Game state can be transcribed into text so that an SLM can reason about the game world
What is the mapping between the Game World and the Real world?
with the help of generative AI, and large language models trained on trillions of sentences describing how humans react to the world, we can start to simulate human-like decision making.
So their training basis is real human responses to the world scraped from internet data. Question: Do they limit themselves to conversations about "taking actions in the world"? How would they define that dataset?
to incorporate ACE autonomous game characters into their titles.
So, would all autonomous game characters have the same strategies and personalities that arise from NVIDIA implementation, regardless of the game or platform? Interesting.
Figure 1: Mathematics can illuminate the ways that ReLU-based neural networks shatter input space into countless polygonal regions, in each of which the model behaves like a linear map [2, 3, 4]. These decompositions create beautiful patterns. (Figure made with SplineCam [5]).
Fascinating! What ReLU does.
Great review of mathematical patterns and insights about recent ML research, and discussion of how the often complicated relationship between math and ML progress is playing out in the LLM era.
Like many existing option discovery methods, we too make the assumption that all options are available everywhere, i.e., \(\forall s \in S, \forall \omega \in \Omega : s \in I_\omega\). However, we show that our approach ends up relaxing this assumption, in effect, and provides an elegant way to learn distinct initiation sets for options
General assumptions typically made about option discovery.
The introduction of Transformers, such as GPT-4, transformed the field of natural language processing (NLP) and established benchmarks for several natural language tasks. Longer sequences have long been a thorn in the side of transformers, as they significantly hamper their efficiency. This deficiency is where Mamba excels: Mamba can process lengthy sequences more quickly than transformers and does so more simply due to its unique architecture.
Focus of mamba is on efficiently modelling long range dependencies, and allowing transitions to vary over "time"
good article on Mamba architecture vs transformers
We simplify prior deep sequence model architectures by combining the design of prior SSM architectures (Dao, Fu, Saab, et al. 2023) with the MLP block of Transformers into a single block, leading to a simple and homogenous architecture design (Mamba) incorporating selective state spaces
So the main idea is to unify the SSM block and the Transformer MLP block into a single block?
Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
"Every Model Learned by Gradient Descent Is Approximately a Kernel Machine" by Pedro Domingos, 2020.
the meaning of observational concepts is influenced by theoretical assumptions and presuppositions. For example, the concepts “mass” and “length” have different meanings in Newtonian and relativistic mechanics; so does the concept “temperature” in thermodynamics and statistical mechanics
Good example of how everything in a theory that we take as truth relates to the concepts of our paradigm.
This position has been adopted by Karl R. Popper, Rudolf Carnap and other leading figures in (broadly) empiricist philosophy of science. Many philosophers have argued that the relation between observation and theory is way more complex and that influences can actually run both ways (e.g., Duhem 1906 [1954]; Wittgenstein 1953 [2001]). The most lasting criticism, however, was delivered by Thomas S. Kuhn (1962 [1970]) in his book “The Structure of Scientific Revolutions”.
Competing views about the relation between observations, reality, and truth. Popper argues that observations help us distinguish which theories are true or not, always bringing us closer to a more true scientific theory. Wittgenstein argues the influence can go both ways. Kuhn argues that observations are couched in the language of our paradigm, so everything is relative to that.
the rewards are divided through by the standard deviation of a rolling discounted sum of the reward
big reward shaping
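A hedged sketch of that normalization (real implementations keep running statistics rather than the full history; names here are illustrative):

```python
import numpy as np

class RewardScaler:
    """Divide each reward by the std of a rolling discounted return (a sketch)."""
    def __init__(self, gamma=0.99, eps=1e-8):
        self.gamma, self.eps = gamma, eps
        self.ret = 0.0
        self.returns = []            # history of the rolling discounted sum

    def scale(self, reward):
        self.ret = self.gamma * self.ret + reward
        self.returns.append(self.ret)
        std = np.std(self.returns) + self.eps
        return reward / std          # the reward itself is scaled, not the return
```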
we find that they dramatically affect the performance of PPO. To demonstrate this, we start by performing a full ablation study on the four optimizations mentioned above
All these little optimizations in the implementation of PPO have a big impact on its performance.
In brief, DCD takes a step from CD and attempts to provide a rationale bridge towards EMD for a better sense of point distribution rather than being blinded by its nearest neighbour. Compared with EMD, it is not only more efficient but also stricter with local structures. A balanced distribution and good preservation of detailed structures are both important factors for the visual quality of the completion result.
DCD is an improvement on CD towards the very expensive EMD method.
Chamfer Distance between two point sets \(S_1\) and \(S_2\) is defined as
need to do both directions because of the minimization
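Since the quoted line is cut off, here is one standard (squared-distance) form of that definition; the paper's exact normalization may differ slightly:

$$ \mathrm{CD}(S_1, S_2) = \frac{1}{|S_1|}\sum_{x \in S_1}\min_{y \in S_2}\|x-y\|_2^2 \;+\; \frac{1}{|S_2|}\sum_{y \in S_2}\min_{x \in S_1}\|x-y\|_2^2 $$

Each term is a one-way nearest-neighbour average, which is why both directions are needed.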
Using input transformation gives a 0.8% performance boost.
but what is the input transformation?
Q(λ) is a variant of Q-learning where eligibility traces are used to calculate the TD error. As mentioned previously, the backwards view of traces is traditionally used
The version of Q(λ) with eligibility traces that they are using.
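A minimal tabular sketch of the backward view (naive Q(λ); Watkins's variant would additionally cut traces after non-greedy actions, and the terminal step drops the bootstrap term):

```python
import numpy as np

def q_lambda_update(Q, E, s, a, r, s_next, alpha, gamma, lam):
    """One backward-view Q(lambda) step on tabular Q and eligibility traces E."""
    delta = r + gamma * np.max(Q[s_next]) - Q[s, a]   # TD error
    E[s, a] += 1.0                                    # accumulating trace
    Q += alpha * delta * E                            # update all traced pairs
    E *= gamma * lam                                  # decay every trace
    return Q, E
```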
to provide a forecast Probability of Fire (PoF) on a given day within a 9 by 9 km grid cell
Their regression task.
To effectively fuse the multi-view information, we propose a geometrically-guided projective attention mechanism. Instead of applying full attention to densely aggregate features across spaces and views, it projects the estimated 3D joint into 2D anchor points for different views, and then selectively fuses the multi-view local features near to these anchors to precisely refine the 3D joint location. We propose to encode the camera rays into the multi-view feature representations via a novel RayConv operation to integrate multi-view positional information into the projective attention. In this way, the strong multi-view geometrical priors can be exploited by projective attention to obtain more accurate 3D pose estimation.
Definition:: projective attention
It takes into account the 3D space the points live in, and the rays of light that explain their 2D projections.
MvP : "Direct Multi-view Multi-person 3D Pose Estimation" Tao Wang, Jianfeng Zhang, Yujun Cai, Shuicheng Yan, Jiashi Feng
Influential paper on learning consistent skeletal models of human pose from multiview images
MvP designs a novel geometrically guided attention mechanism, called projective attention, to more precisely fuse the cross-view information for each joint.
question: what is projective attention?
Since the number of queries is larger than the actual number of people, we train an MLP-based classifier \(f_\beta(\cdot)\) to predict a score for each query based on the appearance term to remove the "empty" ones.
Initially there are more queries than there are actual pedestrians. A classifier is trained to prune out the non-people.
Really interesting and innovative method for using multiview perspective data to learn human pose and pedestrian detection.
We adopt a hierarchical query embedding scheme proposed in [36] to reduce the number of learnable parameters.
A hierarchical scheme to reduce learnable parameters; if you know something about the model structure, use it!
Most closely related to our work, MvP [36] extends DETR for multi-view 3D human pose estimation.
mostly based on [36]
APU-LDI: Learning Continuous Implicit Field with Local Distance Indicator for Arbitrary-Scale Point Cloud Upsampling
@inproceedings{li2024LDI, title={Learning Continuous Implicit Field with Local Distance Indicator for Arbitrary-Scale Point Cloud Upsampling}, author={Li, Shujuan and Zhou, Junsheng and Ma, Baorui and Liu, Yu-Shen and Han, Zhizhong}, booktitle={Proceedings of the AAAI Conference on Artificial Intelligence}, year={2024} }
We propose a variant of autoencoders which can work with two views of the data, while being explicitly trained to achieve all
The goal is to build an autoencoder that learns a common representation of a single object when given multiple perspectives during training.
"Correlational neural networks" - looking at learning from multiple perspectives of the same thing to increase representation learning.
@article{chandar2016neuralcompjour, author = {Chandar, Sarath and Khapra, Mitesh M and Larochelle, Hugo and Ravindran, Balaraman}, date-added = {2024-08-01 10:47:30 -0400}, date-modified = {2024-08-01 10:50:01 -0400}, journal = {Neural Computation}, keywords = {correlation-learning, machine-learning, inductive-bias, autoencoders}, number = {2}, pages = {257--285}, pdf = {https://www.researchgate.net/profile/Balaraman-Ravindran/publication/275588055_Correlational_Neural_Networks/links/55ed84d308ae21d099c75c00/Correlational-Neural-Networks.pdf}, publisher = {MIT Press}, title = {Correlational neural networks}, venue-short = {NeuralCompJour}, volume = {28}, year = {2016}}
This suggests that in scenarios with relatively low amounts of data, Decision Transformer can outperform %BC by using all trajectories in the dataset to improve generalization, even if those trajectories are dissimilar from the return conditioning target. Our results indicate that Decision Transformer can be more effective than simply performing imitation learning on a subset of the dataset. On the tasks we considered, Decision Transformer either outperforms or is competitive to %BC, without the confound of having to select the optimal subset
So it seems like it isn't just behaviour cloning. It works better than %BC with smaller amounts of training data, so it generalizes well. But with large amounts of data, simply copying the best demonstrated behaviour may be good enough.
Formally, the learned reward function can be defined as \(\Phi_R : (G, V) \to \mathbb{R}\) that maps a language goal G and a video snippet V to a scalar reward. An ideal \(\Phi_R\) should return a high reward if the behavior depicted in the video faithfully follows the language description, and a low reward otherwise
The essential idea of MineCLIP: a function trained on YouTube videos of Minecraft that takes in a description of an activity and a video purporting to complete it, and returns a reward score for how well it did.
Agents developed in popular RL benchmarks [119, 146] often rely on meticulously crafted dense and task-specific reward functions to guide random explorations. However, these rewards are hard or even infeasible to define for our diverse and open-ended tasks in MINEDOJO. To address this challenge, our key insight is to learn a dense, language-conditioned reward function from in-the-wild YouTube videos and their transcripts. Therefore, we introduce MINECLIP, a contrastive video-language model that learns to correlate video snippets and natural language descriptions (Fig. 4). MINECLIP is multi-task by design, as it is trained on open-vocabulary and diverse English transcripts
Designing a reward function is expensive and difficult. They learn one from the rich dataset they have for this domain.
simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte-Carlo rollouts.
A simpler search: a single network evaluates positions, rather than Monte Carlo rollouts providing the value estimates.
new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning
lookahead still happens, but now is inside the training loop
Unlike earlier versions of AlphaGo, Zero only perceived the board's stones, rather than having some rare human-programmed edge cases to help recognize unusual Go board positions.
AlphaGo Zero used no hand-crafted features or special-case knowledge, only the raw board positions.
We use a reward function r(s) that is zero for all non-terminal time steps t < T. The outcome z_t = ± r(s_T) is the terminal reward at the end of the game from the perspective of the current player at time step t: +1 for winning and −1 for losing.
reward function is as simple and sparse as possible, using the only thing you know for certain, whether you won or lost the game.
Using no search at all, the RL policy network won 85% of games against Pachi
When played head-to-head, the RL policy network won more than 80% of games against the SL policy network.
First test for the RL policy was to beat the SL Policy.
The summary paper for AlphaGo.
Monte Carlo tree search in AlphaGo.
Showing how monte carlo tree search works in alphago
The policy network
nice view of the policy and value networks in action
Most contemporary implementations of Monte Carlo tree search are based on some variant of UCT
The UCB algorithm for bandits comes back again as UCT, forming the basis for node selection within MCTS.
The main difficulty in selecting child nodes is maintaining some balance between the exploitation of deep variants after moves with high average win rate and the exploration of moves with few simulations.
Tree search makes this tradeoff very clear, how many paths will you explore before you stop and use the knowledge you already have?
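The standard UCT selection rule makes that balance explicit (standard form; constants vary by implementation):

$$ \mathrm{UCT}(i) = \frac{w_i}{n_i} + c\sqrt{\frac{\ln N}{n_i}} $$

where \(w_i/n_i\) is the average win rate of child \(i\) (exploitation), \(n_i\) its visit count, \(N\) the parent's visit count, and \(c\) the exploration constant (often \(\sqrt{2}\)).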
A really nice visual history of the development of Deep Learning, the cornerstone of modern AI and ML.
An illustration of alpha–beta pruning. The grayed-out subtrees don't need to be explored (when moves are evaluated from left to right), since it is known that the group of subtrees as a whole yields the value of an equivalent subtree or worse, and as such cannot influence the final result. The max and min levels represent the turn of the player and the adversary, respectively.
Alpha-beta pruning comes down to being smart about searching the tree of possible future game states, skipping branches that cannot change the final decision so the search is more efficient.
For example, the chess computer Deep Blue (the first one to beat a reigning world champion, Garry Kasparov at that time) looked ahead at least 12 plies, then applied a heuristic evaluation function.[6]
Deep Blue used a kind of minimax algorithm, with at least a 12-ply lookahead, to beat Garry Kasparov at chess.
Comparing Monte Carlo tree search searches, AlphaZero searches just 80,000 positions per second in chess and 40,000 in shogi, compared to 70 million for Stockfish and 35 million for Elmo. AlphaZero compensates for the lower number of evaluations by using its deep neural network to focus much more selectively on the most promising variation.[1]
The model allows it to be selective about what rollouts to do during MCTS
Wikipedia: AlphaZero
AZ has hard-coded rules for setting search hyperparameters.
that's interesting...
Overall, our results highlight the necessity of designing deep RL methods in a modular manner.
Modularity is more important than a big idea.
It turns out that the clipping mechanism is not necessary to achieve high performance—we find that PPO-NOCLIP performs uniformly better than PPO-M, despite the latter employing the core PPO clipping mechanism.
maybe clipping isn't so important?
We find that varying the use of code-level optimizations impacts performance significantly more than varying whether the PPO or TRPO step is used.
Writing better code had a bigger impact than the difference in the algorithm!
The theory justifying TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem $$\max_\theta \; \hat{\mathbb{E}}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\,\hat{A}_t \;-\; \beta\, \mathrm{KL}\!\big[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\big] \right] \qquad (5)$$ for some coefficient β
This parameter \(\beta\) is a bit mysterious. PPO works very well generally, but setting \(\beta\) is tricky, and it influences other parts of the algorithm.
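For reference, the clipped surrogate that PPO actually optimizes in place of the KL penalty:

$$ L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\;\mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} $$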
PPO has been around for a relatively long time
7 years is a long time apparently!
At this point in the RLHF system, we have an initial language model that can be used to generate text and a preference model that takes in any text and assigns it a score of how well humans perceive it.
The basic input/output structure needed.
By comparing model outputs in head-to-head matchups, an Elo system can be used to generate a ranking of the models and outputs relative to each-other. These different methods of ranking are normalized into a scalar reward signal for training.
Humans judge the relative performance of two LLMs; the rankings become the reward signal.
OpenAI fine-tuned on human-generated text that was “preferable” and Anthropic generated their initial LM for RLHF by distilling an original LM on context clues for their “helpful, honest, and harmless” criteria
(optional) produce human-augmented data to demonstrate a better sentence response.
2024 paper arguing that other methods beyond PPO could be better for "value alignment" of LLMs
Through experimental methods focusing on PG methods for continuous control, we investigate problems with reproducibility in deep RL. We find that both intrinsic (e.g. random seeds, environment properties) and extrinsic sources (e.g. hyperparameters, codebases) of non-determinism can contribute to difficulties in reproducing baseline algorithms. Moreover, we find that highly varied results due to intrinsic sources bolster the need for using proper significance analysis. We propose several such methods and show their value on a subset of our experiments.
Their findings: random seeds matter (unfortunately).
Paper "Deep Reinforcement Learning that Matters" on evaluating RL algorithms.
T. Herlau, "Moral Reinforcement Learning Using Actual Causation," 2022 2nd International Conference on Computer, Control and Robotics (ICCCR), Shanghai, China, 2022, pp. 179-185, doi: 10.1109/ICCCR54399.2022.9790262. keywords: {Digital control;Ethics;Costs;Philosophical considerations;Toy manufacturing industry;Reinforcement learning;Forestry;Causality;Reinforcement learning;Actual Causation;Ethical reinforcement learning}
Can model-free reinforcement learning explain deontological moral judgments? Alisabeth Ayars, University of Arizona, Dept. of Psychology, Tucson, AZ, USA
Parks, S.A.; Dillon, G.K.; Miller, C. A New Metric for Quantifying Burn Severity: The Relativized Burn Ratio. Remote Sens. 2014, 6, 1827-1844. https://doi.org/10.3390/rs6031827
Widely used model for #fire-severity prediction for forest wildfires in Canada and USA.
Briefly, these gridded datasets were built using an observed, satellite-derived measure of fire severity (Parks et al. 2014) and statistical models in which the probability of stand-replacing fire was modeled as a function of fuel, topography, climate, and weather. For a subset of ecoregions in our study area (Colorado Plateau, AZ–NM Mountains, and Apache Highlands), Parks et al. (2018b) also produced gridded datasets representing the probability of stand-replacing fire under extreme fire weather conditions.
prior work on predicting fire severity using a fixed model
Paper using fire risk prediction model.
Hubinger et al. "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training". arXiv:2401.05566v3, Jan 17, 2024.
Very disturbing and interesting results from team of researchers from Anthropic and elsewhere.
You know XGBoost, but do you know NGBoost? I'd passed over this one, mentioned to me by someone wanting confidence intervals in their classification models. This could be an interesting paper to add to the ML curriculum.
GPT-4 System Card. OpenAI. March 23, 2023.
denote dimensions 0 through i − 1 of the state
Very odd/interesting! dimensions are independent but we are doing them in order?
τ<t to denote a trajectory from timesteps 0 through t − 1
τ<t is shorthand for all the previous states and actions (s_{t-1}, a_{t-1}, etc.).
lower-diagonal attention mask
why lower-diagonal?
Transformer architectures feature a "causal" attention mask to ensure that predictions only depend on previous tokens in a sequence
Causal is in quotes here for a good reason. It is called a causal attention mask in the LLM literature, but it has only to do with the probability of the next token/word; it isn't attached to the meaning of the words at all.
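What the mask actually is, in a minimal sketch: a lower-triangular matrix that blocks attention to future tokens.

```python
import numpy as np

def causal_mask(n_tokens):
    """Lower-triangular mask: token i may attend only to tokens j <= i."""
    return np.tril(np.ones((n_tokens, n_tokens), dtype=bool))

# Applied to attention scores: disallowed positions get -inf before the softmax
scores = np.random.randn(4, 4)
scores[~causal_mask(4)] = -np.inf
```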
We can use this directly as a goal-reaching method by conditioning on a desired final state sT .
Interesting: goal-directed RL cast as a sequence of samples from conditional probabilities.
If we set the predicted sequence length to be the action dimension, our approach corresponds exactly to the simplest form of behavior cloning with an autoregressive policy
Why is that? Because the sample from the actions will be a proper sample? Why would the sequence length ever be larger, then?
\(P_\theta(\cdot \mid x)\)
Where does the distribution come from initially? Is it empirical?
Uniform discretization has the advantage that it retains information about Euclidean distance in the original continuous space, which may be more reflective of the structure of a problem than the training data distribution.
Always important to consider whether the relative magnitudes between points are important.
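A quick sketch of the two options (uniform bins vs. quantile bins), to see why uniform bins preserve Euclidean structure while quantile bins follow the data distribution (the data here is synthetic):

```python
import numpy as np

values = np.random.randn(1000)          # one continuous state/action dimension
n_bins = 100

# Uniform: equal-width bins, so bin-index differences mirror Euclidean distance
uniform_edges = np.linspace(values.min(), values.max(), n_bins + 1)
uniform_tokens = np.digitize(values, uniform_edges[1:-1])

# Quantile: equal-mass bins, shaped by the training data distribution instead
quantile_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
quantile_tokens = np.digitize(values, quantile_edges[1:-1])
```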
modeling considerations are concerned less with architecture design and more with how to represent trajectory data – potentially consisting of continuous states and actions – for processing by a discrete-token architecture
They don't care what kind of transformer is being used, they are interested in how to get SASASASA into the right form.
good question: what about continuous states and/or actions?
Concurrently with our work, Chen et al. (2021) also proposed an RL approach centered around sequence prediction, focusing on reward conditioning as opposed to the beam-search-based planning used by the Trajectory Transformer.
This is the Decision Transformer paper we read last week
Modeling the states and actions jointly already provides a bias toward generating in-distribution actions, which avoids the need for explicit pessimism
Pessimism is a popular method to avoid the learned dynamics being exploited outside of what was seen in the data. Since the transformer models states and actions jointly over a large context, this isn't needed; the predictions stay tied to the same kinds of situations as in the training data.
model-based RL
learn the dynamics, then optimize via RL
estimate conditional distributions over actions
policy as a distribution over actions
While such works demonstrate the importance of such models for representing memory (Oh et al., 2016), they still rely on standard RL algorithmic advances to improve performance
is the sequence modeling for just learning the model or is it deeper?
The Trajectory Transformer is a substantially more reliable long-horizon predictor than conventional dynamics models
So the TT becomes a new type of model based RL
When decoded with a modified beam search procedure that biases trajectory samples according to their cumulative reward,
so beam search is just a decoder of the learned dynamics that optimizes for reward?
Reading this one on Nov 27, 2023 for the reading group.
K = 50 for Pong, K = 30 for others
**Q:** where did these numbers come from?
loss = mean((a_preds - a)**2)
supervised learning for RL task
We feed the last K timesteps into Decision Transformer, for a total of 3K tokens (one for each modality: return-to-go, state, or action)
Data:
- K timesteps with three tokens per timestep: a return-to-go token, a state token, and an action token
- a token embedding for each token: a linear (or convolutional) layer to learn, then normalize
- a timestep embedding: an embedding of the time index itself (adjusting for the 3x tokens?)
- question: added or concatenated? Is the timestep embedding applied to the raw tokens or to the embedding? (See the sketch below.)
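A hedged sketch of how those 3K tokens might be assembled (dimensions and layer names are illustrative; in the paper the learned timestep embedding is added to each modality's token embedding, not concatenated):

```python
import torch
import torch.nn as nn

# Illustrative dimensions, not from the paper
state_dim, act_dim, h = 17, 6, 128

embed_rtg   = nn.Linear(1, h)
embed_state = nn.Linear(state_dim, h)
embed_act   = nn.Linear(act_dim, h)
embed_time  = nn.Embedding(1000, h)      # one embedding per environment timestep
layer_norm  = nn.LayerNorm(h)

def build_tokens(rtgs, states, actions, timesteps):
    """rtgs: (K,1), states: (K,state_dim), actions: (K,act_dim), timesteps: (K,) long."""
    t = embed_time(timesteps)            # (K, h), added to every modality below
    r = embed_rtg(rtgs) + t
    s = embed_state(states) + t
    a = embed_act(actions) + t
    # Interleave as (R1, S1, A1, R2, S2, A2, ...) -> (3K, h)
    tokens = torch.stack([r, s, a], dim=1).reshape(3 * rtgs.shape[0], h)
    return layer_norm(tokens)
```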
Does Decision Transformer perform behavior cloning on a subset of the data?
good questions
we use the GPT architecture [9], which modifies the transformer architecture with a causal self-attention mask to enable autoregressive generation, replacing the summation/softmax over the n tokens with only the previous tokens in the sequence (j ∈ [1, i]).
This sentence is working hard.
this allows the layer to assign "credit" by implicitly forming state-return associations via similarity of the query and key vectors (maximizing the dot product)
that's a different way of thinking about what's happening in a transformer.
We then use a similar QA summarization framework as Wu et al. (2023) which produces QA dialogue on game mechanics
Q: what was the main focus of this paper?
A: "Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals"
Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual
LaTeX source code
Q: why are they using the source code and not the text output?
all prior works require expert or human generated example trajectories
Prior works require expert- or human-generated example trajectories for training or prompting the LLMs.
Wu et al. (2023) proposes a summary (Read) and reasoning (Reward) through a QA prompting framework with an open-source QA LLM Tafjord and Clark (2021). The framework demonstrates the possibility of using real-world human-written manuals to improve RL performance on popular games, despite limiting the interaction types to only "hit". Our framework handles all 17 kinds of interactions available in the game. Moreover, our framework makes use of information on tech-tree dependencies, and suggestions on desired policies extracted from the academic paper
Main paper they are based on.
Indicate their priority out of 5
Q: Where does "priority" even come from for the LLM for a domain like this? What prior knowledge and biases are built in here?
The visual descriptor takes the last two gameplay screens as input, and outputs their descriptions in language (d_t, d_{t−1})
Q: so does the language it uses internally keep changing?
Answer to the final question q_a is mapped to environment action using sub-string matching.
Q: is this explained in more detail anywhere?
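It probably amounts to something like this simple sketch (the action names are illustrative, not quoted from the paper):

```python
ACTIONS = ["move_left", "move_right", "do", "sleep", "place_stone", "make_wood_pickaxe"]

def answer_to_action(answer: str) -> str:
    """Map a free-text LLM answer to an environment action by sub-string matching."""
    answer = answer.lower().replace(" ", "_")
    for action in ACTIONS:
        if action in answer:
            return action
    return "noop"   # fallback when no known action name appears in the answer
```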
Experimentally, we find that prompting the LLM with only the direct parents of a question greatly reduces the context length, and helps LLM to focus on the most relevant contextual information
Interesting: What is being given up here? You need to cut or summarize context at some point for sure. But when?
model-based methods like DreamerV2 Hafner et al. (2020); DreamerV3 Hafner et al. (2023)
Summary: how do these methods work?
We add the prompt "DO NOT answer in LaTeX." to all of Q_game to prevent the LLM from outputting the list in LaTeX format
Does GPT-3.5 understand LaTeX that well?
in an environment where control tasks are less required
Q: what do they mean by this?
zero-shot LLM-based (GPT-4) policy
What does "zero-shot" mean when it involves an LLM?
we promote and regulate in-context chain-of-thought reasoning in LLMs to solve complex games. The reasoning module is a directed acyclic graph (DAG), with questions as nodes and dependencies as edges. For example, the question "For each action, are the requirements met?" depends on the question "What are the top 5 actions?", creating an edge from the latter to the former. For each environment step, we traverse the DAG computing LLM answers for each node in the topological order of the graph. The final node of the DAG is a question about the best action to take and the LLM answer for the question is directly translated to environment action
seems sensible
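A minimal sketch of that traversal using the standard-library topological sort (the question text, the tiny DAG, and `ask_llm` are illustrative stand-ins, not from the paper):

```python
from graphlib import TopologicalSorter

# Each key maps a question to the set of questions it depends on (its parents)
dag = {
    "Are the requirements met for each action?": {"What are the top 5 actions?"},
    "What is the best action to take?": {"Are the requirements met for each action?"},
}

def ask_llm(question, context):
    """Stand-in for the actual LLM call; context holds answers to parent questions."""
    raise NotImplementedError

answers = {}
for question in TopologicalSorter(dag).static_order():
    context = {dep: answers[dep] for dep in dag.get(question, ())}
    answers[question] = ask_llm(question, context)

final_action = answers["What is the best action to take?"]
```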
deciding the paragraphs that are relevant for playing the game
this could be very subjective
the environment is OOD to them.
Translation: the Crafter game is too new for GPT to know about
In a nutshell, the CHT seems to disprove the scaling hypothesis. Or does it? In this work, we argue that foundation models might be exploiting a "loop hole" in the CHT. Namely, what happens if the causal assumptions (which are required, by the CHT, for causal inference) are represented in observational data itself?
Are LLMs exploiting a loophole in Pearl's ladder?
It's not really a loophole; it's just that the observational dataset explicitly contains answers to your interventional queries.
Plato. Republic: Allegory of the cave, 375 BC
ok, you win.
Same Implication, Different Representations
Big Question: they cover text and experiment, but what about embodied experience? What is its role? We believe in causality for very visceral (i.e. physical and unavoidable) reasons as human beings.
E.g. we touch a hot stove and then it hurts.
we expect \(P(Y_{X\leftarrow 1} = 1) = P(Y = 1)\) since intervening on X will not change Y
Q: is that correct? wouldn't you need to show the \(X\leftarrow 0\) case to demonstrate this?
the probability of a high number of Nobel laureates if the given chocolate consumption were to be high.
example of an L2 interventional query.
Q: For this query \(P(Y_{X\leftarrow 1}=1)\), wouldn't the more correct English translation be:
"The probability of having a high number of Nobel laureates if high chocolate consumption was made mandatory."
We call these concepts 'meta' since they are one level above 'regular', simple SCM in the sense that they encode information about answering causal questions in another SCM.
keep reading this sentence until it makes sense...or argue why it doesn't make sense
More intriguingly, it does not matter where that L2 fact comes from since the formulation is independent of whether the model learns the fact and simply requires that the model knows about the fact. We state our second key insight as
good point to remember, we don't need to learn everything, some knowledge can be encoded directly, a priori.
Example 1 serves to show how the rather abstract definition of an SCM can be made tangible to communicate what we believe about our observed data and more so the underlying data generating process.
Does everyone agree that it's crystal clear now? (maybe not...)
The Pearl’s Causal Hierarchy
An important theoretical framework to read up on if you aren't familiar with it.
It is clear how the observed correlation in this case corresponds to a direct causation according to
We should draw these models out
These models are castles in the air. They have no foundations whatsoever." discrediting the models for lacking any identifiable notion to causality.
discussion: Do we really need to just pick one of these options?
Our explanation for this is that they are not only 'stochastic parrots' as already suggested by Bender et al. (2021) but sometimes also 'causal parrots' since they will also encounter correlations over causal facts during training in their vast oceans of textual data.
Q: what was Bender's argument exactly?
parameterized variants of SCMs such as the neural ones presented in (Xia et al., 2021)
to read: this sounds interesting
meta SCM
Q: definition needed
However, this conclusion is arguably nothing new, as most people would agree, and this is partly so because such obtained knowledge has been embedded as textual articles into encyclopedias such as Wikipedia, which are freely accessible
Bit strange: this sounds like they are saying people know this because of wikipedia, rather than from lived experience.
\(\mathbb{P}_{\mathbf{E}}\) denotes the exogenous distribution
Q: Can we get a definition of this?
to our real world intuition since there is a bidirected edge X ↔ Y ∈ G(M2) with E3 being the underlying confounder
**Intuition:** whatever explains GDP, we call E3; that also explains X and Y.
The following block paragraph serves as a summary
question: where does this paragraph come from? who wrote it?
we take the former perspective pro causal AI/ML. We argue that the questions around causality can fuel research also on questions of recent debates such as how much 'real' progress towards AGI has been made since the advent of large scale models
I would agree with this stance!
countering opinions start to speak out against causal AI/ML (Bishop, 2021)
Should we read this paper as well? Is there an updated paper or opinion piece from these researchers about why causal AI/ML isn't needed?
Introduction of RoBERTa, an improved analysis and training approach for BERT NLP models.
(Chen, NeurIPS, 2021) Chen, Lu, Rajeswaran, Lee, Grover, Laskin, Abbeel, Srinivas, and Mordatch. "Decision Transformer: Reinforcement Learning via Sequence Modeling". arXiv preprint arXiv:2106.01345v2, June, 2021.
Quickly became a very influential paper, with a new idea of how to learn generative models of action prediction by sequence modeling over (return, state, action) demonstration trajectories. No optimization of actions or rewards, but the target return is an input.
Kallus, N. (2020). DeepMatch: Balancing deep covariate representations for causal inference using adversarial training. In I. H. Daumé, & A. Singh (Eds.), Proceedings of the 37th international conference on machine learning. In Proceedings of Machine Learning Research: vol. 119 (pp. 5067–5077). PMLR
Using adversarial deep learning approaches to get a better correction for causal inference from observational data.
"Causal Deep Learning" Authors:Jeroen Berrevoets, Krzysztof Kacprzyk, Zhaozhi Qian, Mihaela van der Schaar
Very general and ambitious approach for representing the full continuous conceptual spectrum of Pearl's Causal Ladder, and the ability to model and learn parts of it from data.
Discussed by Prof. van der Schaar at the ICML 2023 workshop on counterfactuals.
Performing optimization in the latent space can more flexibly model underlying data distributions than mechanistic approaches in the original hypothesis space. However, extrapolative prediction in sparsely explored regions of the hypothesis space can be poor. In many scientific disciplines, hypothesis spaces can be vastly larger than what can be examined through experimentation. For instance, it is estimated that there are approximately \(10^{60}\) molecules, whereas even the largest chemical libraries contain fewer than \(10^{10}\) molecules [12, 159]. Therefore, there is a pressing need for methods to efficiently search through and identify high-quality candidate solutions in these largely unexplored regions.
Question: how does this notion of hypothesis space relate to causal inference and reasoning?
Wang et al. "Scientific discovery in the age of artificial intelligence", Nature, 2023.
A paper about the current state of using AI/ML for scientific discovery, connected with the AI4Science workshops at major conferences.
(NOTE: since Springer/Nature don't allow public pdfs to be linked without a paywall, we can't use Hypothesis directly on the pdf of the paper; this link is to the website version, which is what we'll use to guide discussion during the reading group.)
Petersen, B. K. et al. Deep symbolic regression: recovering mathematical expressions from data via risk-seeking policy gradients. In International Conference on Learning Representations (2020).
Description: Reinforcement learning uses neural networks to generate a mathematical expression sequentially by adding mathematical symbols from a predefined vocabulary and using the learned policy to decide which notation symbol to add next. The mathematical formula is represented as a parse tree. The learned policy takes the parse tree as input to determine what leaf node to expand and what notation (from the vocabulary) to add.
[Bengio, The Consciousness Prior, arXiv, 2018]
(Cousineau, Verter, Murphy and Pineau, 2023) "Estimating causal effects with optimization-based methods: A review and empirical comparison"
To avoid such bias, a fundamental aspect in the research design of studies of causal inference is the identification strategy: a clear definition of the sources of variation in the data that can be used to estimate the causal effect of interest.
To avoid making false conclusions, studies must identify all the sources of variation. Is this even possible in most cases?
Matching: This approach seeks to replicate a balanced experimental design using observational data by finding close matches between pairs or groups of units and separating out the ones that received a specified treatment from those that did not, thus defining the control groups.
Matching approach to dealing with sampling bias. Basically, use some intrinsic (or other) metric about the situations to cluster them so that "similar" situations will be dealt with similarly; then analysis is carried out on those clusters. The number of clusters has to be defined, and some method like k-means is often used. Depends a lot on the similarity metric, the clustering approach, and other assumptions (see the sketch below).
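A toy sketch of covariate matching (nearest-neighbour flavour rather than the k-means clustering mentioned above; purely illustrative, not from the review):

```python
import numpy as np

def matched_treatment_effect(X, treated, outcome):
    """Match each treated unit to its nearest control on covariates X,
    then average the outcome differences (a crude effect-on-the-treated estimate)."""
    X_t, y_t = X[treated], outcome[treated]
    X_c, y_c = X[~treated], outcome[~treated]
    diffs = []
    for x, y in zip(X_t, y_t):
        j = np.argmin(np.linalg.norm(X_c - x, axis=1))   # closest control unit
        diffs.append(y - y_c[j])
    return float(np.mean(diffs))
```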
Terwiesch, 2022 - "A review of Empircal Operations Managment over the Last Two Decades" Listed as an important review of methods for addressing biases in Operations management by explicitly addressing causality.
"Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning" Yuejiang Liu1, 2,* YUEJIANG.LIU@EPFL.CH Alexandre Alahi2 ALEXANDRE.ALAHI@EPFL.CH Chris Russell1 CMRUSS@AMAZON.DE Max Horn1 HORNMAX@AMAZON.DE Dominik Zietlow1 ZIETLD@AMAZON.DE Bernhard Sch ̈olkopf1, 3 BS@TUEBINGEN.MPG.DE Francesco Locatello1 LOCATELF@AMAZON.DE
Shayan Shirahmad Gale Bagi, Zahra Gharaee, Oliver Schulte, and Mark Crowley Generative Causal Representation Learning for Out-of-Distribution Motion Forecasting In International Conference on Machine Learning (ICML). Honolulu, Hawaii, USA. Jul, 2023.
Wu, Prabhumoye, Yeon Min, Bisk, Salakhutdinov, Azaria, Mitchell and Li. "SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning". arXiv preprint arXiv:2305.15486v2, May, 2023.