our GTPO hybrid advantage formulation eliminates the advantage misalignment problem
Most people assume that computing and optimizing the advantage function in reinforcement learning is a relatively straightforward process, but the authors point out an "advantage misalignment problem" and propose the GTPO hybrid advantage formulation to solve it. This challenges a basic assumption in reinforcement learning, showing that even a core concept like the advantage function needs careful design to work well in multi-turn tasks.
Hence, in a translanguaging classroom, one language is used to reinforce performance in other languages, and students learn many new words from each other, which enriches their vocabulary.
Finding: Reinforcement across languages enriches vocabulary and performance (p. 9). Why it matters: Ties CUP to observed learning gains: useful for my analysis.
Richard Sutton - YouTube interview
Summary: an interesting talk on learning; reminds me of Michael Levin's work. The priority is on goal-directed activity.
Playing Atari with Deep Reinforcement Learning 19 Dec 2013 · Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller
The paper from 2013 that introduced the DQN algorithm, combining deep learning with reinforcement learning to play Atari games.
Welcome to the Era of Experience David Silver, Richard S. Sutton
"This is a preprint of a chapter that will appear in the book Designing an Intelligence, published by MIT Press"
MAPPING SOCIAL CHOICE THEORY TO RLHF Jessica Dai and Eve Fleisig ICLR Workshop on Reliable and Responsible Foundation Models 2024
Nice overview of how social choice theory has been connected to RLHF and AI alignment ideas.
Most contemporary implementations of Monte Carlo tree search are based on some variant of UCT
The UCB algorithm for bandits comes back again as UCT to form the basis for model estimation via MCTS
The main difficulty in selecting child nodes is maintaining some balance between the exploitation of deep variants after moves with high average win rate and the exploration of moves with few simulations.
Tree search makes this tradeoff very clear, how many paths will you explore before you stop and use the knowledge you already have?
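To make that tradeoff concrete, here is a minimal sketch of the UCB1 selection rule as used in UCT. The exploration constant `c` and the dict-based node representation are my own illustration, not from any particular MCTS implementation:

```python
import math

def ucb1_score(value_sum, visits, parent_visits, c=1.414):
    """UCT score for one child: average value (exploitation) plus an
    exploration bonus that shrinks as the child gets more visits."""
    if visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits):
    """Pick the child with the highest UCT score."""
    return max(children,
               key=lambda ch: ucb1_score(ch["value"], ch["visits"],
                                         parent_visits))

# A well-tried child with a decent average vs. a barely-tried one:
children = [{"value": 5.0, "visits": 10}, {"value": 1.0, "visits": 1}]
chosen = select_child(children, parent_visits=11)
```

With few simulations the bonus term dominates and the rarely-visited child is chosen; as visit counts grow, the average win rate takes over.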
The summary paper for AlphaGo.
Wikipedia: AlphaZero
2024 paper arguing that other methods beyond PPO could be better for "value alignment" of LLMs
Paper "Deep Reinforcement Learning that Matters" on evaluating RL algorithms.
T. Herlau, "Moral Reinforcement Learning Using Actual Causation," 2022 2nd International Conference on Computer, Control and Robotics (ICCCR), Shanghai, China, 2022, pp. 179-185, doi: 10.1109/ICCCR54399.2022.9790262. keywords: {Digital control;Ethics;Costs;Philosophical considerations;Toy manufacturing industry;Reinforcement learning;Forestry;Causality;Reinforcement learning;Actual Causation;Ethical reinforcement learning}
Can model-free reinforcement learning explain deontological moral judgments? Alisabeth Ayars, University of Arizona, Dept. of Psychology, Tucson, AZ, USA
Reading this one on Nov 27, 2023 for the reading group.
(Chen, NeurIPS, 2021) Chen, Lu, Rajeswaran, Lee, Grover, Laskin, Abbeel, Srinivas, and Mordatch. "Decision Transformer: Reinforcement Learning via Sequence Modeling". arXiv preprint arXiv:2106.01345v2, June, 2021.
Quickly a very influential paper, with a new idea: learn generative models of action prediction by training on (state, action, reward) sequences from demonstration trajectories. There is no optimization of actions or rewards; instead, the target return is given as an input.
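As a small illustration of the "target reward is an input" idea, the conditioning token fed to the model at each timestep is the return-to-go, i.e. the suffix sum of the episode's rewards. A minimal sketch (my own illustration, not the authors' code):

```python
def returns_to_go(rewards):
    """Suffix sums of the reward sequence: the 'return-to-go' value
    that conditions the Decision Transformer at each timestep."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]

# Example: rewards [1, 0, 2] give returns-to-go [3, 2, 2].
print(returns_to_go([1.0, 0.0, 2.0]))
```

At inference time you simply set the first return-to-go token to the return you *want*, and the model generates actions consistent with achieving it.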
Wu, Prabhumoye, Yeon Min, Bisk, Salakhutdinov, Azaria, Mitchell and Li. "SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning". arXiv preprint arXiv:2305.15486v2, May, 2023.
Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training.
Them's fightin' words!
I haven't read it yet, but we're putting it on the list for this fall's reading group. Seriously, a strong result with a very strong implied claim. They are careful to say it comes from their empirical results; very worth a look. I suspect the amount of implicit knowledge in the papers, text, and DAG is helping to do this.
The big question: is their comparison to RL baselines fair? Are those baselines trained from scratch? And what does a fair comparison of any from-scratch model (RL or supervised) even mean against an LLM approach (or any approach using a foundation model), when that model is not really starting from scratch?
Training language models to follow instructions with human feedback
Original paper for discussion of the Reinforcement Learning from Human Feedback (RLHF) algorithm.
[Kapturowski, DeepMind, Sep 2022] "Human-level Atari 200x Faster"
Improving on the 2020 Agent57 performance to be more efficient.
Adaptive Stress Testing with Reward Augmentation for Autonomous Vehicle Validation
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
(Espeholt, ICML, 2018) "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures"
This paper introduced the DPG Algorithm
Link to page with information about the paper: https://openreview.net/forum?id=rJeXCo0cYX
Yann LeCun released his vision for the future of Artificial Intelligence research in 2022, and it sounds a lot like Reinforcement Learning.
Paper that evaluated the existing Double Q-Learning algorithm on the new DQN approach and validated that it is very effective in the Deep RL realm.
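For reference, the key change in Double DQN is to decouple action *selection* (done by the online network with parameters \(\theta_t\)) from action *evaluation* (done by the target network with parameters \(\theta_t^-\)), which reduces the overestimation bias of the plain DQN max operator. Written from memory of the paper's notation:

```latex
y_t = r_{t+1} + \gamma \, Q\!\big(s_{t+1},\, \operatorname*{arg\,max}_{a} Q(s_{t+1}, a;\, \theta_t);\ \theta_t^{-}\big)
```

Plain DQN instead uses \(\max_a Q(s_{t+1}, a; \theta_t^-)\) for both roles, which systematically overestimates values under noise.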
This paper introduces the DDPG algorithm, which builds on the existing DPG algorithm from classic RL theory. The main idea is to define a deterministic policy, or nearly deterministic one, for situations where the environment is very sensitive to suboptimal actions and one action setting usually dominates in each state. This showed good performance, but could not beat algorithms such as PPO until the ideas behind SAC were added. SAC adds an entropy term to the objective that rewards policy uncertainty, encouraging exploration where the policy can afford it. With this, the deterministic policy gradient approach performs well.
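For reference, the maximum-entropy objective that SAC optimizes augments the expected return with a policy-entropy term weighted by a temperature \(\alpha\) (notation reconstructed from memory of the SAC paper):

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \big[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\big]
```

Setting \(\alpha = 0\) recovers the standard RL objective; larger \(\alpha\) trades off reward for exploration.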
This famous paper gives a great review of the DQN algorithm a couple years after it changed everything in Deep RL. It compares six different extensions to DQN for Deep Reinforcement Learning, many of which have now become standard additions to DQN and other Deep RL algorithms. It also combines all of them together to produce the "rainbow" algorithm, which outperformed many other models for a while.
Arxiv paper from 2021 on reinforcement learning in a scenario where your aim is to learn a workable POMDP policy, but you start with a fully observable MDP and adjust it over time towards a POMDP.
Paper that introduced the PPO algorithm. PPO is, in a way, a response to the TRPO algorithm, trying to use the core idea but implement a more efficient and simpler algorithm.
TRPO defines the update as a straight constrained optimization problem; no learning, in the usual incremental sense, is actually involved.
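PPO replaces TRPO's hard trust-region constraint with a clipped surrogate objective, which is the "simpler and more efficient" implementation of the same core idea. From memory of the PPO paper's notation:

```latex
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

The clip keeps the probability ratio \(r_t(\theta)\) near 1, discouraging updates that move the policy too far from the one that collected the data.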
Bowen Baker et al. (OpenAI). "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos". arXiv, June 2022.
Introduction of VPT: a new semi-supervised pre-trained model for sequential decision making in Minecraft. Data are from human video playthroughs but are unlabelled.
Liang, Machado, Talvitie, Bowling - AAMAS 2016 "State of the Art Control of Atari Games Using Shallow Reinforcement Learning"
Response paper to DQN showing that well designed Value Function Approximations can also do well at these complex tasks without the use of Deep Learning
A great paper showing how to think differently about the latest advances in Deep RL. All is not always what it seems!
Tom Schaul, John Quan, Ioannis Antonoglou and David Silver. "PRIORITIZED EXPERIENCE REPLAY", ICLR, 2016.
LeBlanc, D. G., & Lee, G. (2021). General Deep Reinforcement Learning in NES Games. Canadian AI 2021. Canadian Artificial Intelligence Association (CAIAC). https://doi.org/10.21428/594757db.8472938b
reinforcement-learning foundation-models pretrained-models proj-minerl minecraft
If my patient notes don’t include a question I haven’t yet asked, ChatGPT’s output will encourage me to keep missing that question. Like with my young female patient who didn’t know she was pregnant. If a possible ectopic pregnancy had not immediately occurred to me, ChatGPT would have kept enforcing that omission, only reflecting back to me the things I thought were obvious — enthusiastically validating my bias like the world’s most dangerous yes-man.
Things missing from a prompt will not appear in the output. This can reinforce one's own blind spots and omissions, lowering the probability of an intuitive leap to other possibilities. The machine helps you search under the light you switched on with your prompt, regardless of whether you're searching in the right place.
asks for the Minecraft domain.
They demonstrate the model on a "minecraft-like" domain (introduced earlier by someone else) where there are resources in the world and the agent has tasks.
Definition 3.2 (simple reward machine).
The MDP does not change; its dynamics are the same with or without the RM, just as with or without a standard reward model. Additionally, the rewards from the RM can be non-Markovian with respect to the MDP because they inherently have a kind of memory of where you've been, limited to the agent's "movement" (almost "in its mind") along the goals for this task.
We then show that an RM can be interpreted as specifying a single reward function over a larger state space, and consider types of reward functions that can be expressed using RMs
So by specifying a reward machine you are augmenting the state space of the MDP with higher level goals/subgoals/concepts that provide structure about what is good and what isn't.
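To make that concrete, here is a minimal sketch of a simple reward machine as a finite-state machine over abstract events. The class name, state names, and the wood/workbench task are my own illustration, not the paper's code:

```python
class SimpleRewardMachine:
    """A simple reward machine: a finite-state machine over abstract
    events. `delta` maps (rm_state, event) to (next_rm_state, reward);
    events with no transition leave the state unchanged with reward 0."""

    def __init__(self, initial_state, delta):
        self.state = initial_state
        self.delta = delta

    def step(self, event):
        next_state, reward = self.delta.get((self.state, event),
                                            (self.state, 0.0))
        self.state = next_state
        return reward

# Illustrative task: "get wood, then reach the workbench". The reward
# on reaching the bench depends on history, so it is non-Markovian in
# the raw MDP state -- but Markovian in (mdp_state, rm.state).
rm = SimpleRewardMachine("u0", {
    ("u0", "got_wood"): ("u1", 0.0),
    ("u1", "at_bench"): ("u_done", 1.0),
})
```

Running the cross-product of the MDP state with `rm.state` is exactly the "larger state space" interpretation: the RM state carries the memory of which subgoals are done.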
However, an agent that had access to the specification of the reward function might be able to use such information to learn optimal policies faster.
Fascinating idea, why not? Why are we hiding the reward from the agent really?
Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning
[Icarte, JAIR, 2022] "Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning"
Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning
[Icarte, PMLR, 2018] "Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning"
Lee et. al. - NeurIPS 2022 "Multi-Game Decision Transformers"
[Neumann, Gros, NeurIPS, 2022] - "SCALING LAWS FOR A MULTI-AGENT REINFORCEMENT LEARNING MODEL"
We study whether sequence modeling can perform policy optimization by evaluating Decision Transformer on offline RL benchmarks
AAAI 2022 Paper : Decentralized Mean Field Games Happy to discuss online.
S. Ganapathi Subramanian, M. Taylor, M. Crowley, and P. Poupart, "Decentralized mean field games," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-2022), vol. 36, pp. 9439–9447, February 2022.
A recent overview of RL methods used for autonomous driving.
Discussion on
Bellinger C, Drozdyuk A, Crowley M, Tamblyn I. Balancing Information with Observation Costs in Deep Reinforcement Learning. Proceedings of the Canadian Conference on Artificial Intelligence [Internet]. 2022 May 27; Available from: https://caiac.pubpub.org/pub/0jmy7gpd
Another piece to the "what can we do with eligibility traces" puzzle for Deep RL.
Question: What happened to Eligibility Traces in the Deep RL era? This paper highlights some of the reasons they are not used widely and proposes a way they could still be effective.
Hypothesis page to discuss this high level description of DeepMind's new Gato framework.
The paper that introduced the MineRL challenge dataset.
reinforcement
"Reinforcement means to the act of reinforcing."
Palminteri, S. (2021). Choice-confirmation bias and gradual perseveration in human reinforcement learning [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/dpqj6
Chadi, M.-A., & Mousannif, H. (2021). Reinforcement Learning Based Decision Support Tool For Epidemic Control [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/tcr8s
Using chemicals to improve our economy of attention and become emotionally "fitter" is an option that penetrated public consciousness some time ago.
Same is true of reinforcement learning algorithms.
Ozaita, J., Baronchelli, A., & Sánchez, A. (2020). The emergence of segregation: From observable markers to group specific norms. ArXiv:2009.05354 [Physics, q-Bio]. http://arxiv.org/abs/2009.05354
Ludwig, V. U., Brown, K. W., & Brewer, J. A. (2020). Self-Regulation Without Force: Can Awareness Leverage Reward to Drive Behavior Change? Perspectives on Psychological Science, 1745691620931460. https://doi.org/10.1177/1745691620931460
Harvey, A., Armstrong, C. C., Callaway, C. A., Gumport, N. B., & Gasperetti, C. E. (2020). COVID-19 Prevention via the Science of Habit Formation [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/57jyg
Radulescu, A., Holmes, K., & Niv, Y. (2020). On the convergent validity of risk sensitivity measures [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/qdhx4
Hertz, U. (2020). Cognitive learning processes account for asymmetries in adaptations to new social norms [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/7thku
Liu, L., Wang, X., Tang, S., & Zheng, Z. (2020). Complex social contagion induces bistability on multiplex networks. ArXiv:2005.00664 [Physics]. http://arxiv.org/abs/2005.00664
Ting, C., Palminteri, S., Lebreton, M., & Engelmann, J. B. (2020, March 25). The elusive effects of incidental anxiety on reinforcement-learning. https://doi.org/10.31234/osf.io/7d4tc
A survey of deep reinforcement learning
reinforcement-learning code and paper tutorials
We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including template-based models, bag-of-words models, sequence-to-sequence neural network and latent variable neural network models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than many competing systems. Due to its machine learning architecture, the system is likely to improve with additional data
Think of all the hard work and the sweat you put into the things that you're proudest of.
Always feels good to say, "I worked out today!"