- Jul 2024
-
en.wikipedia.org en.wikipedia.org
-
Most contemporary implementations of Monte Carlo tree search are based on some variant of UCT.
The UCB algorithm for bandits comes back again as UCT to form the basis for model estimation via MCTS.
-
The main difficulty in selecting child nodes is maintaining some balance between the exploitation of deep variants after moves with high average win rate and the exploration of moves with few simulations.
Tree search makes this tradeoff very clear: how many paths will you explore before you stop and use the knowledge you already have?
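To make the exploration/exploitation balance concrete, here is a minimal sketch of UCT child selection (UCB1 applied at a tree node). The function names, the `(wins, visits)` representation and the default constant `c = sqrt(2)` are illustrative assumptions, not taken from the quoted article.

```python
import math


def uct_score(child_wins, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score for one child: average win rate plus an exploration bonus."""
    if child_visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = child_wins / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore


def select_child(children, parent_visits, c=math.sqrt(2)):
    """children is a list of (wins, visits) pairs; returns the index to descend into."""
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], parent_visits, c),
    )


# A well-explored strong move versus a barely explored weaker one.
children = [(9, 10), (1, 2)]
print(select_child(children, parent_visits=12))         # exploration bonus included
print(select_child(children, parent_visits=12, c=0.0))  # pure exploitation
```

With `c = 0` the selection is greedy on win rate; a larger `c` shifts visits toward moves with few simulations.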
-
-
-
The summary paper for AlphaGo.
-
-
en.wikipedia.org en.wikipedia.org
-
Wikipedia: AlphaZero
-
-
arxiv.org arxiv.org
-
2024 paper arguing that other methods beyond PPO could be better for "value alignment" of LLMs
-
-
arxiv.org arxiv.org
-
Paper "Deep Reinforcement Learning that Matters" on evaluating RL algorithms.
-
- Feb 2024
-
arxiv.org arxiv.org
-
T. Herlau, "Moral Reinforcement Learning Using Actual Causation," 2022 2nd International Conference on Computer, Control and Robotics (ICCCR), Shanghai, China, 2022, pp. 179-185, doi: 10.1109/ICCCR54399.2022.9790262. keywords: {Digital control;Ethics;Costs;Philosophical considerations;Toy manufacturing industry;Reinforcement learning;Forestry;Causality;Reinforcement learning;Actual Causation;Ethical reinforcement learning}
-
-
pdf.sciencedirectassets.com pdf.sciencedirectassets.com
-
Can model-free reinforcement learning explain deontological moral judgments? Alisabeth Ayars, University of Arizona, Dept. of Psychology, Tucson, AZ, USA
-
- Nov 2023
-
proceedings.mlr.press proceedings.mlr.press
-
Reading this one on Nov 27, 2023 for the reading group.
-
-
proceedings.neurips.cc proceedings.neurips.cc
-
Reading this one on Nov 27, 2023 for the reading group.
-
- Oct 2023
-
arxiv.org arxiv.org
-
(Chen, NeurIPS, 2021) Chen, Lu, Rajeswaran, Lee, Grover, Laskin, Abbeel, Srinivas, and Mordatch. "Decision Transformer: Reinforcement Learning via Sequence Modeling". arXiv preprint arXiv:2106.01345v2, June 2021.
Quickly became a very influential paper, with a new idea of how to learn generative models of action prediction from demonstration trajectories, trained on SARSA-style sequences of states, actions and rewards. No optimization of actions or rewards, but the target reward is an input.
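A small sketch of what "target reward is an input" looks like in practice: each timestep is paired with its return-to-go, the undiscounted sum of rewards from that step onward. The helper names and the toy trajectory are hypothetical; the actual model consumes these triples as embedded tokens, which is not shown here.

```python
def returns_to_go(rewards):
    """Undiscounted sum of rewards from each timestep to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


def to_sequence(states, actions, rewards):
    """Pair each step with its return-to-go: (return_to_go, state, action)."""
    rtg = returns_to_go(rewards)
    return list(zip(rtg, states, actions))


# Toy demonstration trajectory (made-up values).
seq = to_sequence(states=["s0", "s1", "s2"], actions=["a0", "a1", "a2"], rewards=[1, 0, 2])
print(seq)
```

At test time the first return-to-go is set to the return you want, and the model is asked to predict the action that follows.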
-
-
arxiv.org arxiv.org
-
Wu, Prabhumoye, Yeon Min, Bisk, Salakhutdinov, Azaria, Mitchell and Li. "SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning". arXiv preprint arXiv:2305.15486v2, May 2023.
-
Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training.
Them's fighten' words!
I haven't read it yet, but we're putting it on the list for this fall's reading group. Seriously, a strong result with a very strong implied claim. They are careful to say it's from their empirical results; very worth a look. I suspect that the amount of implicit knowledge in the papers, text and DAG is helping to do this.
The Big Question: is their comparison to RL baselines fair? Are they being trained from scratch? What does a fair comparison of any from-scratch model (RL or supervised) mean when compared to an LLM approach (or any approach using a foundation model), when that model is not really from scratch?
-
-
arxiv.org arxiv.org
-
Training language models to follow instructions with human feedback
Original paper for discussion of the Reinforcement Learning from Human Feedback (RLHF) algorithm.
-
-
arxiv.org arxiv.org
-
[Kapturowski, DeepMind, Sep 2022] "Human-level Atari 200x Faster"
Improving the 2020 Agent57 performance to be more efficient.
-
- Sep 2023
-
arxiv.org arxiv.org
-
Adaptive Stress Testing with Reward Augmentation for Autonomous Vehicle Validation
-
- Jul 2023
-
proceedings.mlr.press proceedings.mlr.press
-
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
(Espeholt, ICML, 2018) "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures"
-
-
proceedings.mlr.press proceedings.mlr.press
-
This paper introduced the DPG algorithm.
-
-
openreview.net openreview.net
-
Link to page with information about the paper: https://openreview.net/forum?id=rJeXCo0cYX
-
-
openreview.net openreview.net
-
Yann LeCun released his vision for the future of Artificial Intelligence research in 2022, and it sounds a lot like Reinforcement Learning.
-
-
www.cs.toronto.edu www.cs.toronto.edu dqn.pdf
-
The paper that introduced the DQN algorithm for using Deep Learning with Reinforcement Learning to play Atari games.
-
-
arxiv.org arxiv.org
-
Paper that evaluated the existing Double Q-Learning algorithm on the new DQN approach and validated that it is very effective in the Deep RL realm.
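A rough sketch of the difference this makes to the bootstrap target: the online network picks the next action and the target network evaluates it, instead of one network doing both. Function and argument names are hypothetical, and plain lists stand in for network outputs.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Standard DQN target: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Made-up Q-values for two actions in the next state.
print(dqn_target(1.0, [0.5, 0.1], gamma=0.5))
print(double_dqn_target(1.0, [0.2, 0.9], [0.5, 0.1], gamma=0.5))
```

When the two networks disagree about the best action, the Double DQN target is never larger than the plain max, which is the mechanism behind the reduced overestimation.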
-
-
-
This paper introduces the DDPG algorithm, which builds on the existing DPG algorithm from classic RL theory. The main idea is to define a deterministic, or nearly deterministic, policy for situations where the environment is very sensitive to suboptimal actions and one action setting usually dominates in each state. This showed good performance, but could not beat algorithms such as PPO until the changes introduced by SAC. SAC adds an entropy term to the objective; in the standard formulation this is a bonus that rewards the policy for staying stochastic, rather than a penalty on uncertainty. With that addition, this family of off-policy actor-critic methods performs well.
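To make the entropy term concrete, here is a small sketch for a discrete policy. It assumes the maximum-entropy objective usually associated with SAC, where the entropy is added to the expected value with a positive weight; the function names and the temperature `alpha=0.2` are illustrative and not taken from either paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def soft_objective(probs, q_values, alpha=0.2):
    """Expected Q-value plus alpha times the policy entropy."""
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    return expected_q + alpha * entropy(probs)


q = [1.0, 0.9]  # two nearly equivalent actions (made-up values)
print(soft_objective([1.0, 0.0], q))  # deterministic policy
print(soft_objective([0.5, 0.5], q))  # uniform policy
```

With nearly tied Q-values the uniform policy scores higher under this objective, so the agent keeps trying both actions instead of committing early.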
-
-
arxiv.org arxiv.org
-
This famous paper gives a great review of the DQN algorithm a couple years after it changed everything in Deep RL. It compares six different extensions to DQN for Deep Reinforcement Learning, many of which have now become standard additions to DQN and other Deep RL algorithms. It also combines all of them together to produce the "rainbow" algorithm, which outperformed many other models for a while.
-
-
arxiv.org arxiv.org
-
Arxiv paper from 2021 on reinforcement learning in a scenario where your aim is to learn a workable POMDP policy, but you start with a fully observable MDP and adjust it over time towards a POMDP.
-
-
arxiv.org arxiv.org
-
Paper that introduced the PPO algorithm. PPO is, in a way, a response to the TRPO algorithm, trying to use the core idea but implement a more efficient and simpler algorithm.
TRPO defines the problem as a straight optimization problem; no learning is actually involved.
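A minimal per-sample sketch of the clipped surrogate objective, the best-known way PPO keeps TRPO's "don't move too far from the old policy" idea without a constrained optimization step. The function name is hypothetical and `epsilon=0.2` is just a commonly used value.

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate for one sample.

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimated advantage.
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


print(ppo_clip_objective(1.5, 1.0))   # gain from a large ratio is capped
print(ppo_clip_objective(1.5, -1.0))  # a bad action made more likely is not capped
```

The objective is maximized by plain gradient ascent over minibatches, which is where most of the simplicity relative to TRPO comes from.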
-
-
arxiv.org arxiv.org
-
Bowen Baker et al. (OpenAI) "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos" arXiv, June 2022.
Introduction of VPT: a new semi-supervised pre-trained model for sequential decision making in Minecraft. The data are from human video playthroughs but are unlabelled.
-
-
arxiv.org arxiv.org
-
Liang, Machado, Talvitie, Bowling - AAMAS 2016 "State of the Art Control of Atari Games Using Shallow Reinforcement Learning"
Response paper to DQN showing that well-designed value function approximations can also do well at these complex tasks without the use of Deep Learning.
A great paper showing how to think differently about the latest advances in Deep RL. All is not always what it seems!
-
-
arxiv.org arxiv.org
-
Tom Schaul, John Quan, Ioannis Antonoglou and David Silver. "Prioritized Experience Replay", ICLR, 2016.
-
- Jun 2023
-
www.fandm.edu www.fandm.edu
-
Liang, Machado, Talvitie, Bowling - AAMAS 2016 "State of the Art Control of Atari Games Using Shallow Reinforcement Learning"
A great paper showing how to think differently about the latest advances in Deep RL. All is not always what it seems!
-
-
assets.pubpub.org assets.pubpub.org
-
LeBlanc, D. G., & Lee, G. (2021). General Deep Reinforcement Learning in NES Games. Canadian AI 2021. Canadian Artificial Intelligence Association (CAIAC). https://doi.org/10.21428/594757db.8472938b
-
- Apr 2023
-
arxiv.org arxiv.org
-
Bowen Baker et al. (OpenAI) "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos" arXiv, June 2022.
New supervised pre-trained model for sequential decision making in Minecraft. The data are from human video playthroughs but are unlabelled.
reinforcement-learning foundation-models pretrained-models proj-minerl minecraft
-
- Mar 2023
-
arxiv.org arxiv.org
-
asks for the Minecraft domain.
They demonstrate the model on a "minecraft-like" domain (introduced earlier by someone else) where there are resources in the world and the agent has tasks.
-
- Feb 2023
-
arxiv.org arxiv.org
-
Definition 3.2 (simple reward machine).
The MDP does not change; its dynamics are the same with or without the RM, as they are with or without a standard reward model. Additionally, the rewards from the RM can be non-Markovian with respect to the MDP, because the RM inherently has a kind of memory of where you've been, limited to the agent's "movement" (almost "in its mind") through the goals for this task.
-
We then show that an RM can be interpreted as specifying a single reward function over a larger state space, and consider types of reward functions that can be expressed using RMs
So by specifying a reward machine you are augmenting the state space of the MDP with higher level goals/subgoals/concepts that provide structure about what is good and what isn't.
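A toy sketch of the idea: a reward machine as a small finite-state machine over high-level events, stepped alongside the environment. The class, the state names and the "coffee then office" task are hypothetical illustrations, not the paper's formal definition.

```python
class RewardMachine:
    """Finite-state machine mapping (rm_state, event) to (next_rm_state, reward)."""

    def __init__(self, initial, transitions):
        self.initial = initial
        self.transitions = transitions

    def step(self, u, event):
        # Events with no listed transition leave the RM state unchanged and pay 0.
        return self.transitions.get((u, event), (u, 0.0))


# Hypothetical task: pick up coffee, then deliver it to the office.
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", "coffee"): ("u1", 0.0),
        ("u1", "office"): ("done", 1.0),
    },
)


def run(rm, events):
    """Feed a sequence of events through the RM; return final RM state and total reward."""
    u, total = rm.initial, 0.0
    for event in events:
        u, reward = rm.step(u, event)
        total += reward
    return u, total


print(run(rm, ["office"]))                      # office before coffee pays nothing
print(run(rm, ["office", "coffee", "office"]))  # same event, different reward later
```

The same "office" event pays differently depending on the RM state, which is the non-Markovian part; a policy conditioned on the pair (environment state, RM state) sees an ordinary Markovian reward over the larger state space.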
-
However, an agent that had access to the specification of the reward function might be able to use such information to learn optimal policies faster.
Fascinating idea, why not? Why are we hiding the reward from the agent really?
-
Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning
[Icarte, JAIR, 2022] "Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning"
-
-
proceedings.mlr.press proceedings.mlr.press
-
Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning
[Icarte, PMLR, 2018] "Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning"
-
- Dec 2022
-
arxiv.org arxiv.org
-
Lee et al. - NeurIPS 2022 "Multi-Game Decision Transformers"
-
-
arxiv.org arxiv.org
-
[Neumann, Gros, NeurIPS, 2022] - "SCALING LAWS FOR A MULTI-AGENT REINFORCEMENT LEARNING MODEL"
-
- Sep 2022
-
arxiv.org arxiv.org
-
We study whether sequence modeling can perform policy optimization by evaluating Decision Transformer on offline RL benchmarks
-
-
arxiv.org arxiv.org
-
AAAI 2022 paper: Decentralized Mean Field Games. Happy to discuss online.
S. Ganapathi Subramanian, M. Taylor, M. Crowley, and P. Poupart, “Decentralized mean field games,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-2022), vol. 36, pp. 9439–9447, February 2022.
-
- Jul 2022
-
ieeexplore.ieee.org ieeexplore.ieee.org
-
A recent overview of RL methods used for autonomous driving.
-
- Jun 2022
-
assets.pubpub.org assets.pubpub.org
-
Discussion on
Bellinger C, Drozdyuk A, Crowley M, Tamblyn I. Balancing Information with Observation Costs in Deep Reinforcement Learning. Proceedings of the Canadian Conference on Artificial Intelligence [Internet]. 2022 May 27; Available from: https://caiac.pubpub.org/pub/0jmy7gpd
-
- May 2022
-
www.ncbi.nlm.nih.gov www.ncbi.nlm.nih.gov
-
Another piece to the "what can we do with eligibility traces" puzzle for Deep RL.
-
-
arxiv.org arxiv.org
-
Question: What happened to Eligibility Traces in the Deep RL era? This paper highlights some of the reasons they are not used widely and proposes a way they could still be effective.
-
-
arxiv.org arxiv.org
-
Question: What happened to Eligibility Traces in the Deep RL era? This paper highlights some of the reasons they are not used widely and proposes a way they could still be effective.
-
-
storage.googleapis.com storage.googleapis.com
-
Hypothesis page to discuss this high level description of DeepMind's new Gato framework.
-
- Mar 2022
-
arxiv.org arxiv.org
-
The paper that introduced the MineRL challenge dataset.
-
- Jul 2021
-
psyarxiv.com psyarxiv.com
-
Palminteri, S. (2021). Choice-confirmation bias and gradual perseveration in human reinforcement learning [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/dpqj6
-
- Jun 2021
-
-
Chadi, M.-A., & Mousannif, H. (2021). Reinforcement Learning Based Decision Support Tool For Epidemic Control [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/tcr8s
-
- Mar 2021
-
www.opendemocracy.net www.opendemocracy.net
-
Using chemicals to improve our economy of attention and become emotionally "fitter" is an option that penetrated public consciousness some time ago.
Same is true of reinforcement learning algorithms.
-
- Sep 2020
-
-
Ozaita, J., Baronchelli, A., & Sánchez, A. (2020). The emergence of segregation: From observable markers to group specific norms. ArXiv:2009.05354 [Physics, q-Bio]. http://arxiv.org/abs/2009.05354
-
-
journals.sagepub.com journals.sagepub.com
-
Ludwig, V. U., Brown, K. W., & Brewer, J. A. (2020). Self-Regulation Without Force: Can Awareness Leverage Reward to Drive Behavior Change? Perspectives on Psychological Science, 1745691620931460. https://doi.org/10.1177/1745691620931460
-
- May 2020
-
-
Radulescu, A., Holmes, K., & Niv, Y. (2020). On the convergent validity of risk sensitivity measures [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/qdhx4
-
-
-
psyarxiv.com psyarxiv.com
-
Hertz, U. (2020). Cognitive learning processes account for asymmetries in adaptations to new social norms [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/7thku
-
- Apr 2020
-
-
Ting, C., Palminteri, S., Lebreton, M., & Engelmann, J. B. (2020, March 25). The elusive effects of incidental anxiety on reinforcement-learning. https://doi.org/10.31234/osf.io/7d4tc
-
- Mar 2019
-
cjc.ict.ac.cn cjc.ict.ac.cn
-
A Survey of Deep Reinforcement Learning
-
-
cjc.ict.ac.cn cjc.ict.ac.cn
-
A Survey of Deep Reinforcement Learning
-
-
github.com github.com
-
reinforcement-learning code and paper tutorials
-
- Feb 2019
-
gitee.com gitee.com
-
We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including template-based models, bag-of-words models, sequence-to-sequence neural network and latent variable neural network models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than many competing systems. Due to its machine learning architecture, the system is likely to improve with additional data.
-