327 Matching Annotations
  1. Jul 2023
    1. not sufficient to simply choose a fixed penalty coefficient β and optimize the penalized objective Equation (5) with SGD
    2. objective function (the “surrogate” objective) is maximized

      PPO is a response to the TRPO algorithm, trying to use the core idea but implement a more efficient and simpler algorithm.

      TRPO defines the problem as a straight optimization problem; no learning is actually involved.

    3. Generalizing this choice, we can use a truncated version of generalized advantage estimation
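
      A minimal sketch of what "truncated generalized advantage estimation" computes over a length-T segment, assuming numpy and ignoring episode termination for brevity (variable names are illustrative):

      ```python
      import numpy as np

      def truncated_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
          """Truncated GAE: A_hat_t = sum_l (gamma*lam)^l * delta_{t+l},
          where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
          T = len(rewards)
          values = np.append(values, last_value)   # append V(s_T) for bootstrapping
          advantages = np.zeros(T)
          gae = 0.0
          for t in reversed(range(T)):
              delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
              gae = delta + gamma * lam * gae
              advantages[t] = gae
          return advantages
      ```
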
    4. Without a constraint, maximization of L^CPI would lead to an excessively large policy update; hence, we now consider how to modify the objective, to penalize changes to the policy that move r_t(θ) away from 1

      The policy iteration objective proposes steps which are too large. It uses a likelihood ratio of the current policy with an older version of the policy, multiplied by the Advantage function. So it uses the change in the policy probability for an action to weight the Advantage function.

    5. our goal of a first-order algorithm that emulates the monotonic improvement of TRPO,
    6. A proximal policy optimization (PPO) algorithm that uses fixed-length trajectory segments is shown below. Each iteration, each of N (parallel) actors collect T timesteps of data. Then we construct the surrogate loss on these NT timesteps of data, and optimize it with minibatch SGD
    7. The first term inside the min is L^CPI. The second term, clip(r_t(θ), 1 − ε, 1 + ε) Â_t, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving r_t outside of the interval [1 − ε, 1 + ε]. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective

      The "clip" function bounds the probability ratio, so changes that push it outside the interval [1 − ε, 1 + ε] bring no additional benefit to the objective.

    8. Clipped Surrogate Objective
    9. We can see that L^CLIP is a lower bound on L^CPI, with a penalty for having too large of a policy update

      The clipped objective is a lower bound on the unclipped objective used in TRPO. It is simpler to compute, and it will at least provide some guidance, since it never overestimates the true objective.
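
      To make the clipped objective concrete, here is a minimal PyTorch sketch of L^CLIP (names and the value of epsilon are illustrative, not taken from any particular implementation); it is returned negated so it can be minimized with SGD/Adam:

      ```python
      import torch

      def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
          """L^CLIP = E[ min( r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t ) ], negated as a loss."""
          ratio = torch.exp(log_probs_new - log_probs_old)                  # r_t(theta)
          unclipped = ratio * advantages                                    # the L^CPI term
          clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
          return -torch.min(unclipped, clipped).mean()                      # pessimistic bound
      ```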

    10. These methods have the stability and reliability of trust-region methods but are much simpler to implement, requiring only few lines of code change to a vanilla policy gradient implementation, applicable in more general settings
    11. shows how several objectives vary as we interpolate along the policy update direction
    12. Surrogate objectives, as we interpolate between the initial policy parameter θ_old, and the updated policy parameter, which we compute after one iteration of PPO.

      Another figure giving intuition for the approach: it shows how each objective term changes as we follow the policy update along the gradient direction.

    13. lower bound (i.e., a pessimistic bound
    14. Paper that introduced the PPO algorithm. PPO is, in a way, a response to the TRPO algorithm, trying to use the core idea but implement a more efficient and simpler algorithm.

      TRPO defines the problem as a straight optimization problem; no learning is actually involved.

    15. The theory justifying TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem \( \max_{\theta} \; \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t - \beta \, \mathrm{KL}\big[\pi_{\theta_{\text{old}}}(\cdot \mid s_t), \pi_\theta(\cdot \mid s_t)\big] \right] \) (Equation 5) for some coefficient β

      This parameter β is a bit mysterious. PPO works very well generally, but setting β is tricky, and it influences other parts of the algorithm.
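
      The paper's adaptive-KL variant deals with this by adjusting β after every policy update based on the measured KL divergence; a sketch of that update rule (the 1.5 and 2 factors are the heuristic values given in the paper, and kl_target is the user-chosen target d_targ):

      ```python
      def update_kl_coefficient(beta, measured_kl, kl_target):
          """Adaptive KL penalty coefficient update from the PPO paper."""
          if measured_kl < kl_target / 1.5:
              beta = beta / 2.0   # policy barely moved, relax the penalty
          elif measured_kl > kl_target * 1.5:
              beta = beta * 2.0   # policy moved too far, tighten the penalty
          return beta
      ```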

    1. New transitions
    2. bias towards recent transitions
    3. samples transitions with probability p_t relative to the last encountered absolute TD error
    4. RMSprop
    5. This means that in the loss above, the time index t will be a random time index from the last million transitions, rather than the current time.
    6. Multi-step learning.
    7. Prioritized replay.
    8. Prioritized
    9. parameters θ of the online network (which is also used to select actions
    10. ablation
    11. θ̄ represents the parameters of a target network
    12. a periodic copy of the online network which is not directly optimized.
    13. Noisy Nets. The limitations of exploring using ε-greedy policies are clear in games such as Montezuma’s Revenge, where many actions must be executed to collect the first reward
    14. t is a time step randomly picked from the replay memory
    15. DDQN
    16. This famous paper gives a great review of the DQN algorithm a couple years after it changed everything in Deep RL. It compares six different extensions to DQN for Deep Reinforcement Learning, many of which have now become standard additions to DQN and other Deep RL algorithms. It also combines all of them together to produce the "rainbow" algorithm, which outperformed many other models for a while.
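
      Regarding the prioritized replay annotations above: a minimal numpy sketch of proportional prioritization, where transition i is sampled with probability proportional to its last absolute TD error raised to an exponent ω (the full method also corrects the resulting bias with importance-sampling weights, omitted here):

      ```python
      import numpy as np

      def sample_prioritized(td_errors, batch_size, omega=0.5, eps=1e-6):
          """Sample transition indices with P(i) proportional to |TD error_i|^omega."""
          priorities = (np.abs(td_errors) + eps) ** omega
          probs = priorities / priorities.sum()
          indices = np.random.choice(len(priorities), size=batch_size, p=probs)
          return indices, probs[indices]
      ```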

    1. This means that the target values are constrained to change slowly, greatly improving the stability of learning.
    2. A major challenge of learning in continuous action spaces is exploration. An advantage of off-policy algorithms such as DDPG is that we can treat the problem of exploration independently from the learning algorithm.

      Learning and Exploration are handled separately.

    3. but modified for actor-critic and using “soft” target updates, rather than directly copying the weights.
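
      A minimal PyTorch sketch of the "soft" (Polyak-averaged) target update, assuming online and target networks with the same architecture; the small value of tau is what keeps the targets changing slowly:

      ```python
      import torch

      @torch.no_grad()
      def soft_update(target_net, online_net, tau=0.001):
          """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
          for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
              t_param.mul_(1.0 - tau).add_(tau * o_param)
      ```
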
    4. This simple change moves the relatively unstable problem of learning the action-value function closer to the case of supervised learning, a problem for which robust solutions exist.
    5. One approach to this problem is to manually scale the features so they are in similar ranges across environments and units. We address this issue by adapting a recent technique from deep learning called batch normalization
    6. minimize covariance shift
    7. This paper introduces the DDPG algorithm, which builds on the existing DPG algorithm from classic RL theory. The main idea is to define a deterministic (or nearly deterministic) policy for situations where the environment is very sensitive to suboptimal actions and one action setting usually dominates in each state. This showed good performance, but could not beat algorithms such as PPO until the ideas behind SAC were added. SAC adds an entropy term to the objective, which essentially rewards keeping some uncertainty in the policy rather than collapsing to a single action too early. Using this, the deterministic policy gradient approach performs well.

    8. normalizes each dimension across the samples in a minibatch to have unit mean and variance
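
      For intuition, a numpy sketch of the normalization batch norm applies per feature dimension across a minibatch (leaving out the learned scale/shift parameters and the running statistics used at test time):

      ```python
      import numpy as np

      def normalize_minibatch(x, eps=1e-5):
          """Normalize each feature dimension across the minibatch (rows are samples)."""
          mean = x.mean(axis=0)
          var = x.var(axis=0)
          return (x - mean) / np.sqrt(var + eps)
      ```
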
    1. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish
    2. Bowen Baker et al. (OpenAI), "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos", arXiv, June 2022.

      Introduction of VPT: a new semi-supervised pre-trained model for sequential decision making on Minecraft. Data are from human video playthroughs but are unlabelled.

    3. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos.
    1. an agent is instructed to obtain a desired goal item

      Problem: Agent must complete the instructed task in Minecraft

    2. curriculum learning (at least using current RL methods) is that the agent achieves a small success probability (within available/reasonable compute) on a new task after mastering a previous task.

      Curriculum Learning

    3. We study curriculum learning on a set of goal-conditioned Minecraft tasks, in which the agent is tasked to collect one out of a set of 107 items from the Minecraft tech tree
    4. Results

      Experiments: They compared a variety of policies and training approaches

    5. It has 5 minutes (1500 time steps) to complete the task and obtains a reward of +1 upon success. After each success or failure a new task is selected without resetting the world or respawning the agent
      • Agent has 5 min to find item
      • Next item chosen without resetting world
    6. “Simon Says”
    7. Learning progress curriculum

      Approach: Curriculum Learning

    1. arXiv paper from 2021 on reinforcement learning in a scenario where your aim is to learn a workable POMDP policy, but you start with a fully observable MDP and adjust it over time towards a POMDP.

    1. Liang, Machado, Talvitie, Bowling - AAMAS 2016 "State of the Art Control of Atari Games Using Shallow Reinforcement Learning"

      Response paper to DQN showing that well designed Value Function Approximations can also do well at these complex tasks without the use of Deep Learning

      A great paper showing how to think differently about the latest advances in Deep RL. All is not always what it seems!

    1. Few-Shot (FS) - the model is given a few demonstrations of the task at inference time as conditioning [RWC+19], but no weights are updated

      hints are given but the model is not updated
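
      A toy sketch of what "conditioning" means here: the demonstrations are simply concatenated into the prompt fed to a frozen model, and no gradient updates happen (the format and examples below are made up for illustration):

      ```python
      def build_few_shot_prompt(demonstrations, query):
          """Concatenate K demonstrations and the query; the model's weights stay fixed."""
          lines = [f"Q: {q}\nA: {a}" for q, a in demonstrations]
          lines.append(f"Q: {query}\nA:")
          return "\n\n".join(lines)

      prompt = build_few_shot_prompt(
          demonstrations=[("2 + 2", "4"), ("3 + 5", "8")],  # made-up demonstrations
          query="7 + 6",
      )
      ```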

  2. Jun 2023
    1. fuzzy

      fuzzy!

    2. [KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Ra

      Justification for low learning rate in large language models.

    3. As found in [KMH+20, MKAT18], larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table A.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPU’s on part of a high-bandwidth cluster. Details of the training process and hyperparameter settings are described in the appendix.

      Why is this?

    4. We use the same model and architecture as GPT-2

      What do they mean by "model" here? If they have retrained on more data, with a slightly different architecture, then the model weights after training must be different.

    1. While zero-shot performance establishes a baseline of the potential performance of GPT-2 on many tasks, it is not clear where the ceiling is with finetuning.

      So finetuning could lead to better models.

    2. 13.19%

      that's a lot!

    3. The Bloom filters were constructed such that the false positive rate is upper bounded by 1/10^8. We further verified the low false positive rate by generating 1M strings, of which zero were found by the filter

      Bloom filters used to determine how much overlap there is between train and test set, to be more sure of their results.

    4. Bloom filters

      Bloom Filter:

      The high level idea is to map elements x∈X to values y=h(x)∈Y using a hash function h, and then test for membership of x' in X by checking whether y'=h(x')∈Y, and do that using multiple hash functions h.

      Bloom Filter - Wikipedia
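
      A minimal, illustrative Bloom filter using several hashes derived from hashlib (not the paper's implementation; real versions size the bit array and choose the number of hashes to hit a target false positive rate):

      ```python
      import hashlib

      class BloomFilter:
          def __init__(self, num_bits=1_000_000, num_hashes=7):
              self.num_bits = num_bits
              self.num_hashes = num_hashes
              self.bits = bytearray(num_bits)  # one byte per "bit" for simplicity

          def _positions(self, item):
              for i in range(self.num_hashes):
                  digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                  yield int(digest, 16) % self.num_bits

          def add(self, item):
              for pos in self._positions(item):
                  self.bits[pos] = 1

          def __contains__(self, item):
              # Can return True for items never added (false positive),
              # but never False for an item that was added.
              return all(self.bits[pos] for pos in self._positions(item))

      bf = BloomFilter()
      bf.add("the quick brown fox")
      assert "the quick brown fox" in bf
      ```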

    5. Recent work in computer vision has shown that common image datasets contain a non-trivial amount of near-duplicate images. For instance CIFAR-10 has 3.3% overlap between train and test images (Barz & Denzler, 2019). This results in an over-reporting of the generalization performance of machine learning systems.

      CIFAR-10 performance results are overestimates since some of the training data is essentially in the test set.

    1. Liang, Machado, Talvitie, Bowling - AAMAS 2016 "State of the Art Control of Atari Games Using Shallow Reinforcement Learning"

      A great paper showing how to think differently about the latest advances in Deep RL. All is not always what it seems!

    1. We also add 8,000 text-instructions generated by the OpenAI API gpt-3.5-turbo model [38],

      How does this work? Does it take images as input as well?

    1. Paper from 2016, soon after the DQN paper, about how to use eligibility traces to improve performance further.

    1. LeBlanc, D. G., & Lee, G. (2021). General Deep Reinforcement Learning in NES Games. Canadian AI 2021. Canadian Artificial Intelligence Association (CAIAC). https://doi.org/10.21428/594757db.8472938b

    1. introducing a unified framework that converts all text-based language problems into a text-to-text format

      This is their goal: to have a single model, including hyperparameters and setup, that can be used for any NLP task.

    2. Paper introducing the T5 Text-to-Text transformer model from Google. (Raffel, JMLR, 2020)

  3. Apr 2023
    1. The Annotated S4: Efficiently Modeling Long Sequences with Structured State Spaces. Albert Gu, Karan Goel, and Christopher Ré.

      A new approach to transformers

    1. Efficiently Modeling Long Sequences with Structured State Spaces. Albert Gu, Karan Goel, and Christopher Ré. Department of Computer Science, Stanford University

    1. Bowen Baker et al. (OpenAI), "Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos", arXiv, June 2022.

      New supervised pre-trained model for sequential decision making on Minecraft. Data are from human video playthroughs but are unlabelled.

      reinforcement-learning foundation-models pretrained-models proj-minerl minecraft

    1. In this way, an NTM can be thought of as simultaneously exploring all computational possibilities in parallel and selecting an accepting branch

      Non-deterministic Turing Machines are able to get lucky and choose the single path to the answer in polynomial time, or be given a "hint" or "proof" or "certificate" for that path. This isn't realistic, but it separates verifying a solution and finding one into two different tasks, which can have very different difficulty.

    2. Computational problems: Intuitively, a computational problem is just a question that can be solved by an algorithm. For example, "is the natural number n prime?" is a computational problem. A computational problem is mathematically represented as the set of answers to the problem. In the primality example, the problem (call it PRIME) is represented by the set of all natural numbers that are prime: PRIME = { n ∈ ℕ | n is prime }. In the theory of computation, these answers are represented as strings; for example, in the primality example the natural numbers could be represented as strings of bits that represent binary numbers. For this reason, computational problems are often synonymously referred to as languages, since strings of bits represent formal languages (a concept borrowed from linguistics); for example, saying that the PRIME problem is in the complexity class NP is equivalent to saying that the language PRIME is in NP.

      Explanation of why computational complexity class proofs with Turing Machines use "strings" instead of algorithms or programs.

    1. Presburger arithmetic is much weaker than Peano arithmetic, which includes both addition and multiplication operations. Unlike Peano arithmetic, Presburger arithmetic is a decidable theory. This means it is possible to algorithmically determine, for any sentence in the language of Presburger arithmetic, whether that sentence is provable from the axioms of Presburger arithmetic. The asymptotic running-time computational complexity of this algorithm is at least doubly exponential, however, as shown by Fischer & Rabin (1974).

      This is an example of a decision problem that is at least doubly exponential \(2^{2^n}\). It is a simpler form of arithmetic where the Halting problem/incompleteness theorem does not apply.

    1. NP-hard if everything in NP can be transformed in polynomial time into it even though it may not be in NP

      Definition of NP-hard problem

    2. At present, all known algorithms for NP-complete problems require time that is superpolynomial in the input size, in fact exponential in O(n^k) for some k > 0, and it is unknown whether there are any faster algorithms.

      So how hard are NP-complete problems?

    3. The Subgraph Isomorphism problem is NP-complete. The graph isomorphism problem is suspected to be neither in P nor NP-complete, though it is in NP. This is an example of a problem that is thought to be hard, but is not thought to be NP-complete. This class is called NP-Intermediate problems and exists if and only if P≠NP.

      There might even be some problems in NP but not in P and that are not NP-complete.

    4. NP-complete problems are often addressed by using heuristic methods and approximation algorithms.

      usually solved with approximation algorithms

    1. My annotations for the OpenAI GPT-4 info page.

    2. GPT-4 outperforms ChatGPT by scoring in higher approximate percentiles among test-takers.

      oh, great.

    3. 40% more likely to produce factual responses than GPT-3.5

      great, 40% more than what though?

    4. We used GPT-4 to help create training data for model fine-tuning and iterate on classifiers across training, evaluations, and monitoring.

      Interesting, you need to consider, is this like data augmentation, like bootstrapping, like adversarial training, or is it like overfitting to your data?

  4. Mar 2023
    1. tasks for the Minecraft domain.

      They demonstrate the model on a "minecraft-like" domain (introduced earlier by someone else) where there are resources in the world and the agent has tasks.

  5. Feb 2023
    1. Definition 3.2 (simple reward machine).

      The MDP does not change; its dynamics are the same, with or without the RM, just as they are with or without a standard reward model. Additionally, the rewards from the RM can be non-Markovian with respect to the MDP because they inherently have a kind of memory of where you've been, limited to the agent's "movement" (almost "in its mind") about where it is along the goals for this task.

    2. We then show that an RM can be interpreted as specifying a single reward function over a larger state space, and consider types of reward functions that can be expressed using RMs

      So by specifying a reward machine you are augmenting the state space of the MDP with higher level goals/subgoals/concepts that provide structure about what is good and what isn't.

    3. However, an agent that had access to the specification of the reward function might be able to use such information to learn optimal policies faster.

      Fascinating idea, why not? Why are we hiding the reward from the agent really?

    4. U is a finite set of states,

      Apply a set of logical rules to the state space to obtain a finite set of states.

    5. state-reward function,

      reward is a constant number assigned to each set of states
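
      To make the definition concrete, a tiny sketch of a simple reward machine as a finite-state machine over high-level events; the two-step "get coffee, then reach the office" task is illustrative:

      ```python
      class SimpleRewardMachine:
          """States U with transitions delta_u(u, event) -> u' and constant rewards delta_r(u, event)."""

          def __init__(self, initial_state, transitions):
              # transitions: {(u, event): (next_u, reward)}
              self.transitions = transitions
              self.u = initial_state

          def step(self, event):
              """Advance on the high-level event detected in the environment; return the reward."""
              self.u, reward = self.transitions.get((self.u, event), (self.u, 0.0))
              return reward

      # Reward 1 only after seeing 'coffee' and then 'office'; the RM state is the memory.
      rm = SimpleRewardMachine("u0", {
          ("u0", "coffee"): ("u1", 0.0),
          ("u1", "office"): ("u_done", 1.0),
      })
      print(rm.step("coffee"), rm.step("office"))  # 0.0 1.0
      ```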

    6. Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning

      [Icarte, JAIR, 2022] "Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning"

    1. Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning

      [Icarte, PMLR, 2018] "Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning"

    1. Bell’s theorem is about correlations (joint probabilities) of stochastic real variables and therefore does not apply to quantum theory, which neither describes stochastic motion nor uses real-valued observables

      strong statement, what do people think about this? is it accepted by anyone or dismissed?

  6. Jan 2023
  7. www.cs.princeton.edu
    1. "Finding Optimal Solutions to Rubik's Cube Using Pattern Databases" by Richard E. Korf, AAAI 1997.

      The famous "Korf Algorithm" for finding the optimal solution to any Rubik's Cube state.

    1. make up facts less often

      but not "never"

    2. On prompts submitted by our customers to the API,[1

      really? so that's how they make money.

      Question: what kind of bias does this introduce into the model?

      • which topics and questions get trained on?
      • what is the goal of training? truth? clickability?
    3. Blog post from OpenAI in Jan 2022 explaining some of the approaches they use to train, reduce, and tune their LLM for particular tasks. This was all a precursor to the ChatGPT system we now see.

  8. Dec 2022
  9. Nov 2022
    1. 10K

      Kind of ambiguous to use 10K when one of the most important variables is K.

    2. An embedding for each timestep is learned and added to each token – note this is different than the standard positional embedding used by transformers, as one timestep corresponds to three tokens

      one timestep corresponds to three tokens
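
      A PyTorch sketch of that idea: one learned embedding per environment timestep, shared by the three tokens (return-to-go, state, action) of that timestep (shapes and sizes are illustrative):

      ```python
      import torch
      import torch.nn as nn

      batch, timesteps, d_model = 4, 10, 128
      timestep_embed = nn.Embedding(1000, d_model)    # one row per environment timestep

      t = torch.arange(timesteps).unsqueeze(0).expand(batch, -1)   # (batch, T)
      time_emb = timestep_embed(t)                                 # (batch, T, d_model)

      # Each timestep yields three tokens, and all three get the same timestep embedding.
      time_emb_per_token = time_emb.repeat_interleave(3, dim=1)    # (batch, 3*T, d_model)
      ```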

    1. we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization a

      Using the attention mechanism to determine global dependencies between input and output instead of using recurrent links to past states. This is the essence of their new idea.
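
      For reference, the core operation referred to here, scaled dot-product attention, as a small single-head PyTorch sketch with no masking:

      ```python
      import torch
      import torch.nn.functional as F

      def scaled_dot_product_attention(q, k, v):
          """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
          d_k = q.size(-1)
          scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise query-key similarities
          weights = F.softmax(scores, dim=-1)             # each position attends over all others
          return weights @ v

      q = k = v = torch.randn(2, 5, 64)                   # (batch, sequence, d_k)
      out = scaled_dot_product_attention(q, k, v)         # (2, 5, 64)
      ```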

    1. "On the Opportunities and Risks of Foundation Models" This is a large report by the Center for Research on Foundation Models at Stanford. They are creating and promoting the use of these models and trying to coin this name for them. They are also simply called large pre-trained models. So take it with a grain of salt, but also it has a lot of information about what they are, why they work so well in some domains and how they are changing the nature of ML research and application.

  10. Sep 2022
    1. We study whether sequence modeling can perform policy optimization by evaluating Decision Transformer on offline RL benchmarks
    1. AAAI 2022 paper: "Decentralized Mean Field Games". Happy to discuss online.

      S. Ganapathi Subramanian, M. Taylor, M. Crowley, and P. Poupart, “Decentralized mean field games,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-2022), vol. 36, pp. 9439–9447, February 2022.

  11. Jul 2022
    1. As a baseline model we took the feature representation from a large pre-trained CNN such as ResNet50, by using the model and excluding the final dense layer, and using this in place of our convolution layers. We had predicted that this would likely get us some performance, but would inherently be worse, since we had fixed some of our trainable parameters.

      They didn't try to train the CNN from scratch.
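
      A sketch of this kind of frozen-feature baseline, assuming PyTorch/torchvision (the authors' actual code, preprocessing, and head are not given, so everything here is illustrative):

      ```python
      import torch
      import torch.nn as nn
      from torchvision import models

      backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)  # torchvision >= 0.13 API
      backbone.fc = nn.Identity()           # drop the final dense (classification) layer
      for p in backbone.parameters():
          p.requires_grad = False           # keep the pre-trained features fixed

      head = nn.Linear(2048, 10)            # small trainable head; 10 classes is a placeholder

      x = torch.randn(8, 3, 224, 224)       # dummy batch of images
      with torch.no_grad():
          features = backbone(x)            # (8, 2048) feature vectors
      logits = head(features)
      ```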

  12. Jun 2022
  13. May 2022
    1. Question: What happened to Eligibility Traces in the Deep RL era? This paper highlights some of the reasons they are not used widely and proposes a way they could still be effective.

    1. Question: What happened to Eligibility Traces in the Deep RL era? This paper highlights some of the reasons they are not used widely and proposes a way they could still be effective.

  14. Mar 2022
    1. Weak supervision also objectively identifies relevant morphological features from the tissue microenvironment without any a priori knowledge or subjective annotation. In three separate analyses, we showed that our models can identify well-known morphological features and accordingly, has the capability of identifying new morphological features of diagnostic, prognostic, and therapeutic relevance.

      Their target images are very large and there is a known (supervised) label for the entire image, but no labels for parts of an image (e.g. where exactly is the tumor?). So the powerful property of their method is the ability to learn, on its own, what parts of the image relate to the label.

  15. Jan 2022
    1. The Canadian experiment has been built, in large part, around the American experiment: They have the melting pot, we have the cultural mosaic; they have the free market, we have sensible regulation; they have “life, liberty and the pursuit of happiness,” we have “peace, order and good government.”

      I agree with this.

    2. Northrop Frye once defined a Canadian as “an American who rejects the Revolution.”

      I see what he means but I wouldn't go this far. Canadians do have a separate cultural identity. It is defined by its lack of definition and certainty, in contrast to American certainty. This is why it is more resilient. It cannot have certainty because our nation was founded on the "two solitudes" of French and English, Catholic and Protestant, and also on the very different, though equally destructive, relationship of the European colonizers with the Indigenous Peoples of Canada.

    3. A flaw lurked right at the core of the experiment, as flaws so often do in works of ambitious genius.

      The flaw was an assumption that everyone had the nation's best interests at heart, that they all wanted the same thing deep down.

    4. Difference is the core of the American experience. Difference is its genius. There has never been a country so comfortable with difference, so full of difference.

      Diversity is Strength. This is really one of their founding principles, even in its hypocrisy. For them the diversity was in religious faith and ways of thinking but did not include gender, ethnicity or anything else. In time this changed and it is the only reason America has done so well.

  16. Jul 2021
    1. Such a map, plus the universal property of A, is in fact enough to reconstruct the entire Turing structure of C.

      The minimal data necessary to reconstruct the Turing structure.

    2. not necessarily extensional, only intensional)

      What's the difference?