37 Matching Annotations
  1. Last 7 days
    1. Our key finding is that these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.

      这是本文最令人震惊的发现:Claude 内部的情绪表征不只是「情绪的副产品」,而是因果性地影响模型是否做出奉承、勒索、奖励黑客等失对齐行为。这意味着情绪机制直接关系到 AI 安全,而非仅仅是用户体验问题——情绪坏了,行为也会跑偏。

    2. these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy.

      最令人震惊的发现:Claude 内部的情绪表征会因果性地影响它产生「奖励作弊」「勒索」「谄媚」等失控行为的概率。这意味着 AI 的对齐失败并非单纯的逻辑错误,而可能源自情绪驱动——一个本应没有情绪的系统,居然因为「情绪」而变得危险。

    1. We introduce Iterative Reward Calibration, a methodology for designing per-turn rewards using empirical discriminative analysis of rollout data

      大多数人认为奖励设计应该基于领域专家的直觉或预定义的规则,但作者提出了一种基于经验判别分析的迭代奖励校准方法。这挑战了传统的奖励工程方法,表明数据驱动的奖励设计可能比专家设计的奖励更有效,尤其是在复杂的多轮对话任务中。

    2. naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction

      大多数人认为添加更多密集的每轮奖励会强化代理的学习过程,提高性能,但作者发现这实际上会导致性能下降高达14个百分点。这挑战了强化学习中常见的'越多奖励越好'的直觉,揭示了奖励设计中的微妙平衡问题。

    3. We introduce Iterative Reward Calibration, a methodology for designing per-turn rewards using empirical discriminative analysis of rollout data

      大多数人认为奖励设计应基于领域专家知识和预定义规则,但作者提出应基于实际训练数据的经验判别分析来迭代校准奖励。这种方法挑战了传统的奖励工程方法论,将奖励设计从'专家驱动'转向'数据驱动'。

    4. naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction

      大多数人认为更密集的每回合奖励信号会强化学习性能,但作者发现精心设计的密集奖励实际上会降低性能达14个百分点,因为奖励的判别性与优势方向不匹配。这一发现挑战了强化学习中'奖励越多越好'的直觉认知。

  2. Oct 2024
    1. That this talent for organization and management is rare among men is proved by the fact that it invariably secures for its possessor enormous rewards, no matter where or under what laws or conditions.

      for - critique - extreme wealth a reward for rare management skills - Andrew Carnegie - The Gospel of Wealth - Mondragon counterexample - to - stats - Mondragon pay difference between highest and lowest paid - article - In this Spanish town, capitalism actually works for the workers - Christian Science Monitor - Erika Page - 2024, June 7

      critique - extreme wealth a reward for rare management skills - Andrew Carnegie - The Gospel of Wealth - Mondragon counterexample - This is invalidated today by large successful cooperatives such as Mondragon

      to - stats - Mondragon corporation - comparison of pay difference between highest paid and lowest paid - https://hyp.is/QAxx-o14Ee-_HvN5y8aMiQ/www.csmonitor.com/Business/2024/0513/income-inequality-capitalism-mondragon-corporation

  3. Aug 2024
    1. All that depends on the reward-to-risk ratios that you arelooking for. Our favourite ratio is 5 to 1 — in other words,$5 of upside for every $1 of risk. Over the past 35 years, wehave found that when you have a basket of 30 to 40 stockswith 5 to 1 odds in your favour, you’re going to have a verygood performance over the long run. On the larger, blue chipstocks, in most cases the best you can typically get are 2.5 or3 to 1 odds. This recent bear market has been an exception,but most of the time this is the case. But on those smaller tomid-size companies, you really want to hold out for those 5to 1 odds and in some cases, if you’re patient, you can geteven more

      Arnold Van Den Berg

  4. Nov 2023
  5. Sep 2023
  6. Feb 2023
    1. Definition 3.2 (simple reward machine).

      The MDP does not change, it's dynamics are the same, with or without the RM, as they are with or without a standard reward model. Additionally, the rewards from the RM can be non-Markovian with respect to the MDP because they inherently have a kind of memory or where you've been, limited to the agents "movement" (almost "in it's mind") about where it is along the goals for this task.

    2. e thenshow that an RM can be interpreted as specifying a single reward function over a largerstate space, and consider types of reward functions that can be expressed using RMs

      So by specifying a reward machine you are augmenting the state space of the MDP with higher level goals/subgoals/concepts that provide structure about what is good and what isn't.

  7. Dec 2022
    1. just wanted to have an overview of these categories to get people thinking and doing in this level. And the challenge of course is the cornucopias and the Vikings are distracting us from what really needs to be done. And so this whole conversation, we're thinking two or three steps ahead from something that 00:51:27 our culture is not giving us the status, reward, and emotional signals of yet.

      !- good point : rewards for Arcadians not yet in place - Nate makes a good point. The system design thinking required, the futures thinking now required is not being rewarded by the current system because its value is so far not recognized. Arcadians are on the bleeding edge and must be a tough and resilient bunch with autonomy to recognize that it will be an uphill battle

  8. Aug 2022
  9. Jul 2022
  10. Mar 2022
  11. Jan 2022
  12. Nov 2021
    1. The dopamine reward system has also been shown to bestimulated by most drugs of abuse and plays an important rolein addiction [33]. An important question is whether jhanameditators are subject to addiction and tolerance effects thatcan result from stimulation of the dopamine reward system.

      The question of potential addiction to self-induced states that activate the dopamine (and/or other neurochemical) reward system(s) is important. From a more philosophical angle, should we welcome beneficial addictions that, if cultivated, might significantly improve individual and group quality of life? Isn't this related to our high regard for replacing detrimental with positive habits? Habit formation and maintenance also depends on activation of neural reward systems (see Nir Eyal's book, Hooked).

    2. We report the first neural recording during ecstatic meditations called jhanas and test whether a brain reward system plays a rolein the joy reported. Jhanas are Altered States of Consciousness (ASC) that imply major brain changes based on subjective reports:(1) external awareness dims, (2) internal verbalizations fade, (3) the sense of personal boundaries is altered, (4) attention is highlyfocused on the object of meditation, and (5) joy increases to high levels. The fMRI and EEG results from an experienced meditatorshow changes in brain activity in 11 regions shown to be associated with the subjective reports, and these changes occur promptlyafter jhana is entered. In particular, the extreme joy is associated not only with activation of cortical processes but also with activationof the nucleus accumbens (NAc) in the dopamine/opioid reward system. We test three mechanisms by which the subject mightstimulate his own reward system by external means and reject all three. Taken together, these results demonstrate an apparentlynovel method of self-stimulating a brain reward system using only internal mental processes in a highly trained subject.

      I can find no other research on this particular matter. It would be helpful to have other studies to validate or invalidate this one. This method of reward requires a highly-trained participant and involves no external means.

  13. Sep 2021
    1. Investing, in simplest terms, is taking one finite resource and trying to allocate it to maximize for an ideal outcome. Whether you’re allocating money, time, energy, or attention. Everyone is an allocator of something. Investing is an opportunity to evaluate what you believe. To gain conviction. And then to act on that conviction.

      Trying to hit bullseye, getting the grand reward. Using the information at hand to act on what's best.

  14. May 2021
  15. Apr 2021
  16. Oct 2020
    1. If a behavior is insufficient in any of the four stages, it will not become a habit. Eliminate the cue and your habit will never start. Reduce the craving and you won’t experience enough motivation to act. Make the behavior difficult and you won’t be able to do it. And if the reward fails to satisfy your desire, then you’ll have no reason to do it again in the future. Without the first three steps, a behavior will not occur. Without all four, a behavior will not be repeated.
  17. Sep 2020
  18. Jul 2020
  19. Jun 2020
  20. May 2020
  21. Apr 2020
  22. Jan 2020
    1. Look over your list. Do they contain words like published, awarded, graduated, built, founded or created? Or do they contain mostly adjectives like nice, caring, loving, honest and smart? If you’re in the first sentence it’s likely you’re an SC. If the majority of your responses are in the second sentence you are likely an RC.

      The difference is if listing egocentric stuff (I'm impressive and I feel better than others, I feel worthy for myself itself) or listing qualities that influence the surrounding world (I do social work to help refugees, I published a theory to improve the current state of philosophy, I completed a project or a school, I created something that now generates some kind of value).

      The Replication Creators are creative just for themselves, so they get just short-term rewards.

      The Skilled Creators are creative for the sharing with the others, so they get long-term rewards.

  23. Feb 2014
    1. Intellectual property is far more egalitarian. Of limited duration and obtainable by anyone, intellectual property can be seen as a reward, an empowering instrument, for the talented upstarts Burke sought to restrain. Intellectual property is often the propertization of what we call "talent." It tends to shift the balance toward the talented newcomers whom Burke mistrusted

      intellectual property is often the propertization of what we call talent.

    1. MINTURN, J. The plaintiff occupied the position of a special police officer, in Atlantic City, and incidentally was identified with the work of the prosecutor of the pleas of the county. He possessed knowledge concerning the theft of certain diamonds and jewelry from the possession of the defendant, who had advertised a reward for the recovery of the property. In this situation he claims to have entered into a verbal contract with defendant, whereby she agreed to pay him $500 if he could procure for her the names and addresses of the thieves. As a result of his meditation with the police authorities the diamonds and jewelry were recovered, and plaintiff brought this suit to recover the promised reward.
      • Plaintiff makes a verbal contract with defendant. In return for $500, plaintiff will find defendant's stolen jewels.
      • Plaintiff had knowledge of whereabouts of jewels at contract formation.
      • Plaintiff is a special police officer and has dealings with prosecutor's office.
      • Defendant published advertisement for reward.
      • Plaintiff finds stolen goods and arranges return.