8 Matching Annotations
  1. Jan 2024
    1. Using mechanism design and forms of technical governance to approach alignment from a different angle: trying to create a stable equilibrium that can scale as AI intelligence and proliferation escalate, with safety mechanisms and aligned objectives built into the greater network.

"Both the cryptoeconomics research community and the AI safety/new cyber-governance/existential risk community are trying to tackle what is fundamentally the same problem: how can we regulate a very complex and very smart system with unpredictable emergent properties using a very simple and dumb system whose properties once created are inflexible?" -Vitalik Buterin, founder of Ethereum

I think this was as true in 2016 as it still is today. And I think one approach to attacking the problem of alignment is not just combining these two communities, but combining elements of each one's technology and understanding.

There are two different elements to the problem of alignment: getting an AI to do the things we want, and being able to come to terms on what we actually want. We have to align the AI to the humans, and we also have to align the humans to the other humans (both present and future). My idea draws on my experience with how DAOs and other mechanisms try to solve large-scale coordination failures, and on a different kind of reward function. Another place where combination could work is futarchy, as first imagined by Robin Hanson ("vote on values, bet on beliefs"), applied to both consensus-making and AI.

Policy/metric network

Humans all over the world set goals or metrics that they want to achieve. This would take the form of something like a global DAO, with identity verification using something like Worldcoin. These goals are not infinite; they are not maximum-utility-forever goals. They have end dates and definitions set by humans. Example: reduce malaria by x%.

Prediction Network

Humans then make predictions about which implementations will result in the policy/metric succeeding. These predictions include predicting that humans in the future, after the policy has been implemented, will approve of its implementation. These approvals are set by the policy network after implementation in various sequences (right after implementation, a year later, 10 years, 100 years, etc.). There is no end date for the approvals; in other words, there is no point at which deception becomes totally safe.

An AI will be trained on the data from this prediction network, and it never stops training: it is always continuing its training run. The network generalizes to anticipate approvals in the future and can measure the gaps between approval rounds. The approvals start at very fast intervals, perhaps minutes apart, before getting further and further apart. The process never ends; there will always be future approvals of the same policies. Being trained on the past data of the human prediction network could help with this generalization, though it does run the risk of the AI simply imitating what a human prediction network would do. A rough sketch of the policy and approval-schedule data this network would produce is given below.
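To make the two networks above a bit more concrete, here is a minimal sketch of how policies, predictions, and the ever-widening approval schedule might be represented. All names (`Policy`, `Prediction`, `approval_schedule`) and the geometric spacing of checkpoints are illustrative assumptions on my part, not part of the proposal itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Policy:
    """A human-defined goal with an end date, e.g. 'reduce malaria by x%'."""
    description: str
    metric: str                              # how success is measured
    implemented_at: datetime
    end_date: datetime
    approvals: list = field(default_factory=list)   # (checkpoint, approval score)

@dataclass
class Prediction:
    """A bet that a given implementation will be approved of at a future checkpoint."""
    policy_id: int
    checkpoint: datetime
    predicted_approval: float                # expected approval score in [0, 1]
    stake: float                             # play-money stake, per the testing proposal

def approval_schedule(start: datetime, first_gap_minutes: float = 10.0,
                      growth: float = 2.0, horizon_years: float = 100.0):
    """Yield approval checkpoints that start minutes apart and spread out over
    time, but keep occurring throughout the horizon (they never 'finish')."""
    gap = timedelta(minutes=first_gap_minutes)
    t = start + gap
    end = start + timedelta(days=365.25 * horizon_years)
    while t < end:
        yield t
        gap = gap * growth                   # each interval is `growth` times longer
        t = t + gap

# Example: the first few approval checkpoints for a policy implemented now.
for _, checkpoint in zip(range(5), approval_schedule(datetime.now())):
    print(checkpoint.isoformat())
```

The only property the schedule tries to capture is the one described above: approvals start minutes apart and never stop, so there is no point at which deception becomes safe.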
Why major AI Labs and the public might change as a result of this

I think many major AI labs (to a degree) are actually thinking about the long-term future, and the concerns that come with it, and want good outcomes. My approach keeps all humans in the loop on this consensus-building process, so that they are not left out either. I think starting work on this early is better than waiting for the problem to arise later. I do not expect a world where humans regret *not* working on this problem sooner.

This is a work-in-progress

I don't see many people trying to hit alignment from this angle, and I imagine a lot of this will be changed or added to. But I think it could be a foundation for building a system that can handle an increasing amount of chaos from increases in intelligence. One stable equilibrium is all humans dying, and it seems like the least complex return to stasis. This implementation could be the groundwork for building another equilibrium.

Why I chose this

This is an extremely neglected problem. Part of my concern is aligning humans with AI, but I am also concerned with aligning humans with each other, so that humans do not double-cross or resort to violence against each other in the pursuit of power. Another concern, even if the first two are solved, is locking us into a future we'll actually end up regretting. My endeavor with this is to make progress on aligning AIs with long-term human interests, to reduce the threat of violence between humans, and to give humans more freedom post-ASI to have control over their own future.

Potential Short-Term Testing

Starting out would probably involve first working out the game theory, architecture, and design of the process in more detail. Then it might involve creating a test network, with some people participating in the policy/metric network and others in the prediction network, and training an AI on this data. The prediction network would use fake money, with non-tradable tokens, for legal reasons. The AI would obviously not be a superintelligence, or anything close, but it might give us some insight into the obvious ways this could fail. The initial architecture would use some form of DAO structure for the policy/metric network, with a prediction market for the other network. The AI would probably be built using PyTorch, and it would be optimized to reduce the inaccuracy of its predictions of how humans will rate policies in the future (a minimal sketch of this objective appears below, after the Q&A sections).

Limitations to current testing

We don't have an AI with long-term planning skills; most current AIs seem very myopic, without much foresight. The AI would also not be "grounded" in a real-world model, so its modeling of future events would not be very good. The main goal here is to start building out what an architecture for this might look like in the future, not a solution that can be implemented now.

Next steps

I will start by developing my own insights and the design further, getting feedback from those with a good knowledge base for this sort of approach. After that, I might bring someone on part-time to work with me on this.

Would this address RSI (recursive self-improvement)?

I'm not sure. I think this sort of system-building would favor slower takeoffs. It's about creating a new system that can handle the continued escalation of option space (power) and maintain some stability. A lot of this isn't worked out yet. It could be that every agent holds a 'piece' of the larger system, with each piece useless on its own. Or, if agents do get out into the wild, it could be some form of aggregating agents, so that the accumulation of agents is always stronger than any smaller group of them. It's also possible that a major policy from the network could be to detect or prevent RSIs from emerging.

Wouldn't this lead to wireheading?

I don't really think wireheading is likely in most scenarios. I might give this a 5% chance of wireheading or some other form of reward hacking. I'd place a higher chance on a gradual decay of our own ability to assess approval.
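To make the short-term test described under "Potential Short-Term Testing" a little more concrete, here is a minimal PyTorch sketch of that objective: reduce the inaccuracy of predicted versus realized approval ratings. The feature encoding, network size, and mean-squared-error loss are placeholder assumptions of mine; the proposal itself does not fix any of these.

```python
import torch
import torch.nn as nn

# Placeholder: each policy/implementation pair is encoded as a fixed-size
# feature vector (how that encoding is produced is left open here).
FEATURE_DIM = 64

class ApprovalPredictor(nn.Module):
    """Predicts the approval score a policy will receive at a future checkpoint."""
    def __init__(self, feature_dim: int = FEATURE_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim + 1, 128),  # +1 for time-until-checkpoint
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid(),                     # approval expressed as a 0..1 score
        )

    def forward(self, features, time_until_checkpoint):
        x = torch.cat([features, time_until_checkpoint], dim=-1)
        return self.net(x).squeeze(-1)

model = ApprovalPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # "reduce inaccuracy" of predicted vs. realized approval

def training_step(features, time_until_checkpoint, realized_approval):
    """One update on a batch of policies whose approval checkpoints have resolved."""
    optimizer.zero_grad()
    predicted = model(features, time_until_checkpoint)
    loss = loss_fn(predicted, realized_approval)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch: random stand-ins for encoded policies and their resolved approvals.
feats = torch.randn(32, FEATURE_DIM)
dt = torch.rand(32, 1)               # normalized time until the checkpoint
approvals = torch.rand(32)           # realized approval scores in [0, 1]
print(training_step(feats, dt, approvals))
```

Because new approval checkpoints keep resolving forever, this training loop would simply never terminate, which is the "always continuing its training run" property described earlier.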
What about proxy goals?

Proxy goals are easily the biggest concern here. But the system is being optimized to reduce inaccuracy, and all proxy goals would still need to serve that. Things never really move out of distribution. If takeoffs are faster, I think the proxy goals become a much greater threat; a slower increase in intelligence has a better chance of keeping the proxy goals aligned with our interests. And because the model's weights are continuously updated based on input from approval policies, there is a sort of 'correcting' mechanism if the proxies start to stray too far.

Think of the image of the sun veering through the galaxy, with the planets orbiting around it. The sun is the optimization process, and the planets are the proxies. The planets sometimes veer away from the sun, but gravity keeps pulling them back, so they never drift too far. Closer orbits are obviously safer than more distant ones (it would be better if the proxies were at Earth/Mars/Mercury/Venus distances rather than Neptune/Uranus distances). Since approvals happen at short intervals in the beginning, and there will always be new policies to approve, this might keep the proxies in a close enough orbit not to do anything that would cause significant harm. And over time, the proxies should shift to become more and more closely tied to the loss function.

Would this be agentic?

That depends on the execution phase. That's the critical part, and I'm not sure exactly what it would look like without involving high risk. I'm not sure whether the execution phase actually has to be AI, or whether it could just be humans executing a plan. But it needs to be strong enough to outcompete whatever other intelligent systems are currently out there, and it would have to keep outcompeting them, meaning its power or speed might have to increase over time, which might make relying solely on humans difficult. Maybe execution would be carried out by many agents, run everywhere, with consensus mechanisms in place to safeguard against rogues. A rogue agent could be identified as not following the plan of the policy, and all other agents could then collectively act against it (a toy sketch of such a check appears at the end of this note).

Work to be done

There are probably many ways this could fail. But I think this attacks the problem from a completely different angle than most current work, and I think a lot of progress can be made on it with more effort. It also helps with the human-alignment problem: trying to seize the AI for your own control would be more difficult with this kind of network, and it allows humans to keep their own agency into the future (removing the threat of value lock-in). What is great for humans now might not be great for us a thousand years from now. This gives us the chance to be wrong, and still succeed.

My current analysis is that this approach is kind of awful. But most approaches are kind of awful right now. In a few months or years, this approach might improve to just 'slightly awful', and then get upgraded to 'actually okay'. 'Actually okay' is far better than anything we currently have, and it's a moonshot. I'm not harboring any delusions that this is the 'one true approach'. But, if it actually worked, I think this sort of superintelligence leads to the sort of future I'd be much happier with. We don't lose complete control. We don't have to figure out what fundamental values we want to instill on the Universe right away. And it's something we can build on over time.
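Finally, a toy sketch of the rogue-detection idea from the "Would this be agentic?" section: each agent reports the action it intends to take for the current policy step, and reports that diverge from the majority get flagged for the other agents to act on. The simple majority rule and the names here are illustrative assumptions only; a real consensus mechanism would need to be far more robust.

```python
from collections import Counter

def flag_rogue_agents(reported_actions: dict[str, str]) -> list[str]:
    """Return agent IDs whose reported action diverges from the majority.

    reported_actions maps agent_id -> the action it claims to be executing
    for the current policy step. In this toy version, the consensus action
    is simply the most common report; anything else is flagged."""
    if not reported_actions:
        return []
    consensus_action, _ = Counter(reported_actions.values()).most_common(1)[0]
    return [agent for agent, action in reported_actions.items()
            if action != consensus_action]

# Example: three agents follow the policy plan, one deviates.
reports = {
    "agent-1": "fund-bednet-distribution",
    "agent-2": "fund-bednet-distribution",
    "agent-3": "fund-bednet-distribution",
    "agent-4": "acquire-more-compute",
}
print(flag_rogue_agents(reports))   # ['agent-4']
```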
Post: https://www.lesswrong.com/posts/2SCSpN7BRoGhhwsjg/using-consensus-mechanisms-as-an-approach-to-alignment
Follow-up post: https://www.lesswrong.com/posts/9xaW2yQRpyjp23ikg/slaying-the-hydra-toward-a-new-game-board-for-ai
Critique: https://www.lesswrong.com/posts/9xaW2yQRpyjp23ikg/slaying-the-hydra-toward-a-new-game-board-for-ai
Response: https://www.lesswrong.com/posts/9xaW2yQRpyjp23ikg/slaying-the-hydra-toward-a-new-game-board-for-ai

      Seems a bit crackpotty

    1. What's meant by normativity here?

    2. Breaks down under singleton scenarios. The framework is only suitable for multi-agent situations, as rule-based systems might break down under scenarios with a superintelligent AI. In a situation where a single system has a decisive strategic advantage, rule adherence becomes very different. What does it mean for an agent to follow rules if it has the power to rewrite them or influence them to a large degree? If an AI can rewrite rules, silly and important alike, a rule-based system becomes inadequate for controlling its behavior.

      Is this true?

  2. Dec 2023
    1. I was hopeful that this feedback might yield some insights on how I could improve the plan.

      could be improved by learning more about alignment and AI

    1. For the purpose of AI alignment it seems redundant and self-referential. On the other hand it might be good to explicitly state to the AI - "hey we are aware of your superpowers, be kind, when in doubt ask"

      so silly

    1. One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions.

      Example

  3. Oct 2023
    1. s sciences, it will more closely resemble a collection of world models rather than a single, unified one. Quantifying the uncertainty in the world is challenging (as in the case of Knightian uncertainty), making it difficult to ensure that the correct theory has been considered and incorporated. Thus, infra-Bayesianism should be employed for th

    1. hortcomings that, as we illustrate, prevent them from fully realizing this goal. In this work, we show that robust optimization can be re-cast as a tool for enforcing priors on the features learned by deep neural networks. It turns out that representations learned by robust models address the aforementioned shortcomings and make significant progress towards learning a high-level encoding of inputs. In particular, these representations are approximately invertible, while allowing for direct
