Hypothesis

6 Matching Annotations

Dec 2022
scottaaronson.blog

My AI Safety Lecture for UT Effective Altruism

5
1. ravenscroftj 19 Dec 2022
  
  in Public
  
  Now, this can all be defeated with enough effort. For example, if you used another AI to paraphrase GPT’s output—well okay, we’re not going to be able to detect that. On the other hand, if you just insert or delete a few words here and there, or rearrange the order of some sentences, the watermarking signal will still be there. Because it depends only on a sum over n-grams, it’s robust against those sorts of interventions.
  
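  This watermark can be defeated by paraphrasing the output with another model, but it survives small edits (inserting, deleting, or reordering a few words) because detection depends only on a sum over n-grams.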
  
  explainability nlproc
2. ravenscroftj 19 Dec 2022
  
  in Public
  
  Anyway, we actually have a working prototype of the watermarking scheme, built by OpenAI engineer Hendrik Kirchner. It seems to work pretty well—empirically, a few hundred tokens seem to be enough to get a reasonable signal that yes, this text came from GPT. In principle, you could even take a long text and isolate which parts probably came from GPT and which parts probably didn’t.
  
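  Scott's team at OpenAI has already built a working prototype of the watermarking scheme (implemented by Hendrik Kirchner). Empirically, a few hundred tokens seem to be enough to get a reasonable detection signal.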
  
  explainability nlproc
3. ravenscroftj 19 Dec 2022
  
  in Public
  
  So then to watermark, instead of selecting the next token randomly, the idea will be to select it pseudorandomly, using a cryptographic pseudorandom function, whose key is known only to OpenAI.
  
  Watermarking works by selecting each next token with a keyed cryptographic pseudorandom function instead of sampling it randomly; the key is known only to OpenAI.
  
  explainability nlproc
4. ravenscroftj 19 Dec 2022
  
  in Public
  
  Eventually GPT will say, “oh, I know what game we’re playing! it’s the ‘give false answers’ game!” And it will then continue playing that game and give you more false answers. What the new paper shows is that, in such cases, one can actually look at the inner layers of the neural net and find where it has an internal representation of what was the true answer, which then gets overridden once you get to the output layer.
  
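  This is fascinating: GPT holds an internal representation of the true answer in its inner layers, but that representation gets overridden by the time it reaches the output layer, once the model has decided it is playing the "give false answers" game.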
  
  explainability nlproc
5. ravenscroftj 19 Dec 2022
  
  in Public
  
  (3) A third direction, and I would say maybe the most popular one in AI alignment research right now, is called interpretability. This is also a major direction in mainstream machine learning research, so there’s a big point of intersection there. The idea of interpretability is, why don’t we exploit the fact that we actually have complete access to the code of the AI—or if it’s a neural net, complete access to its parameters? So we can look inside of it. We can do the AI analogue of neuroscience. Except, unlike an fMRI machine, which gives you only an extremely crude snapshot of what a brain is doing, we can see exactly what every neuron in a neural net is doing at every point in time. If we don’t exploit that, then aren’t we trying to make AI safe with our hands tied behind our backs?
  
  Interesting metaphor: interpretability is a bit like fMRI for neural networks, but far more precise, because we can see exactly what every neuron is doing at every point in time.
  
  nlproc explainability
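
The watermarking idea quoted in annotations 1 and 3 can be sketched in a few lines. This is only a toy illustration, not OpenAI's prototype: the selection rule (pick the token that maximizes r^(1/p)) and the detection score (an average of ln(1/(1-r)) over n-grams) follow the scheme as I understand the lecture to describe it, while the uniform `toy_model`, the 50-word vocabulary, the context length `N`, and the use of HMAC-SHA256 as the pseudorandom function are all assumptions made for the example.

```python
import hashlib
import hmac
import math

N = 3  # number of previous tokens fed to the PRF (illustrative choice)
VOCAB = [f"w{i}" for i in range(50)]  # toy vocabulary, not a real tokenizer


def prf_uniform(key, context, token):
    """Keyed pseudorandom value in (0, 1) for a (context, token) pair."""
    msg = "\x1f".join(tuple(context) + (token,)).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    x = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (x + 0.5) / 2**52  # strictly between 0 and 1


def toy_model(context):
    """Stand-in for a language model: uniform next-token probabilities."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}


def pick_token(key, context, probs):
    """Choose the token maximizing r ** (1 / p), compared in log space."""
    return max(
        (t for t, p in probs.items() if p > 0),
        key=lambda t: math.log(prf_uniform(key, context, t)) / probs[t],
    )


def generate(key, length, prompt=("w0", "w1", "w2")):
    """Generate `length` tokens, watermarking every token after the prompt."""
    tokens = list(prompt)
    while len(tokens) < length:
        context = tokens[-N:]
        tokens.append(pick_token(key, context, toy_model(context)))
    return tokens


def watermark_score(key, tokens):
    """Average of ln(1 / (1 - r)) over all (context, token) windows."""
    total, count = 0.0, 0
    for i in range(N, len(tokens)):
        r = prf_uniform(key, tokens[i - N:i], tokens[i])
        total += math.log(1.0 / (1.0 - r))
        count += 1
    return total / count if count else 0.0
```

For text that was not produced with the key, each r behaves like a uniform random number, so the score averages about 1. Text produced by `generate` with the same key scores noticeably higher, and because the score is a sum over local windows, deleting a few tokens only disturbs the windows around each deletion.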

Tags

explainability

nlproc

Annotators

ravenscroftj

URL

scottaaronson.blog/
Nov 2022
www.exponentialview.co

🔮 Azeem's commentary: On the generative wave (Part 1)

1
1. ravenscroftj 21 Nov 2022
  
  in Public
  
  “The metaphor is that the machine understands what I’m saying and so I’m going to interpret the machine’s responses in that context.”
  
  Interesting metaphor for why humans are happy to trust outputs from generative models
  
  generative models machine learning ml explainability

Tags

ml explainability

machine learning

generative models

Annotators

ravenscroftj

URL

exponentialview.co/p/azeems-commentary-on-the-generative
