3 Matching Annotations
  1. Apr 2023
    1. It seems like the neuron basically adds the embedding of “ an” to the residual stream, which increases the output probability for “ an” since the unembedding step consists of taking the dot product of the final residual with each token.

      This cleared the dust from my eyes in understanding what the MLP layer does.
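
      A minimal numpy sketch of that unembedding picture (toy sizes and a made-up token id for “ an”, not the model's actual weights): logits are dot products of the residual with each token's unembedding row, so writing that row into the residual bumps that token's logit.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      d_model, vocab = 64, 1000                 # toy sizes, purely illustrative
      W_U = rng.normal(size=(vocab, d_model))   # unembedding: one row per token
      an_id = 42                                # pretend token id for " an"

      resid = rng.normal(size=d_model)          # residual stream before the neuron writes
      logits_before = W_U @ resid               # unembedding = dot product with each row

      # Idealized neuron effect: add a scaled copy of the " an" direction to the residual.
      resid = resid + 3.0 * W_U[an_id]
      logits_after = W_U @ resid

      print(logits_after[an_id] - logits_before[an_id])  # large positive bump for " an"
      print(int(np.argmax(logits_after)) == an_id)       # " an" now tops the logits
      ```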

  2. Feb 2023
    1. Once we have the result of our attention step, a vector that includes the most recent word and a small collection of the words that have preceded it, we need to translate that into features, each of which is a word pair. Attention masking gets us the raw material that we need, but it doesn’t build those word pair features. To do that, we can use a single layer fully connected neural network.

      Early transformer exploration focused on the attention layer/mechanism. The MLP that follows the attention layer is now being explored as well, with ROME as one example.
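
      A rough position-wise sketch of that step, under assumed sizes and random weights (real transformer MLPs are learned and typically described as one hidden layer with an output projection): each hidden unit reads the attention output, the mixed "current word plus attended words" vector, through a learned direction and a nonlinearity, which is where detectors such as word-pair features can live.

      ```python
      import numpy as np

      rng = np.random.default_rng(1)
      d_model, d_hidden = 64, 256               # hypothetical sizes
      W_in = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
      b_in = np.zeros(d_hidden)
      W_out = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)

      def mlp(x):
          """Fully connected layer applied independently at each position.

          Each column of W_in is a direction in residual space; a hidden unit
          fires when the attention output points that way, e.g. a word pair."""
          hidden = np.maximum(0, x @ W_in + b_in)   # ReLU feature detectors
          return hidden @ W_out                     # project features back to d_model

      attn_output = rng.normal(size=(10, d_model))  # 10 positions, post-attention
      out = mlp(attn_output)
      print(out.shape)                              # (10, 64)
      ```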

  3. Feb 2022
    1. Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.