It seems like the neuron basically adds the embedding of “ an” to the residual stream, which increases the output probability for “ an”, since the unembedding step consists of taking the dot product of the final residual with each token's embedding.²
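To make this concrete, here is a minimal sketch in plain PyTorch of why that works, using random weights and a made-up token id standing in for “ an” (both hypothetical, not GPT-2's actual values). With a tied embedding/unembedding matrix, as in GPT-2, adding a token's embedding to the residual stream adds that embedding's squared norm to the token's logit:

```python
import torch

# Toy dimensions matching GPT-2 small; the weights themselves are random,
# and the token id below is purely illustrative.
d_model, vocab_size = 768, 50257
an_token_id = 281  # hypothetical id for " an"

# Tied embedding/unembedding, as in GPT-2.
W_E = torch.randn(vocab_size, d_model) * 0.02  # [vocab, d_model]
W_U = W_E.T                                    # [d_model, vocab]

residual = torch.randn(d_model)  # stand-in for the final residual stream

# Unembedding: dot product of the final residual with each token's embedding.
logits_before = residual @ W_U

# The neuron's hypothesized effect: add the " an" embedding to the residual.
residual_after = residual + W_E[an_token_id]
logits_after = residual_after @ W_U

# The " an" logit increases by || W_E[an_token_id] ||^2.
print(logits_before[an_token_id].item(), logits_after[an_token_id].item())
```

Other tokens' logits shift too (by their embedding's dot product with the “ an” embedding), but the “ an” logit gets the largest boost, since a vector's dot product with itself exceeds its dot product with any other vector of equal or smaller norm.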
This picture really cleared things up for me about what the MLP layer does.