Hypothesis

33 Matching Annotations

Dec 2023
www.semanticscholar.org www.semanticscholar.org

[PDF] Wayformer: Motion Forecasting via Simple & Efficient Attention Networks | Semantic Scholar

1
1. hxdflying 21 Dec 2023
  
  in Public
  
  Self-attention is naturally permutation equivariant, therefore, we maythink of them as set-encoders rather than sequence encoders. However, for modalities where thedata does follow a specific ordering, for example agent state across different time steps, it is ben-eficial to break permutation equivariance and utilize the sequence information. This is commonlydone through positional embeddings. For simplicity, we add learned positional embeddings for allmodalities. As not all modalities are ordered, the learned positional embeddings are initially set tozero, letting the model learn if it is necessary to utilize the ordering within a modality.
  
  在轨迹预测中，对于我们是否使用transformer中的positional Embeddings我们需要多方面考虑
  
  transformer
Visit annotations in context

Tags

transformer

Annotators

hxdflying

URL

semanticscholar.org/reader/44f6612c238297304331d6fe6aa4b4f909f1c6f0
arxiv.org arxiv.org

2210.09461.pdf

1
1. devrimcavusoglu 14 Dec 2023
  
  in Public
  
  TOKEN MERGING: YOUR VIT BUT FASTER
  
  token-merging tome iclr-2023 iclr facebook vit transformer computer-vision
Visit annotations in context

Tags

token-merging

transformer

computer-vision

vit

tome

iclr-2023

facebook

iclr

Annotators

devrimcavusoglu

URL

arxiv.org/pdf/2210.09461.pdf
Apr 2023
ar5iv.labs.arxiv.org ar5iv.labs.arxiv.org

What learning algorithm is in-context learning? Investigations with linear models

1
1. mshook 22 Apr 2023
  
  in Public
  
  While past work has characterized what kinds of functions ICL can learn (Garg et al., 2022; Laskin et al., 2022) and the distributional properties of pretraining that can elicit in-context learning (Xie et al., 2021; Chan et al., 2022), but how ICL learns these functions has remained unclear. What learning algorithms (if any) are implementable by deep network models? Which algorithms are actually discovered in the course of training? This paper takes first steps toward answering these questions, focusing on a widely used model architecture (the transformer) and an extremely well-understood class of learning problems (linear regression).
  
  icl how algorithm transformer stanford mit linerar regression
Visit annotations in context

Tags

algorithm

stanford

icl

mit

how

linerar

regression

transformer

Annotators

mshook

URL

ar5iv.labs.arxiv.org/html/2211.15661
clementneo.com clementneo.com

We Found An Neuron in GPT-2

1
1. mshook 12 Apr 2023
  
  in Public
  
  It seems like the neuron basically adds the embedding of “ an” to the residual stream, which increases the output probability for “ an” since the unembedding step consists of taking the dot product of the final residual with each token2.
  
  This cleared the dust from my eyes in understanding what the MLP layer does
  
  transformer embedding how residual explanation mlp
Visit annotations in context

Tags

embedding

explanation

how

residual

mlp

transformer

Annotators

mshook

URL

clementneo.com/posts/2023/02/11/we-found-an-neuron
www.semanticscholar.org www.semanticscholar.org

[PDF] What Can Transformers Learn In-Context? A Case Study of Simple Function Classes | Semantic Scholar

2
1. mshook 01 Apr 2023
  
  in Public
  
  We use a decoder-only Transformer architecture [Vaswani et al., 2017] from the GPT-2family
  
  gpt2 gpt transformer
2. mshook 01 Apr 2023
  
  in Public
  
  a random function f
  
  a random function not many or several
  
  icl transformer gpt2 gpt function
Visit annotations in context

Tags

icl

function

gpt

gpt2

transformer

Annotators

mshook

URL

semanticscholar.org/reader/de32da8f5c6a50a6c311e9357ba16aa7d05a1bc9
Mar 2023
www.lesswrong.com www.lesswrong.com

interpreting GPT: the logit lens - LessWrong

1
1. mshook 14 Mar 2023
  
  in Public
  
  copying a rare token
  
  This is a Induction Head at work, yes?
  
  induction head transformer icl
Visit annotations in context

Tags

induction

head

icl

transformer

Annotators

mshook

URL

lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
Feb 2023
clementneo.com clementneo.com

We Found An Neuron in GPT-2

2
1. mshook 15 Feb 2023
  
  in Public
  
  Logit Lens
  
  https://colab.research.google.com/github/UFO-101/an-neuron/blob/main/an_neuron_investigation.ipynb#scrollTo=uC3apWHK5KMa
  
  logit code colab transformer
2. mshook 15 Feb 2023
  
  in Public
  
  The code to reproduce our results can be found here.
  
  https://github.com/UFO-101/an-neuron
  
  https://colab.research.google.com/github/UFO-101/an-neuron/blob/main/an_neuron_investigation.ipynb
  
  code transformer gpt interpretability colab ipynb cool
Visit annotations in context

Tags

colab

code

gpt

cool

logit

ipynb

interpretability

transformer

Annotators

mshook

URL

clementneo.com/posts/2023/02/11/we-found-an-neuron
e2eml.school e2eml.school

Transformers from Scratch

3
1. mshook 12 Feb 2023
  
  in Public
  
  The second purpose of skip connections is specific to transformers — preserving the original input sequence.
  
  transformer why residual architecture skip
2. mshook 12 Feb 2023
  
  in Public
  
  Skip connections serve two purposes. The first is that they help keep the gradient smooth, which is a big help for backpropagation. Attention is a filter, which means that when it’s working correctly it will block most of what tries to pass through it.
  
  transformer architecture residual attention backpropagation
3. mshook 12 Feb 2023
  
  in Public
  
  Once we have the result of our attention step, a vector that includes the most recent word and a small collection of the words that have preceded it, we need to translate that into features, each of which is a word pair. Attention masking gets us the raw material that we need, but it doesn’t build those word pair features. To do that, we can use a single layer fully connected neural network.
  
  Early transformer exploration focused on the attention layer/mechanism.The MLP that follows the attention layer is now being explored. ROME for example.
  
  fact mlp transformer nlp layer
Visit annotations in context

Tags

layer

why

backpropagation

nlp

mlp

skip

fact

residual

architecture

attention

transformer

Annotators

mshook

URL

e2eml.school/transformers.html
www.lesswrong.com www.lesswrong.com

Induction heads - illustrated

1
1. mshook 11 Feb 2023
  
  in Public
  
  The central object in the transformer is the residual stream.
  
  transformer architecture nlp residual icl induction
Visit annotations in context

Tags

icl

induction

residual

architecture

nlp

transformer

Annotators

mshook

URL

lesswrong.com/posts/TvrfY4c9eaGLeyDkE/induction-heads-illustrated
arxiv.org arxiv.org

2202.05262.pdf

1
1. mshook 01 Feb 2023
  
  in Public
  
  the Elhage et al.(2021) study showing an information-copying role for self-attention.
  
  It turns out Meng does refer to induction heads, just not by name.
  
  induction attention ml nn transformer
Visit annotations in context

Tags

nn

induction

ml

attention

transformer

Annotators

mshook

URL

arxiv.org/pdf/2202.05262
Jan 2023
transformer-circuits.pub transformer-circuits.pub

A Mathematical Framework for Transformer Circuits

2
1. mshook 26 Jan 2023
  
  in Public
  
  One of the main features of the high level architecture of a transformer is that each layer adds its results into what we call the “residual stream.”Constructing models with a residual stream traces back to early work by the Schmidhuber group, such as highway networks and LSTMs, which have found significant modern success in the more recent residual network architecture . In transformers, the residual stream vectors are often called the “embedding.” We prefer the residual stream terminology, both because it emphasizes the residual nature (which we believe to be important) and also because we believe the residual stream often dedicates subspaces to tokens other than the present token, breaking the intuitions the embedding terminology suggests. The residual stream is simply the sum of the output of all the previous layers and the original embedding. We generally think of the residual stream as a communication channel, since it doesn't do any processing itself and all layers communicate through it.
  
  transformer residual architecture explanation colah attention image
2. mshook 26 Jan 2023
  
  in Public
  
  A transformer starts with a token embedding, followed by a series of “residual blocks”, and finally a token unembedding. Each residual block consists of an attention layer, followed by an MLP layer. Both the attention and MLP layers each “read” their input from the residual stream (by performing a linear projection), and then “write” their result to the residual stream by adding a linear projection back in. Each attention layer consists of multiple heads, which operate in parallel.
  
  transformer residual architecture alternative colah explanation
Visit annotations in context

Tags

colah

alternative

image

explanation

residual

architecture

attention

transformer

Annotators

mshook

URL

transformer-circuits.pub/2021/framework/index.html
theaisummer.com theaisummer.com

How Transformers work in deep learning and NLP: an intuitive introduction

1
1. mshook 20 Jan 2023
  
  in Public
  
  You see the values of the self-attention weights are computed on the fly. They are data-dependent dynamic weights because they change dynamically in response to the data (fast weights).
  
  attention transformer why weight layer nlp
Visit annotations in context

Tags

layer

why

weight

nlp

attention

transformer

Annotators

mshook

URL

theaisummer.com/transformer/
ar5iv.labs.arxiv.org ar5iv.labs.arxiv.org

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

1
1. mshook 17 Jan 2023
  
  in Public
  
  This input embedding is the initial value of the residual stream, which all attention layers and MLPs read from and write to.
  
  transformer interpretability ml nlp attention circuit
Visit annotations in context

Tags

circuit

nlp

ml

attention

interpretability

transformer

Annotators

mshook

URL

ar5iv.labs.arxiv.org/html/2211.00593
Sep 2022
e2eml.school e2eml.school

Transformers from Scratch

1
1. mshook 06 Sep 2022
  
  in Public
  
  To see how this plays out, we can continue looking at matrix shapes. Tracing the matrix shape through the branches and weaves of the multihead attention blocks requires three more numbers. d_k: dimensions in the embedding space used for keys and queries. 64 in the paper. d_v: dimensions in the embedding space used for values. 64 in the paper. h: the number of heads. 8 in the paper.
  
  transformer image dimension array attention scratch
Visit annotations in context

Tags

image

scratch

array

dimension

attention

transformer

Annotators

mshook

URL

e2eml.school/transformers.html
pyimagesearch.com pyimagesearch.com

A Deep Dive into Transformers with TensorFlow and Keras: Part 1 - PyImageSearch

1
1. mshook 05 Sep 2022
  
  in Public
  
  Now, the progression of NLP, as discussed, tells a story. We begin with tokens and then build representations of these tokens. We use these representations to find similarities between tokens and embed them in a high-dimensional space. The same embeddings are also passed into sequential models that can process sequential data. Those models are used to build context and, through an ingenious way, attend to parts of the input sentence that are useful to the output sentence in translation.
  
  transformer concise explanation paragraph ml nlp nn dimension
Visit annotations in context

Tags

dimension

nlp

ml

nn

concise

paragraph

explanation

transformer

Annotators

mshook

URL

pyimagesearch.com/2022/09/05/a-deep-dive-into-transformers-with-tensorflow-and-keras-part-1/
Jun 2022
direct.mit.edu direct.mit.edu

Human Language Understanding & Reasoning

1
1. mshook 14 Jun 2022
  
  in Public
  
  The dominant idea is one of attention, by which a representation at a position is computed as a weighted combination of representations from other positions. A common self-supervision objective in a transformer model is to mask out occasional words in a text. The model works out what word used to be there. It does this by calculating from each word position (including mask positions) vectors that represent a query, key, and value at that position. The query at a position is compared with the value at every position to calculate how much attention to pay to each position; based on this, a weighted average of the values at all positions is calculated. This operation is repeated many times at each level of the transformer neural net, and the resulting value is further manipulated through a fully connected neural net layer and through use of normalization layers and residual connections to produce a new vector for each word. This whole process is repeated many times, giving extra layers of depth to the transformer neural net. At the end, the representation above a mask position should capture the word that was there in the original text: for instance, committee as illustrated in Figure 1.
  
  transformer explanation attention qkv ml nn nlp language gpt good
Visit annotations in context

Tags

language

nlp

ml

gpt

nn

qkv

explanation

good

attention

transformer

Annotators

mshook

URL

direct.mit.edu/daed/article/151/2/127/110621/Human-Language-Understanding-amp-Reasoning
e2eml.school e2eml.school

Transformers from Scratch

1
1. mshook 09 Jun 2022
  
  in Public
  
  This trick of using a one-hot vector to pull out a particular row of a matrix is at the core of how transformers work.
  
  Matrix multiplication as table lookup
  
  transformer language nlp ml nn explanation matrix
Visit annotations in context

Tags

nn

matrix

explanation

language

nlp

ml

transformer

Annotators

mshook

URL

e2eml.school/transformers.html
May 2022
colab.research.google.com colab.research.google.com

Google Colaboratory

1
1. mshook 13 May 2022
  
  in Public
  
  The source sequence will be pass to the TransformerEncoder, which will produce a new representation of it. This new representation will then be passed to the TransformerDecoder, together with the target sequence so far (target words 0 to N). The TransformerDecoder will then seek to predict the next words in the target sequence (N+1 and beyond).
  
  translation spanish english ml transformer colab how meaning
Visit annotations in context

Tags

colab

meaning

ml

translation

english

spanish

how

transformer

Annotators

mshook

URL

colab.research.google.com/drive/1lAkTmwBpyVXyZEPFpp_DVUHnfkHoX_bC
Dec 2021
towardsdatascience.com towardsdatascience.com

Transformer models…How did it all start?

1
1. mshook 08 Dec 2021
  
  in Public
  
  The transformer model introduces the idea of instead of adding another complex mechanism (attention) to an already complex Seq2Seq model; we can simplify the solution by forgetting about everything else and just focusing on attention.
  
  ml attention history transformer why good
Visit annotations in context

Tags

history

why

good

ml

attention

transformer

Annotators

mshook

URL

towardsdatascience.com/transformer-models-how-did-it-all-start-2e5b385ddd93
Nov 2021
e2eml.school e2eml.school

Transformers from Scratch

1
1. mshook 26 Nov 2021
  
  in Public
  
  The selective-second-order-with-skips model is a useful way to think about what transformers do, at least in the decoder side. It captures, to a first approximation, what generative language models like OpenAI's GPT-3 are doing.
  
  transformer attention ml good explanation nn qkv
Visit annotations in context

Tags

nn

qkv

explanation

good

ml

attention

transformer

Annotators

mshook

URL

e2eml.school/transformers.html
towardsdatascience.com towardsdatascience.com

Transformers Explained Visually — Not just how, but Why they work so well

1
1. mshook 20 Nov 2021
  
  in Public
  
  The Query word can be interpreted as the word for which we are calculating Attention. The Key and Value word is the word to which we are paying attention ie. how relevant is that word to the Query word.
  
  Finally
  
  transformer query key value qkv attention ml nn good
Visit annotations in context

Tags

transformer

value

ml

nn

qkv

good

query

attention

key

Annotators

mshook

URL

towardsdatascience.com/transformers-explained-visually-not-just-how-but-why-they-work-so-well-d840bd61a9d3
www.lesswrong.com www.lesswrong.com

interpreting GPT: the logit lens - LessWrong

1
1. mshook 20 Nov 2021
  
  in Public
  
  Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.
  
  gpt how ml nn transformer belief attention
Visit annotations in context

Tags

nn

belief

how

gpt

ml

attention

transformer

Annotators

mshook

URL

lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
towardsdatascience.com towardsdatascience.com

A Deep Dive Into the Transformer Architecture — The Development of Transformer Models

2
1. mshook 17 Nov 2021
  
  in Public
  
  The attention layer (W in the diagram) computes three vectors based on the input, termed key, query, and value.
  
  Could you be more specific?
  
  attention how ml nn transformer
2. mshook 17 Nov 2021
  
  in Public
  
  Attention is a means of selectively weighting different elements in input data, so that they will have an adjusted impact on the hidden states of downstream layers.
  
  ml nn transformer attention good explanation
Visit annotations in context

Tags

ml

nn

explanation

good

how

attention

transformer

Annotators

mshook

URL

towardsdatascience.com/a-deep-dive-into-the-transformer-architecture-the-development-of-transformer-models-acbdf7ca34e0
Aug 2021
towardsdatascience.com towardsdatascience.com

Transformers – Towards Data Science

1
1. mshook 28 Aug 2021
  
  in Public
  
  So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.
  
  attention query key value good transformer
Visit annotations in context

Tags

key

value

query

good

attention

transformer

Annotators

mshook

URL

towardsdatascience.com/transformers-141e32e69591
arxiv.org arxiv.org

Big Bird: Transformers for Longer Sequences

1
1. mshook 06 Aug 2021
  
  in Public
  
  We show that BigBird is a universal approximator of sequence functions and is Turing complete,
  
  turing complete machine nlp transformer ml nn attention august 2021
Visit annotations in context

Tags

machine

august

complete

nlp

ml

nn

2021

attention

turing

transformer

Annotators

mshook

URL

arxiv.org/abs/2007.14062
Jan 2021
psyarxiv.com psyarxiv.com

Representing and Predicting Everyday Behavior

1
1. Grace1999 29 Jan 2021
  
  in BehSci
  
  Singh, M., Richie, R., & Bhatia, S. (2020, October 7). Representing and Predicting Everyday Behavior. https://doi.org/10.31234/osf.io/kb53h
  
  is:preprint lang:en COVID-19 Decision making Distributed semantics Machine learning Transformer models Behavior Science Modeling Prediction
Visit annotations in context

Tags

is:preprint

Modeling

COVID-19

Machine learning

Decision making

Behavior Science

Prediction

Transformer models

Distributed semantics

lang:en

Annotators

Grace1999

URL

psyarxiv.com/kb53h/
May 2020
github.com github.com

deepset-ai/haystack

1
1. edampf 29 May 2020
  
  in BehSci
  
  Deepset-ai/haystack. (2020). [Python]. deepset. https://github.com/deepset-ai/haystack (Original work published 2019)
  
  is:other Github lang:en information access modeling Haystack question answering developer transformer based model code database document prototyping algorithm neural model tutorial how-to
Visit annotations in context

Tags

information access

database

neural model

document

modeling

lang:en

question answering

developer

transformer based model

code

Haystack

algorithm

how-to

Github

prototyping

tutorial

is:other

Annotators

edampf

URL

github.com/deepset-ai/haystack

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators

URL

Tags

Annotators