This is what is different between word2vec/skip-gram models and transformers: static vs. dynamic (contextual) embeddings. The former generates one embedding per word, regardless of the context it appears in. The latter generates a dynamic embedding for a word, since attention over the surrounding tokens is part of the encoder that produces the embeddings.
In more detail:
In word2vec, there is only one vector for 'bank': effectively a weighted average of 'bank' the financial institution and 'bank' the thing next to a river.
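For a concrete picture, here is a minimal sketch using the gensim library (assuming a gensim 4.x `Word2Vec` API; the toy corpus and hyperparameters are purely illustrative). The trained model is just a lookup table, so asking for 'bank' returns the same vector no matter which sentence it came from.

```python
from gensim.models import Word2Vec

# Two toy sentences where "bank" means different things.
sentences = [
    ["i", "deposited", "cash", "at", "the", "bank"],
    ["we", "sat", "on", "the", "river", "bank"],
]

# Train a tiny skip-gram model (sg=1) on the toy corpus.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

# Static embedding: a plain lookup, independent of the sentence.
vec = model.wv["bank"]
print(vec.shape)  # (50,) -- the one and only vector for "bank"
```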
Q/A
Q: The input embedding matrix of a transformer holds static embeddings, just like word2vec. How can I then get the contextual embedding of a specific word in a specific sentence?
A: Static embeddings use context for training and a lookup table for inference. Contextual embeddings use context in both training and inference.
The initial embedding layer still uses static embeddings, but the self-attention mechanism then produces a context-aware embedding for the word by looking at all the other words in the sentence. So when we have a different sentence using the same word, the dynamic embedding changes, since the attention values for that word will be different in different sentences. A sketch of extracting such an embedding follows below.
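Here is a minimal sketch of that idea, assuming the Hugging Face transformers library and a BERT-style encoder (the model name and the helper function are illustrative choices, not the only way to do this). We run "bank" through the encoder in two different sentences and read its row of the last hidden state; the two vectors differ because attention mixed in different contexts.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    # Tokenize and run the encoder; last_hidden_state holds one
    # context-aware vector per token in this sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Find the position of the token "bank" in this sentence.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index("bank")
    return outputs.last_hidden_state[0, idx]

v_money = bank_vector("I deposited cash at the bank.")
v_river = bank_vector("We had a picnic on the river bank.")

# Not identical: the contextual embedding depends on the sentence.
cos = torch.nn.functional.cosine_similarity(v_money, v_river, dim=0)
print(cos.item())
```

With a static lookup table the two calls would return the exact same vector; here the cosine similarity is well below 1 because each occurrence of "bank" attends to different surrounding words.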