168 Matching Annotations
  1. Jan 2023
    1. how to encode not only individual words but words along with their context.

      i guess this is more sophisticated than the context count that we saw right at the beginning with word embeddings

    2. If we use word embeddings, the vector for cat

      embeddings are built by changing the embedding to increase the dot product between that word and its surrounding words. in this sense isn't it sort of like a special case of a language model? in that the model creates a probability distribution over sequences of words, i.e. how close two words are to each other in meaning

    3. pretrained models

      i.e. word embedding is a particular model that maps a word to a vector.

      another example is a language model that, for a sequence of words, predicts the next token

      entity extraction is another example of a pretrained model!

    4. In a model, this transfer is implemented via replacing randomly initialized embeddings with the pretrained ones (what is the same, copying weights from pretrained embeddings to your model).

      ??

    5. There are several types of transfer that are categorized nicely in Sebastian Ruder's blog post. Two large categories are transductive and inductive transfer learning: they divide all approaches into the ones where the task is the same and labels are only in the source (transductive), and where the tasks are different and labels are only in the target (inductive).

      don't get this

    1. use the classifier's accuracy as a measure of how well representations encode labels

      i.e. see whether a model can identify the meaning of these representations

    2. Well, you can't - the quality will be much lower. You need many heads in training to let them learn all these useful things.

      why is that, if they get pruned anyway?

    3. While the rare tokens head surely looks fun, don't overestimate it - most probably, this is a sign of overfitting. By looking at the least frequent tokens, a model tries to hang on to these rare "clues".

      think about this more

    4. rare tokens: the most important head on the first layer attends to the least frequent tokens in a sentence (this is true for models trained on different language pairs!).

      what does that mean exactly?

    5. Note that while BPE segmentation is deterministic, even with the same vocabulary a word can have different segmentations, e.g. un relat ed, u n relate d, un rel ated, etc.).

      if it's deterministic, how could it have different segmentations?

    6. Additionally, here LayerNorm has trainable parameters, scale and bias, which are used after normalization to rescale layer's outputs (or the next layer's inputs)

      what for?
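
      for reference, a minimal sketch of the usual formulation (writing \(\mu_x\), \(\sigma_x\) for the mean and standard deviation over the feature dimension of \(x\), and \(\epsilon\) for a small constant added for numerical stability):

      \[ \mathrm{LayerNorm}(x) = scale \odot \frac{x - \mu_x}{\sqrt{\sigma_x^2 + \epsilon}} + bias \]

      presumably the trainable scale and bias let the network rescale and shift the normalised values in case strict zero-mean, unit-variance outputs are not what the next layer needs.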

    7. The query is used when a token looks at others - it's seeking the information to understand itself better. The key is responding to a query's request: it is used to compute attention weights. The value is used to compute attention output: it gives information to the tokens which "say" they need it (i.e. assigned large weights to this token).

      don't get this
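
      a worked form may help: standard scaled dot-product attention (with \(d_k\) the key dimension) computes

      \[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \]

      so each token's query is dotted with every token's key to produce the attention weights, and the token's output is the weighted sum of the values.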

    8. Let's look at attention weights - which source words does the decoder use?

      this is showing you what source token is heavily weighted when finding a given output

    9. what is more important, leads to worse quality.

      again, how is it possible that it decreases the quality?

      maybe because meaning decays on that length scale?

    1. forces a model to give high probability not only to the target token but also to the words close to the target in the embedding space.

      don't really get that

    2. Clearly, these samples

      maybe change the temperature as you go through the sentence? seems like you might want to seed more variation early on, and then it becomes more important to increase coherence

    3. Generation Strategies

      are sentences fed a prefix? I guess if you're just sampling from a distribution, you can get different sentences with no input.

    4. Residual connections are very simple: they add input of a block to its output.

      surely a dof is needed to allow the network to downweight these connections... Then again, maybe it's okay because there are gates that these combinations pass through later

    5. between the model prediction distribution p and the empirical target distribution p∗. With many training examples, this is close to minimizing the distance to the actual target distribution.

      not sure I fully get that

    6. (for the correct token \(y_t\)), we will get

      that is, we're trying to maximise the probability assigned to \(y_t\) – the correct next token/class given the context

      I think the second equality below holds by definition of \(p_{y_t}\)
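
      spelling it out, with \(p^*\) the one-hot empirical target and \(p\) the model's distribution over the vocabulary \(V\):

      \[ \mathrm{Loss}(p^*, p) = -\sum_{i \in V} p^*_i \log p_i = -\log p_{y_t} \]

      since \(p^*\) puts all its mass on \(y_t\), minimising the cross-entropy for one example is exactly maximising \(\log p_{y_t}\).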

    7. Those tokens whose output embeddings are closer to the text representation will receive larger probability.

      this connects with the Word Embeddings lecture. we're basically computing the dot product (an unnormalised cosine similarity) between the input and output embeddings

    8. Applying the final linear layer is equivalent to evaluating the dot product between text representation h and each of the output word embeddings.

      and note that output embeddings have same dimension as input embedding of the context... does it make sense that context (which is generally composed of many words) has the same dimensionality as tokens (typically single words)
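
      a sketch of that claim, writing \(h\) for the text representation and \(e_w\) for the output embedding of token \(w\) (the \(w\)-th row of the final linear layer):

      \[ p(y_t = w \mid h) = \frac{\exp(h^\top e_w)}{\sum_{w' \in V} \exp(h^\top e_{w'})} \]

      the dot product is only defined because \(h\) and the output embeddings share the same dimensionality, which is exactly the point questioned above.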

    9. linear layer

      by training, this linear layer ensures that the V classes correspond to elements in the vocabulary. do we specify each class as being a particular element in the vocabulary?

    10. coefficients λi can be picked by cross-validation on the development set

      are these constant across the language model? or do they somehow vary based on context
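
      for context, these λ's come from linearly interpolating n-gram models of different orders, e.g. for trigrams:

      \[ \hat{P}(w_t \mid w_{t-2}, w_{t-1}) = \lambda_3 P(w_t \mid w_{t-2}, w_{t-1}) + \lambda_2 P(w_t \mid w_{t-1}) + \lambda_1 P(w_t), \qquad \sum_i \lambda_i = 1,\ \lambda_i \ge 0 \]

      in the simplest scheme the λ's are global constants tuned on the development set; more refined schemes let them depend on the context (e.g. on how frequent the context is).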

    1. The only difference from SVMs in classical approaches (on top of bag-of-words and bag-of-ngrams) is the choice of a kernel: here the RBF kernel is better

      don't know what this means

    2. A good neural network will learn to represent input texts in such a way that text vectors will point in the direction of the corresponding class vectors.

      but it seems like there's a degree of freedom in this, i.e. what you need is the vector representation of the input text together with the vector representations of the classes, w, to give the right probabilities as an output

    3. Discriminative models are interested only in the conditional probability p(y|x), i.e. they learn only the border between classes.

      think about why this is true, i.e. why just learning the conditional probability is equivalent to just finding boundaries between classes

    1. Detect Words that Changed Their Usage

      interesting (related to disambiguating entities). an obvious approach is to look at the nearest neighbours of identical words within the corpora and see how similar they are – either simply count based, i.e. how many of the nearest neighbours are the same, or cosine similarity, e.g. what is the average cosine similarity of the nearest neighbours of the word
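
      a minimal sketch of the neighbour-overlap idea above, assuming (not from the source) two independently trained embedding matrices emb_old and emb_new with a shared, row-aligned vocabulary:

      ```python
      import numpy as np

      def top_k_neighbours(emb: np.ndarray, idx: int, k: int = 10) -> set:
          """Indices of the k nearest neighbours of word `idx` by cosine similarity."""
          normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
          sims = normed @ normed[idx]
          sims[idx] = -np.inf  # exclude the word itself
          return set(np.argsort(-sims)[:k].tolist())

      def usage_shift(emb_old: np.ndarray, emb_new: np.ndarray, idx: int, k: int = 10) -> float:
          """1 - Jaccard overlap of the word's neighbour sets in the two corpora."""
          old_nn = top_k_neighbours(emb_old, idx, k)
          new_nn = top_k_neighbours(emb_new, idx, k)
          return 1.0 - len(old_nn & new_nn) / len(old_nn | new_nn)

      # words with the largest shift are candidates for having changed their usage:
      # shifts = [usage_shift(emb_old, emb_new, i) for i in range(vocab_size)]
      ```

      comparing neighbour sets (rather than the vectors directly) sidesteps the fact that the two embedding spaces are not aligned with each other.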

    2. Repeat the derivations (loss and the gradients) for the case with one vector for each

      at a glance, it looks like it would be the same except for a non-linearity in the sum for \(u_w\) when w is the central word. come back to this

    3. objective function

      minimising the loss function is the same as maximising the likelihood. I.e. find the parameters \(\theta\) that will maximise the likelihood of the observation
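
      making the equivalence concrete for the skip-gram objective (corpus length \(T\), window size \(m\)):

      \[ L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta), \qquad J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta) \]

      because \(-\frac{1}{T}\log(\cdot)\) is strictly decreasing, the \(\theta\) that minimises the loss \(J(\theta)\) is exactly the \(\theta\) that maximises the likelihood \(L(\theta)\).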

    4. The objective forces word vectors to "know" contexts a word can appear in: the vectors are trained to predict possible contexts of the corresponding words

      ?

  2. Oct 2022
    1. vector of size

      presumably, this V is very big, right? e.g. 100000s of words. how do you determine what the tokens should be? i.e. what the domain of the output is

  3. Sep 2022
    1. choose \(\hat{\Theta}\) to be the sample mean

      we need to choose a function of the sample that will yield an estimate of some desired characteristic of the population

    1. \(n!\, f_X(x_1) f_X(x_2) \cdots f_X(x_n)\)

      presumably, this follows from the independence of the random variables (i.e. it's just the product of all the probabilities, with an \(n!\) to account for indistinguishability),

      i.e. we don't care whether it's the first sample that's the smallest or the 15th, we only care what the size of the 15th largest value is
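
      written out, this is the joint density of the order statistics \(X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}\) of an i.i.d. sample:

      \[ f_{X_{(1)}, \dots, X_{(n)}}(x_1, \dots, x_n) = n!\, f_X(x_1) f_X(x_2) \cdots f_X(x_n), \quad \text{for } x_1 \le x_2 \le \dots \le x_n \]

      (and 0 otherwise); the \(n!\) counts the orderings of the original sample that all produce the same sorted values.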

    2. \(\lim_{n \to \infty} P(|\overline{X} - \mu| \ge \epsilon) = 0\)

      if you take an arbitrarily large sample, the probability of discrepancy between the sample mean and the true mean µ gets arbitrarily small

    3. \(F_{X_1}(x) = F_{X_2}(x) = ... = F_{X_n}(x) = F_X(x)\); \(EX_i = EX = \mu < \infty\); \(0 < \mathrm{Var}(X_i) = \mathrm{Var}(X) = \sigma^2 < \infty\)

      don't 3 and 4 follow from 2?

    4. In general, \(X_i\) is the height of the \(i\)th person that is chosen uniformly and independently from the population.

      what does uniformly and independently really mean?

    1. you may say that frequentist (classical) inference deals with estimating non-random quantities, while Bayesian inference deals with estimating random variables.

      this seems like a bad definition. What does it mean for a quantity to be random? The bit that is sent is not truly random. They seem to be different ways of conceiving of randomness.

    2. At the receiver, \(X\), which is a noisy version of \(\Theta\), is received. The receiver has to recover \(\Theta\) from \(X\)

      i.e. there's noise at X in addition to the randomness of \(\Theta\)

    3. the unknown quantity \(\Theta\) is assumed to be a random variable, and we assume that we have some initial guess about the distribution of \(\Theta\).

      i.e. the outcome of the presidential election/percentage of people that will vote for A is undetermined, i.e. it is itself a random variable with some distribution

    4. There is an unknown quantity that we would like to estimate. We get some data. From the data, we estimate the desired quantity

      for example: the outcome of the presidential election. Get some polling data. From this data, estimate the most likely outcome.

    5. drawing conclusions from data that are prone to random variation.

      i.e. what is the truth/true state, given some data about the truth that imperfectly (randomly) reflects the true state

    6. randomness comes from the noise

      seems like randomness is any source of discrepancy between the 'truth' of something and the record of 'the truth' that we receive

    1. \(Gamma(1,\lambda) = Exponential(\lambda)\)

      Gamma is to Exponential as Pascal is to Geometric (which makes sense, since we said that exponential is the continuous limit of Geometric)
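
      making the analogy concrete:

      \[ X_1, \dots, X_n \overset{iid}{\sim} Exponential(\lambda) \ \Rightarrow\ \sum_{i=1}^{n} X_i \sim Gamma(n, \lambda), \qquad Y_1, \dots, Y_m \overset{iid}{\sim} Geometric(p) \ \Rightarrow\ \sum_{i=1}^{m} Y_i \sim Pascal(m, p) \]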

    1. \(\sigma_Y^2 = a^2 \sigma_X^2\).

      note that this follows from the general equation for variance transformation

      actually, it doesn't follow that these distributions are necessarily normal
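
      the general equation referred to here: for \(Y = aX + b\),

      \[ \mathrm{Var}(Y) = \mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X), \quad \text{i.e. } \sigma_Y^2 = a^2 \sigma_X^2 \]

      which holds for any \(X\) with finite variance, normal or not.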

    1. where \(x_1, x_2, ..., x_n\) are real solutions to \(g(x) = y\).

      for instance, in the graph above, consider the horizontal line y = 3.

      Then there is a separate solution to g(x) = 3 in each of the 3 partitions – a, b, c from left to right – so that \(f_Y(y)\) is the sum of the contributions from each of those 3 solutions.
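
      for reference, the full formula this quote comes from (each solution \(x_i\) contributes a density term scaled by the local slope of \(g\)):

      \[ f_Y(y) = \sum_{i} \frac{f_X(x_i)}{|g'(x_i)|}, \qquad \text{where } x_1, x_2, \dots \text{ are the real solutions of } g(x) = y \]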

    2. To find the PDF of \(Y\), we differentiate \(f_Y(y) = \frac{d}{dy} F_X(x_1)\)

      this seems a dodgy proof; why are we differentiating with respect to the variable that parametrises our function?

    3. Note that since \(g\) is strictly increasing, its inverse function \(g^{-1}\) is well defined.

      this is not the case for x^2 for instance, that's why sqrt(x) isn't well defined everywhere

    4. \(g(x)\) is a strictly increasing function

      surely this isn't generally satisfied, e.g. for a uniform distribution over some interval, the cdf is not strictly increasing over all space

    1. the CDF is a continuous function, i.e., it does not have any jumps.

      is this still true if the PDF is only defined in certain regions of the real line?

    1. This is because of symmetry: no marble is more likely to be chosen as the \(i\)th marble than any other marbles.

      but if the marble chosen is blue, doesn't this change the probability of picking a blue marble?

    1. PMFs of DISTRIBUTIONS

      x is the value taken by the random variable in question.

      • Bernoulli(p): Indicator variable, which is 1 if the outcome occurs (e.g. heads) and 0 if not.

      • Geometric(x, p) = probability that you need to repeat the Bernoulli trial x times to get a success

      • Binomial(x, n, p) = probability that in n Bernoulli trials you get x successes.

      • Pascal(x, m, p) = probability that you need to repeat a Bernoulli trial x times to get m successes. This is a generalisation of the Geometric dist., where Pascal(x,1,p) = Geometric(x,p)

      • Hypergeometric(x, b, r, k) = the probability that you get x blue balls when drawing a sample of k balls from a bag of b blue and r red balls.

      • Poisson(x, L) = the probability that you get x events in an interval when the average number of events per interval is L.
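
      and, for reference, the corresponding PMFs (standard forms, x ranging over the support in each case):

      \[
      \begin{aligned}
      &\text{Bernoulli}(p): && P_X(1) = p, \quad P_X(0) = 1 - p \\
      &\text{Geometric}(p): && P_X(x) = p(1-p)^{x-1}, \quad x = 1, 2, 3, \dots \\
      &\text{Binomial}(n, p): && P_X(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \dots, n \\
      &\text{Pascal}(m, p): && P_X(x) = \binom{x-1}{m-1} p^m (1-p)^{x-m}, \quad x = m, m+1, \dots \\
      &\text{Hypergeometric}(b, r, k): && P_X(x) = \frac{\binom{b}{x} \binom{r}{k-x}}{\binom{b+r}{k}} \\
      &\text{Poisson}(\lambda): && P_X(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \dots
      \end{aligned}
      \]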

    2. \(P_Y(0) + P_Y(1) + P_Y(2) + P_Y(3)\)

      ok because k=0 and k=1 and k=2 etc. are mutually exclusive, so the probability of their union is just the sum of the probabilities of their occurrence

      why can't you renormalise the distribution by just saying it's the same as getting at least one email every 10/3 minutes?

    3. Pascal random variable with parameters \(m\) and \(p\)

      k is number of times you throw the coin, m is the number of times you see heads that you want to know the probability of, p is the probability of heads

    4. We usually define \(q = 1 - p\), so we can write \(P_X(k) = pq^{k-1}\), for \(k = 1, 2, 3, ...\). To say that a random variable has geometric distribution with parameter \(p\), we write \(X \sim Geometric(p)\). More formally, we have the following definition

      Q: Is there a way to define a geometric dist if the underlying trials can have more than 2 outcomes? E.g. would you use this if throwing a die, for the number of times you had to throw until you got a 5?

    5. the indicator random variable \(I_A\) for an event \(A\)

      so each event would have to have its own indicator random variable; if S = {a,b,c,d} then X_i = {1 if i, 0 otherwise}, for i in S.

    6. A Bernoulli random variable is a random variable that can only take two possible values, usually \(0\) and \(1\).

      for instance, if your random experiment is throwing a coin once, then S is {H, T} and X:S->R is {0,1}.

      Question: does the notation f:S->R mean that each element in S maps to one element in R? I think yes: but in general, you may have multiple elements mapping to the same values in the co-domain.

    7. some specific distributions that are used over and over in practice,

      for instance, I wrote down the distribution P(y) = p(1-p)^y for the probability of throwing a coin y times before getting a head, when the probability of getting a head is p.

    1. If we already know the PMF of \(X\), to find the PMF of \(Y = g(X)\), we can write \(P_Y(y) = P(Y = y) = P(g(X) = y) = \sum_{x: g(x) = y} P_X(x)\)

      intuitive: to find the probability that Y = y, add up the probabilities of all the x's.

    1. For a discrete random variable \(X\), we are interested in knowing the probabilities of \(X = x_k\).

      Essentially, a RV lets us take the things in the real world (events, such as a sequence of heads or tails) and turn them into a new thing that behaves in the same way, i.e. the output of the function X:S->RealNos can be thought of as a kind of new sample space, and we can assign probabilities to it.

  4. Aug 2022
    1. not conditionally independent given \(C\).

      given the additional info that C happened, we can no longer treat A and B as independent. I.e. the additional info that C happened leads to a dependence between A and B.

    2. One important lesson here is that, generally speaking, conditional independence neither implies (nor is it implied by) independence.

      Just because some events are conditionally independent doesn't mean they're unconditionally independent, and v.v. This makes sense because in fact there are always conditions on events, if only implicitly, so if conditional independence doesn't imply unconditional independence, the converse shouldn't hold either.