two-layer LSTM language models: forward and backward
how is this different to bi-directional RNNs?
Transfer: Put Representations Instead of Word Embeddings
this is a transfer because CoVe is pretrained on a different task (machine translation) and its representations are then reused in the task-specific model
how to encode not only individual words but words along with their context.
i guess this is more sophisticated than the context count that we saw right at the beginning with word embeddings
That's it!
the figure below is useful
If we use word embeddings, the vector for cat
embeddings are built by changing the vectors to increase the dot product between a word and its surrounding words. in this sense, isn't it sort of like a special case of a language model, in that it effectively defines a probability distribution over words given their neighbours, i.e. reflects how close two words are in meaning?
pretrained models
i.e. word embedding is a particular model that maps a word to a vector.
another example is a language model that, for a sequence of words, predicts the next token
a model pretrained for entity extraction is another example!
In a model, this transfer is implemented via replacing randomly initialized embeddings with the pretrained ones (which is the same as copying the weights from the pretrained embeddings into your model).
i.e. the pretrained embedding matrix simply becomes the initial weights of the model's embedding layer, instead of a random initialization (see the sketch below)
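A minimal sketch of what "copying weights" means in practice, assuming PyTorch; `pretrained_vectors` is a hypothetical matrix (e.g. word2vec/GloVe) aligned with the model's vocabulary:

```python
# Sketch: replace a randomly initialized embedding layer with pretrained vectors.
# `pretrained_vectors` is a stand-in [vocab_size x emb_dim] array, assumed to be
# row-aligned with the model's vocabulary.
import numpy as np
import torch
import torch.nn as nn

vocab_size, emb_dim = 10_000, 300
pretrained_vectors = np.random.randn(vocab_size, emb_dim).astype("float32")

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained_vectors),
    freeze=False,  # set True to keep the pretrained embeddings fixed during training
)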
There are several types of transfer that are categorized nicely in Sebastian Ruder's blog post. Two large categories are transductive and inductive transfer learning: they divide all approaches into the ones where the task is the same and labels are only in the source (transductive), and where the tasks are different and labels are only in the target (inductive).
don't get this — e.g. transductive: same task, but labels exist only in the source domain (domain adaptation); inductive: different tasks, labels only in the target task (e.g. pretrain a language model, then fine-tune on classification)
NMT
neural machine translation
use the classifier's accuracy as a measure of how well representations encode labels
i.e. see whether a model can identify the meaning of these representations
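A sketch of the probing idea: train a simple classifier on frozen representations and read its accuracy as a measure of how much label information they contain. The data here is synthetic; in practice the features would come from a pretrained encoder:

```python
# Probing-classifier sketch (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
reps = rng.normal(size=(1000, 768))      # frozen representations, one vector per example
labels = rng.integers(0, 2, size=1000)   # linguistic labels we probe for (e.g. number, POS)

probe = LogisticRegression(max_iter=1000).fit(reps[:800], labels[:800])
print("probe accuracy:", probe.score(reps[800:], labels[800:]))
```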
Well, you can't - the quality will be much lower. You need many heads in training to let them learn all these useful things.
why is that, if they get pruned anyway?
While the rare tokens head surely looks fun, don't overestimate it - most probably, this is a sign of overfitting. By looking at the least frequent tokens, a model tries to hang on to these rare "clues".
think about this more
rare tokens: the most important head on the first layer attends to the least frequent tokens in a sentence (this is true for models trained on different language pairs!).
what does that mean exactly?
inductive bias in a model
what's that? i.e. the assumptions an architecture builds in about the data before seeing any examples (e.g. convolutions assume locality)
Note that while BPE segmentation is deterministic, even with the same vocabulary a word can have different segmentations, e.g. un relat ed, u n relate d, un rel ated, etc.).
if it's deterministic, how could it have different segmentations? (the standard greedy merge procedure picks one segmentation, but the same vocabulary admits many other valid splits of the word; regularization methods expose these alternatives by randomly skipping merges)
Additionally, here LayerNorm has trainable parameters, scale and bias, which are used after normalization to rescale layer's outputs (or the next layer's inputs)
what for? (so the network can rescale or shift the normalized values if plain zero-mean/unit-variance outputs are not optimal, i.e. normalization doesn't reduce representational capacity)
normalize vector representation of each token.
is this just to ensure all vectors have the same magnitude? (not exactly: each vector is shifted to zero mean and scaled to unit variance across its components)
It independently normalizes vector representation
what exactly does vector normalisation mean here? (see the sketch below)
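A small numpy sketch of what "normalizing a token's vector" means here, under the usual LayerNorm definition: per token, subtract the mean and divide by the standard deviation over the feature dimension, then apply the trainable scale and bias:

```python
import numpy as np

def layer_norm(x, scale, bias, eps=1e-5):
    # x: [seq_len, d_model]; each token's vector is normalized independently.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normed = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per token
    return scale * normed + bias               # trainable rescale/shift of the outputs

x = np.random.randn(4, 8)                      # 4 tokens, d_model = 8
out = layer_norm(x, scale=np.ones(8), bias=np.zeros(8))
```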
The query is used when a token looks at others - it's seeking the information to understand itself better. The key is responding to a query's request: it is used to compute attention weights. The value is used to compute attention output: it gives information to the tokens which "say" they need it (i.e. assigned large weights to this token).
don't get this (see the sketch below for the mechanics)
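A numpy sketch of the query/key/value mechanics (single head, no masking, toy random projections): queries score against keys, and the softmaxed scores weight the values:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: [seq_len, d_k]; weights[i, j] = how much token i attends to token j.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V, weights          # output = weighted sum of value vectors

x = np.random.randn(5, 16)               # 5 tokens
Wq, Wk, Wv = (np.random.randn(16, 16) for _ in range(3))
out, w = attention(x @ Wq, x @ Wk, x @ Wv)
```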
Let's look at attention weights - which source words does the decoder use?
this is showing you what source token is heavily weighted when finding a given output
attention and its output c(t).
i.e. \(c^{(t)}\) is the context vector, which depends on the attention scores
one decoder state
does the decoder state at step \(t\) receive the state from \(t-1\)?
what is more important, leads to worse quality.
again, how is it possible that it decreases the quality?
maybe because meaning decays on that length scale?
In reality, the exact solution is usually worse than the approximate ones we will be using.
what does that mean? how is it possible? (the model's single highest-probability sequence is often degenerate, e.g. very short or empty, so exact search over model probability can give worse translations than beam search, whose approximation happens to help)
forces a model to give high probability not only to the target token but also to the words close to the target in the embedding space.
don't really get that
extrinsic
for a specific task, requires additional data
we sample from
in this case are the K top tokens all equally likely to be picked? (no: the distribution is renormalized over the top K and tokens are sampled in proportion to their probabilities; see the sketch below)
Clearly, these samples
maybe change temp as you go through sentence? seems like you might want to seed more variation early on, then becomes important to increase coherence
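A sketch tying together the two notes above: top-K keeps the K most probable tokens and renormalizes (so they are not equally likely), and temperature rescales the logits before the softmax (and could in principle be varied over the course of generation). Names here are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_next(logits, temperature=1.0, top_k=None, seed=None):
    rng = np.random.default_rng(seed)
    probs = softmax(logits / temperature)       # temperature < 1 sharpens, > 1 flattens
    if top_k is not None:
        keep = np.argsort(probs)[-top_k:]       # indices of the K most probable tokens
        masked = np.zeros_like(probs)
        masked[keep] = probs[keep]
        probs = masked / masked.sum()           # renormalize: kept tokens are NOT equally likely
    return rng.choice(len(probs), p=probs)

logits = np.random.randn(50)                    # toy vocabulary of 50 tokens
token_id = sample_next(logits, temperature=0.8, top_k=10)
```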
Generation Strategies
are sentences fed a prefix? I guess if you're just sampling from a distribution, you can get different sentences with no input.
hidden dimensionality of 1024 neurons
meaning? that each layer's hidden state is a vector with 1024 components? need to look into LSTMs
Residual connections are very simple: they add input of a block to its output.
surely a degree of freedom is needed to allow the network to downweight these connections... Then again, maybe it's okay because there are gates/layers that these combined signals pass through later
between the model prediction distribution p and the empirical target distribution p∗. With many training examples, this is close to minimizing the distance to the actual target distribution.
not sure I fully get that (the empirical distribution puts all its mass on the observed token; averaged over many training examples, minimizing cross-entropy against these one-hot targets approximates minimizing the distance to the true conditional distribution)
the loss will be
which is the same as maximising the log-likelihood of \((y_1, y_2, \dots, y_n)\)
(for the correct token yt), we will get
that is, we're trying to maximise the probability assigned to \(y_t\) – the correct next token/class given the context
I think the second equality below holds by definition of \(p_{y_t}\)
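Writing the step out explicitly, with \(p^*\) the one-hot empirical target and \(p\) the model's distribution over the vocabulary:
\[
Loss(p^*, p) = -\sum_{i=1}^{|V|} p^*_i \log p_i = -\log p_{y_t},
\qquad
\text{total loss} = -\sum_{t=1}^{n} \log p(y_t \mid y_{<t}).
\]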
Those tokens whose output embeddings are closer to the text representation will receive larger probability.
this connects with the Word Embeddings lecture. we're basically taking the dot product (an unnormalized cosine similarity) between the text representation and the output embeddings
Applying the final linear layer is equivalent to evaluating the dot product between text representation h and each of the output word embeddings.
and note that output embeddings have same dimension as input embedding of the context... does it make sense that context (which is generally composed of many words) has the same dimensionality as tokens (typically single words)
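A numpy sketch of the claim above: multiplying the text representation h by the output embedding matrix is just taking its dot product with each output embedding; tokens whose output embeddings point in a similar direction get larger probability. Shapes and names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_model, vocab_size = 64, 1000
h = np.random.randn(d_model)                  # representation of the context
E_out = np.random.randn(vocab_size, d_model)  # output embeddings, one row per vocabulary token

logits = E_out @ h                            # logit for token w = dot(E_out[w], h)
probs = softmax(logits)                       # distribution over the next token
```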
then
how do we know that this softmax calculation of probability is sensible? probably need to look at a general ML lecture for this
linear layer
through training, this linear layer maps the hidden state to V scores, one per vocabulary element. do we specify each class as being a particular element in the vocabulary? (yes: output position i corresponds to token i by construction; training then shapes the weights so the right token gets the highest score)
the classes are vocabulary tokens.
i.e. next vocabulary token in the sequence
coefficients λi can be picked by cross-validation on the development set
are these constant across the language model? or do they somehow vary based on context (in the basic interpolation scheme they are single constants tuned on held-out data, though context-dependent variants exist)
almost the same as the way we earlier estimated the probability to pick a green ball from a basket
i.e. almost 'just frequency based'
Alternatively, you can apply greedy decoding: at each step,
don't really get the difference between these two options (sampling draws the next token at random from the predicted distribution; greedy decoding always takes the single most probable token)
a language model
i.e. a way of determining the probability of a word given context
We can not reliably estimate sentence probabilities if we treat them as atomic units
why? (because almost no full sentence occurs often enough in a corpus to estimate its probability from counts; the counts are far too sparse)
very (very!) important
otherwise overfit?
extracts a different feature
how do you ensure that a different feature is extracted each time? (you don't enforce it explicitly; different random initializations and the training signal usually push different filters to specialize on different patterns)
The only difference from SVMs in classical approaches (on top of bag-of-words and bag-of-ngrams) is the choice of a kernel: here the RBF kernel is better
don't know what this means
Maximum Likelihood Estimate (MLE) of model parameters
go back over probability course to remind what an MLE is
A good neural network will learn to represent input texts in such a way that text vectors will point in the direction of the corresponding class vectors.
but it seems like there's a degree of freedom in this, i.e. what you need is the vector representation of the input text together with the vector representations of the classes, w, to give the right probabilities as an output
Logistic Regression
is the thing that makes it logistic the softmax curve? (essentially: the logistic/sigmoid function for two classes, of which softmax is the multiclass generalization)
softmax
need softmax because if any probability goes to zero, you can never get it back
linear layer
just a non-square matrix multiplication, I think
which maximize the probability of the training data:
w comes in below in the calculation of the probabilities, as defined above
Maximum Likelihood Estimate (MLE) of the parameters.
i.e. pick parameters such that they maximise the joint probability?
feature representation of the input text
e.g. if it was BOW, \(f_i\) would be the number of occurrences of word i
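A sketch of the BOW feature extractor from the note above (\(f_i\) = count of word i), fed into a linear layer and softmax. The toy vocabulary, classes, and random weights are stand-ins:

```python
from collections import Counter
import numpy as np

vocab = ["good", "bad", "movie", "plot"]           # toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}

def bow_features(text):
    counts = Counter(text.lower().split())
    f = np.zeros(len(vocab))
    for w, c in counts.items():
        if w in word2id:
            f[word2id[w]] = c                      # f_i = number of occurrences of word i
    return f

W = np.random.randn(2, len(vocab))                 # 2 classes (e.g. positive / negative)
b = np.zeros(2)
logits = W @ bow_features("good movie good plot") + b
probs = np.exp(logits - logits.max()); probs /= probs.sum()
```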
in Naive Bayes, they had to
why?
defined how to use the features ourselves
i.e. we just used Bayes' rule
feature extractor
presumably you could also use embeddings of words as features and it might perform better
Discriminative models are interested only in the conditional probability p(y|x), i.e. they learn only the border between classes.
think about why this is true, i.e. why just learning the conditional probability is equivalent to just finding boundaries between classes
Text Classification
relevant to NER and entity classification
Detect Words that Changed Their Usage
interesting (related to disambiguating entities). an obvious approach: look at the nearest neighbours of the same word in the two corpora and compare them — either count-based (how many of the nearest neighbours are shared) or via cosine similarity (e.g. the average cosine similarity between the word's neighbours across corpora); see the sketch below
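A sketch of the approach described in the note: take the same word in two separately trained embedding spaces with a shared vocabulary, find its nearest neighbours in each, and compare the neighbour sets by simple overlap. The embedding matrices here are random stand-ins:

```python
import numpy as np

def nearest_neighbours(emb, idx, k=10):
    # cosine similarity of word `idx` against all words in its own space
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[idx]
    return set(np.argsort(-sims)[1:k + 1])         # drop the word itself

vocab_size, dim = 5000, 100
emb_old = np.random.randn(vocab_size, dim)         # stand-in: embeddings trained on corpus A
emb_new = np.random.randn(vocab_size, dim)         # stand-in: embeddings trained on corpus B

word_id, k = 42, 10
overlap = len(nearest_neighbours(emb_old, word_id, k) &
              nearest_neighbours(emb_new, word_id, k)) / k
# low overlap suggests the word's usage changed between the corpora
```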
Effect of Window Size
interesting to combine two different window sizes in some way
Repeat the derivations (loss and the gradients) for the case with one vector for each
at a glance, it looks like it would be the same except for a non-linearity in the sum for \(u_w\) when w is the central word. come back to this
distributed representations
why called this? (because a word's meaning is spread — distributed — across many dimensions of a dense vector, rather than stored in a single symbolic slot)
How the loss function and the gradients change for the CBOW model?
come back to this
sigmoid
why is the sigmoid used here but not in the full softmax objective? (negative sampling turns the problem into independent binary decisions — "is this a real context pair or a sampled one?" — and the sigmoid is the natural output for a binary decision)
context word
what does \(w \in Voc\) mean below? (the sum runs over every word w in the vocabulary) and why does it disappear in fig 4 below?
updates change
presumably, the updates should be greater each time to compensate for the fact that each word is updated fewer times
negative
why negative samples? (they are pushed away: the objective decreases their dot product with the centre word, so the model gets a contrastive signal without updating the whole vocabulary at every step)
two vectors
why? (mostly for optimization convenience: with separate centre and context vectors, a word never has to have a high dot product with itself when it appears in its own context window)
θ are all variables to be optimized
I thought Lena said above that the learned parameters are just the word vectors? (they are: here θ is exactly the collection of all centre and context vectors)
objective function
minimising the loss function is the same as maximising the likelihood. I.e. find the parameters \(\theta\) that will maximise the likelihood of the observation
compute probabilities of context words
how are these probabilities computed? clearly, that will determine what the suitable vectors are.
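A numpy sketch of how these probabilities are usually defined in Word2Vec: the probability of a context word o given a centre word c is a softmax over the dot products of the centre vector with every context vector in the vocabulary. Sizes and matrices are toy stand-ins:

```python
import numpy as np

vocab_size, dim = 1000, 50
V = np.random.randn(vocab_size, dim)   # centre-word vectors
U = np.random.randn(vocab_size, dim)   # context-word vectors (the "two vectors" per word)

def p_context_given_centre(centre_id):
    scores = U @ V[centre_id]          # dot product with every context vector
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()                 # P(o | c) = softmax over the vocabulary

probs = p_context_given_centre(7)
```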
The objective forces word vectors to "know" contexts a word can appear in: the vectors are trained to predict possible contexts of the corresponding words
? — i.e. a word's vector ends up encoding which contexts the word typically appears in, because the training objective literally asks it to predict those contexts
term-document
is this the matrix we saw above? terms on one axis, documents (rather than context windows) on the other
documents
how is document defined? is it a sentence?
timestep
? — i.e. one position in the sequence (one token step)
vector of size
presumably, this V is very big, right? e.g. 100000s of words. how do you determine what the tokens should be? i.e. what the domain of the output is
linear layer
what exactly is this? (a learned affine map: multiply by a weight matrix and add a bias, \(y = Wx + b\))
transduction
?
Word vectors and similarity
come back to this
mean squared error (MSE)
seems analogous to the variance
θ
the problem is that this is in practice inaccessible!
choose \(\hat{\Theta}\) to be the sample mean
we need to choose a function of the sample that will yield an estimate of some desired characteristic of the population
\(n!\, f_X(x_1) f_X(x_2) \cdots f_X(x_n)\)
presumably, this follows from the independence of the random variables (i.e. it's just the product of all the probabilities, with an \(n!\) to account for indistinguishability:
we don't care whether it's the first sample or the 15th that turns out to be the smallest, we only care what the \(i\)-th largest value is)
\(f_{X_{(i)}}(x)\)
the probability density of the \(i^{th}\) value (ordered from smallest to largest) at x
\(X_{(1)}\)
random variable that gives the smallest value in a sample of n items
\(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\)
transforming the variable X to standard normal
Properties of the sample mean
follow from the above?
\(\lim_{n\to\infty} P\left(|\overline{X}-\mu|\geq \epsilon\right)=0\)
if you take an arbitrarily large sample, the probability of discrepancy between the sample mean and the true mean µ gets arbitrarily small
\(\mathrm{Var}(\overline{X})=\frac{\sigma^2}{n}\)
Think about the proof for this
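The proof the note asks about is short: it uses independence of the \(X_i\) and the scaling property \(\mathrm{Var}(aX)=a^2\mathrm{Var}(X)\):
\[
\mathrm{Var}(\overline{X})
= \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i)
= \frac{n\sigma^2}{n^2}
= \frac{\sigma^2}{n}.
\]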
\(M_n(X)\) to indicate the distribution of the \(X_i\)'s.
i.e. it's a function of X
\(M_n\)
n is the number of samples used
\(F_{X_1}(x)=F_{X_2}(x)=\dots=F_{X_n}(x)=F_X(x)\); \(EX_i=EX=\mu<\infty\); \(0 < \mathrm{Var}(X_i)=\mathrm{Var}(X)=\sigma^2<\infty\)
don't 3 and 4 follow from 2?
In general, \(X_i\) is the height of the \(i\)th person that is chosen uniformly and independently from the population.
what does uniformly and independently really mean?
you may say that frequentist (classical) inference deals with estimating non-random quantities, while Bayesian inference deals with estimating random variables.
this seems like a bad definition. What does it mean for a quantity to be random? The bit that is sent is not truly random. They seem to be different ways of conceiving of randomness.
At the receiver, \(X\), which is a noisy version of \(\Theta\), is received. The receiver has to recover \(\Theta\) from \(X\)
i.e. there's noise at X in addition to the randomness of \(\Theta\)
the unknown quantity \(\Theta\) is assumed to be a random variable, and we assume that we have some initial guess about the distribution of \(\Theta\).
i.e. the outcome of the presidential election/percentage of people that will vote for A is undetermined, i.e. it is itself a random variable with some distribution
it depends on our random sample
is Y our random sample?
There is an unknown quantity that we would like to estimate. We get some data. From the data, we estimate the desired quantity
for example: the outcome of the presidential election. Get some polling data. From this data, estimate the most likely outcome.
drawing conclusions from data that are prone to random variation.
i.e. what is the truth/true state, given some data about the truth that imperfectly (randomly) reflects the true state
randomness comes from the noise
seems like randomness is any source of discrepancy between the 'truth' of something and the record of 'the truth' that we receive
randomness
what exactly does randomness mean in general?
Problem
skipped, again because not too worried about integrals at this stage
Problem
skipped
Problem
tricky
\(Gamma(1,\lambda) = Exponential(\lambda)\)
Gamma is to Exponential as Pascal is to Geometric (which makes sense, since we said that exponential is the continuous limit of Geometric)
\(\sigma_Y^2=a^2\sigma_X^2\).
note that this follows from the general equation for variance transformation
actually, it doesn't follow that these distributions are necessarily normal
\(\mu_Y=a\mu_X+b\)
follows from linearity of expectation
1−Φ(x)
probability that X>x, i.e. goes to zero for large x as expected
\(\Phi(0)=\frac{1}{2}\);
if F(x) = 1/2, does that make x the mode or median or mean? – median!
This integral does not have a closed form solution.
I've never understood exactly what that means (it means the result cannot be written as a finite combination of elementary functions — polynomials, exponentials, logs, trig; that's why the special function \(\Phi\) is defined and tabulated instead)
we are integrating the standard normal PDF from \(-\infty\) to \(\infty\).
i.e. \(\int P(x) dx = 1\)
is the Central Limit Theorem (CLT) that we will discuss later in the book
read this – chapter 7.1.2
\(p=\Delta \lambda\)
thus \(\lambda\) is the probability of success per unit time
\(\mathrm{Var}(X)=EX^2-(EX)^2=\frac{2}{\lambda^2}-\frac{1}{\lambda^2}=\frac{1}{\lambda^2}\).
the bigger lambda, the more sharply peaked the distribution -> the smaller the variance
\(f_X(x)=\lambda e^{-\lambda x}u(x)\).
doesn't grow arbitrarily large at negative x, because the unit step \(u(x)\) zeroes the density there (where \(e^{-\lambda x}\) alone would blow up)
Problem
error
\(\frac{f_X(x_1)}{|g'(x_1)|}\)
i.e. if g(x) is steep at \(x_1\), the transformed variable spends 'less time' near the corresponding y, so it makes sense that the PDF of Y should be reduced there.
This is not a problem, since P(X=0)=0
not just because it is a finite point, but also because it's a stationary point
where \(x_1, x_2, \dots, x_n\) are real solutions to \(g(x)=y\).
for instance, in the graph above, consider a horizontal line at \(y = 3\).
Then there is one solution of \(g(x) = 3\) in each of the 3 pieces — a, b, c from left to right — so \(f_Y(3)\) is the sum of the three corresponding contributions \(\frac{f_X(x_i)}{|g'(x_i)|}\).
The Method of Transformations
didn't fully get; worth looking over if it proves useful
\(\frac{f_X(x_1)}{g'(x_1)}\)
where \(y = g(x_1)\)
To find the PDF of \(Y\), we differentiate: \(f_Y(y) = \frac{d}{dy} F_X(x_1)\)
this seems a dodgy proof; why are we differentiating wrt to the variable that parametrises our function
where \(g(x_1)=y\)
unique point by monotonicity
\(X < g^{-1}(y)\)
the region of X that corresponds to the above region of Y
g(X)≤y
the region of Y that's less than y
Note that since \(g\) is strictly increasing, its inverse function \(g^{-1}\) is well defined.
this is not the case for \(x^2\) on all of the reals, for instance: it is not strictly increasing there, so its inverse is not well defined everywhere (you can't recover the sign of x), which is why \(\sqrt{x}\) only inverts it on \(x \geq 0\)
\(\frac{f_X(x_1)}{g'(x_1)}\)
the PDF of X at \(x_1\), divided by the derivative of g at \(x_1\), where \(x_1\) is the point with \(y = g(x_1)\)
\(g(x)\) is a strictly increasing function
surely this isn't generally satisfied, e.g. for a uniform distribution over some interval, the CDF is not strictly increasing over all of the real line (though note that g here is the transformation applied to X, not the CDF)
It is usually more straightforward to start from the CDF and then to find the PDF by taking the derivative of the CDF.
why is this generally true?
might not exactly show all possible values of X
why not?
More generally, for a set A
the probability that X lands in the set A is the integral of the PDF of X over A
absolutely continuous
meaning? (roughly: the CDF can be written as the integral of a density, so it has no jumps and no singular part)
the CDF is a continuous function, i.e., it does not have any jumps.
is this still true if the PDF is only defined in certain regions of the real line?
Problem
not super happy with this one
This is because of symmetry: the \(i\)th marble is no more likely to be chosen than any other marble.
but if the marble chosen is blue, doesn't this change the probability of picking a blue marble?
PMFs of DISTRIBUTIONS
x is the value taken by the random variable in question.
Bernoulli(p): indicator variable, which is 1 if the outcome occurs (e.g. heads, with probability p) and 0 otherwise.
Geometric(x, p) = probability that you need to repeat the Bernoulli trial x times to get the first success.
Binomial(x, n, p) = probability that in n Bernoulli trials you get exactly x successes.
Pascal(x, m, p) = probability that you need to repeat a Bernoulli trial x times to get m successes. This is a generalisation of the Geometric dist., where Pascal(x, 1, p) = Geometric(x, p).
Hypergeometric(x, b, r, k) = probability that you get x blue marbles when drawing k marbles (without replacement) from a bag of b blue and r red marbles.
Poisson(x, λ) = probability that you get x events in an interval in which the average number of events is λ.
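A sketch implementing a few of these PMFs directly from their formulas (standard library only), as a sanity check on the summaries above; function names are my own:

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    # probability of x successes in n Bernoulli(p) trials
    return comb(n, x) * p**x * (1 - p)**(n - x)

def geometric_pmf(x, p):
    # probability that the first success occurs on trial x (x = 1, 2, ...)
    return (1 - p)**(x - 1) * p

def pascal_pmf(x, m, p):
    # probability that the m-th success occurs on trial x (x = m, m+1, ...)
    return comb(x - 1, m - 1) * p**m * (1 - p)**(x - m)

def poisson_pmf(x, lam):
    # probability of x events when the average number of events is lam
    return exp(-lam) * lam**x / factorial(x)

assert abs(pascal_pmf(4, 1, 0.5) - geometric_pmf(4, 0.5)) < 1e-12  # Pascal(1, p) = Geometric(p)
```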
Note that for a fixed \(k\), we have
is it kosher to split up the limit like that? i.e. is lim(ab) = lim(a)·lim(b)? (yes, when both individual limits exist and are finite)
\(P_Y(0)+P_Y(1)+P_Y(2)+P_Y(3)\)
ok because the events k=0, k=1, k=2, etc. are mutually exclusive, so the probability of their union is just the sum of the probabilities of their occurrence
why can't you renormalise the distribution by just saying it's the same as getting at least one email every 10/3 minutes?
A random variable X
probability that X (number of customers that enter the shop) = k in the time when, on average, it should be λ
parameter λ
the average number of events per interval (the most likely value, the mode, is \(\lfloor\lambda\rfloor\), which is close to but not the same as the mean)
\(Poisson(5)\) random variable.
why is the value below always equal in probability to the mean (is that true?)
choose \(x\) blue marbles and \(k-x\) red marbles is \({b \choose x} {r \choose k-x}\)
why exactly
k
probability that I have to throw the coin exactly k times to observe m heads
Pascal random variable with parameters \(m\) and \(p\)
k is number of times you throw the coin, m is the number of times you see heads that you want to know the probability of, p is the probability of heads
P(B)
binomial
\(X=X_1+X_2+\dots+X_n\)
note that if \(X \sim P\) and \(Y \sim Q\), the distribution of \(X+Y\) is not simply "\(P+Q\)" in general (summing random variables convolves their distributions; it does not add the PMFs)
We usually define \(q=1-p\), so we can write \(P_X(k)=pq^{k-1}\), for \(k=1,2,3,\dots\). To say that a random variable has geometric distribution with parameter \(p\), we write \(X \sim Geometric(p)\). More formally, we have the following definition
Q: Is there a way to define a geometric dist if the underlying trials can have more than 2 outcomes? E.g. would you use this if throwing a die, for the number of times you had to throw until you got a 5? (yes: define "success" as rolling a 5, so p = 1/6, and the number of throws until the first 5 is Geometric(1/6); the other outcomes are lumped together as "failure")
the indicator random variable \(I_A\) for an event A
so each event would have to have its own indicator random variable; if S = {a,b,c,d} then X_i = {1 if i, 0 otherwise}, for i in S.
A Bernoulli random variable is a random variable that can only take two possible values, usually 0 and 1.
for instance, if your random experiment is throwing a coin once, then S is {H, T} and X:S->R is {0,1}.
Question: does the notation f:S->R mean that each element in S maps to one element in R? I think yes: but in general, you may have multiple elements mapping to the same values in the co-domain.
some specific distributions that are used over and over in practice,
for instance, I wrote down the distribution P(y) = p(1-p)^y for the probability of throwing a coin y times before getting a head, when the probability of getting a head is p.
\(E[X^2]-\mu_X^2\).
nothing disappears to get the nicer form of the variance: expanding the square gives \(-2\mu_X E[X] = -2\mu_X^2\), which combines with the \(+\mu_X^2\) term to leave \(-\mu_X^2\) (see below)
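Spelling out the expansion referred to above:
\[
\mathrm{Var}(X) = E\big[(X-\mu_X)^2\big]
= E[X^2] - 2\mu_X E[X] + \mu_X^2
= E[X^2] - 2\mu_X^2 + \mu_X^2
= E[X^2] - \mu_X^2.
\]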
If we already know the PMF of \(X\), to find the PMF of \(Y=g(X)\), we can write \(P_Y(y)=P(Y=y)=P(g(X)=y)=\sum_{x:g(x)=y} P_X(x)\)
intuitive: to find the probability that Y = y, add up the probabilities of all the x's.
for any set \(A \subset R_X\)
pretty sure this is also true for normal probabilities, why is it being stated explicitly here?
The phrase distribution function is usually reserved exclusively for the cumulative distribution function CDF
why?
The function
just gives the probabilities of each of the countable outcomes for X
For a discrete random variable \(X\), we are interested in knowing the probabilities of \(X=x_k\).
Essentially, a RV lets us take things in the real world (events, such as a sequence of heads or tails) and turn them into a new object that behaves in the same way; i.e. the output of the function X: S -> RealNos can be thought of as a kind of new sample space, and we can assign probabilities to it.
cron jobs
chronological/scheduled jobs
As a website owner, it’s your responsibility to keep your WordPress site, theme, and plugins updated to the latest versions.
site administration
not conditionally independent given \(C\).
given the additional info that C happened, we can no longer treat A and B as independent; i.e. knowing C creates a dependence between A and B.
One important lesson here is that, generally speaking, conditional independence neither implies (nor is it implied by) independence.
Just because some events are conditionally independent, doesn't mean they're unconditionally independent, and v.v. This makes sense because in fact there are always conditions on events, if only implicitly, so if conditional independence doesn't imply unconditional independence, the converse shouldn't hold either.
\(P(A|B,C)=P(A|C)\).
common sense generalisation of P(A|B) = P(A) for A independent of B.
P(A)P(B|A)P(C|A,B).
the probability of \(A \cap B \cap C\) is just (the probability that C happened given that \(A \cap B\) happened) times \(P(A \cap B)\), and \(P(A \cap B) = P(A)P(B|A)\)
P(A)P(B|A
P(A,B)
rewritten
the key idea is basically that the sample space is reduced to C
range of a function is always a subset
why not just have Range(f) = codomain(f)
Functions
paused here