two-layer LSTM language models: forward and backward
how is this different to bi-directional RNNs?
Transfer: Put Representations Instead of Word Embeddings
this is a transfer because CoVe is pretrained on a different task (machine translation) and its representations are then reused in the task-specific model
how to encode not only individual words but words along with their context.
i guess this is more sophisticated than the context count that we saw right at the beginning with word embeddings
That's it!
the figure below is useful
If we use word embeddings, the vector for cat
embeddings are built by changing the vectors to increase the dot product between a word and its surrounding words. in this sense, isn't it sort of like a special case of a language model, in that it effectively defines a probability distribution over words given their neighbours, i.e. reflects how close two words are in meaning?
pretrained models
i.e. word embedding is a particular model that maps a word to a vector.
another example is a language model that, for a sequence of words, predicts the next token
a model pretrained for entity extraction is another example!
In a model, this transfer is implemented via replacing randomly initialized embeddings with the pretrained ones (which is the same as copying the weights from the pretrained embeddings into your model).
i.e. the pretrained embedding matrix simply becomes the initial weights of the model's embedding layer, instead of a random initialization (see the sketch below)
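A minimal sketch of what "copying weights" means in practice, assuming PyTorch; `pretrained_vectors` is a hypothetical matrix (e.g. word2vec/GloVe) aligned with the model's vocabulary:

```python
# Sketch: replace a randomly initialized embedding layer with pretrained vectors.
# `pretrained_vectors` is a stand-in [vocab_size x emb_dim] array, assumed to be
# row-aligned with the model's vocabulary.
import numpy as np
import torch
import torch.nn as nn

vocab_size, emb_dim = 10_000, 300
pretrained_vectors = np.random.randn(vocab_size, emb_dim).astype("float32")

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained_vectors),
    freeze=False,  # set True to keep the pretrained embeddings fixed during training
)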
There are several types of transfer that are categorized nicely in Sebastian Ruder's blog post. Two large categories are transductive and inductive transfer learning: they divide all approaches into the ones where the task is the same and labels are only in the source (transductive), and where the tasks are different and labels are only in the target (inductive).
don't get this — e.g. transductive: same task, but labels exist only in the source domain (domain adaptation); inductive: different tasks, labels only in the target task (e.g. pretrain a language model, then fine-tune on classification)
NMT
neural machine translation
use the classifier's accuracy as a measure of how well representations encode labels
i.e. see whether a model can identify the meaning of these representations
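A sketch of the probing idea: train a simple classifier on frozen representations and read its accuracy as a measure of how much label information they contain. The data here is synthetic; in practice the features would come from a pretrained encoder:

```python
# Probing-classifier sketch (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
reps = rng.normal(size=(1000, 768))      # frozen representations, one vector per example
labels = rng.integers(0, 2, size=1000)   # linguistic labels we probe for (e.g. number, POS)

probe = LogisticRegression(max_iter=1000).fit(reps[:800], labels[:800])
print("probe accuracy:", probe.score(reps[800:], labels[800:]))
```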
Well, you can't - the quality will be much lower. You need many heads in training to let them learn all these useful things.
why is that, if they get pruned anyway?
While the rare tokens head surely looks fun, don't overestimate it - most probably, this is a sign of overfitting. By looking at the least frequent tokens, a model tries to hang on to these rare "clues".
think about this more
rare tokens: the most important head on the first layer attends to the least frequent tokens in a sentence (this is true for models trained on different language pairs!).
what does that mean exactly?
inductive bias in a model
what's that? i.e. the assumptions an architecture builds in about the data before seeing any examples (e.g. convolutions assume locality)
Note that while BPE segmentation is deterministic, even with the same vocabulary a word can have different segmentations, e.g. un relat ed, u n relate d, un rel ated, etc.).
if it's deterministic, how could it have different segmentations? (the standard greedy merge procedure picks one segmentation, but the same vocabulary admits many other valid splits of the word; regularization methods expose these alternatives by randomly skipping merges)
Additionally, here LayerNorm has trainable parameters, scale and bias, which are used after normalization to rescale layer's outputs (or the next layer's inputs)
what for? (so the network can rescale or shift the normalized values if plain zero-mean/unit-variance outputs are not optimal, i.e. normalization doesn't reduce representational capacity)
normalize vector representation of each token.
is this just to ensure all vectors have the same magnitude? (not exactly: each vector is shifted to zero mean and scaled to unit variance across its components)
It independently normalizes vector representation
what exactly does vector normalisation mean here? (see the sketch below)
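A small numpy sketch of what "normalizing a token's vector" means here, under the usual LayerNorm definition: per token, subtract the mean and divide by the standard deviation over the feature dimension, then apply the trainable scale and bias:

```python
import numpy as np

def layer_norm(x, scale, bias, eps=1e-5):
    # x: [seq_len, d_model]; each token's vector is normalized independently.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normed = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per token
    return scale * normed + bias               # trainable rescale/shift of the outputs

x = np.random.randn(4, 8)                      # 4 tokens, d_model = 8
out = layer_norm(x, scale=np.ones(8), bias=np.zeros(8))
```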
The query is used when a token looks at others - it's seeking the information to understand itself better. The key is responding to a query's request: it is used to compute attention weights. The value is used to compute attention output: it gives information to the tokens which "say" they need it (i.e. assigned large weights to this token).
don't get this (see the sketch below for the mechanics)
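A numpy sketch of the query/key/value mechanics (single head, no masking, toy random projections): queries score against keys, and the softmaxed scores weight the values:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: [seq_len, d_k]; weights[i, j] = how much token i attends to token j.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ V, weights          # output = weighted sum of value vectors

x = np.random.randn(5, 16)               # 5 tokens
Wq, Wk, Wv = (np.random.randn(16, 16) for _ in range(3))
out, w = attention(x @ Wq, x @ Wk, x @ Wv)
```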
Let's look at attention weights - which source words does the decoder use?
this is showing you what source token is heavily weighted when finding a given output
attention and its output c(t).
i.e. \(c^{(t)}\) is the context vector, which depends on the attention scores
one decoder state
does the decoder state at step \(t\) receive the state from \(t-1\)?
what is more important, leads to worse quality.
again, how is it possible that it decreases the quality?
maybe because meaning decays on that length scale?
In reality, the exact solution is usually worse than the approximate ones we will be using.
what does that mean? how is it possible? (the model's single highest-probability sequence is often degenerate, e.g. very short or empty, so exact search over model probability can give worse translations than beam search, whose approximation happens to help)
forces a model to give high probability not only to the target token but also to the words close to the target in the embedding space.
don't really get that
extrinsic
for a specific task, requires additional data
we sample from
in this case are the K top tokens all equally likely to be picked? (no: the distribution is renormalized over the top K and tokens are sampled in proportion to their probabilities; see the sketch below)
Clearly, these samples
maybe change temp as you go through sentence? seems like you might want to seed more variation early on, then becomes important to increase coherence
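A sketch tying together the two notes above: top-K keeps the K most probable tokens and renormalizes (so they are not equally likely), and temperature rescales the logits before the softmax (and could in principle be varied over the course of generation). Names here are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_next(logits, temperature=1.0, top_k=None, seed=None):
    rng = np.random.default_rng(seed)
    probs = softmax(logits / temperature)       # temperature < 1 sharpens, > 1 flattens
    if top_k is not None:
        keep = np.argsort(probs)[-top_k:]       # indices of the K most probable tokens
        masked = np.zeros_like(probs)
        masked[keep] = probs[keep]
        probs = masked / masked.sum()           # renormalize: kept tokens are NOT equally likely
    return rng.choice(len(probs), p=probs)

logits = np.random.randn(50)                    # toy vocabulary of 50 tokens
token_id = sample_next(logits, temperature=0.8, top_k=10)
```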
Generation Strategies
are sentences fed a prefix? I guess if you're just sampling from a distribution, you can get different sentences with no input.
hidden dimensionality of 1024 neurons
meaning? that each layer's hidden state is a vector with 1024 components? need to look into LSTMs
Residual connections are very simple: they add input of a block to its output.
surely a degree of freedom is needed to allow the network to downweight these connections... Then again, maybe it's okay because there are gates/layers that these combined signals pass through later
between the model prediction distribution p and the empirical target distribution p∗. With many training examples, this is close to minimizing the distance to the actual target distribution.
not sure I fully get that (the empirical distribution puts all its mass on the observed token; averaged over many training examples, minimizing cross-entropy against these one-hot targets approximates minimizing the distance to the true conditional distribution)
the loss will be
which is the same as maximising the log-likelihood of \((y_1, y_2, \dots, y_n)\)
(for the correct token yt), we will get
that is, we're trying to maximise the probability assigned to \(y_t\) – the correct next token/class given the context
I think the second equality below holds by definition of \(p_{y_t}\)
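Writing the step out explicitly, with \(p^*\) the one-hot empirical target and \(p\) the model's distribution over the vocabulary:
\[
Loss(p^*, p) = -\sum_{i=1}^{|V|} p^*_i \log p_i = -\log p_{y_t},
\qquad
\text{total loss} = -\sum_{t=1}^{n} \log p(y_t \mid y_{<t}).
\]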
Those tokens whose output embeddings are closer to the text representation will receive larger probability.
this connects with the Word Embeddings lecture. we're basically taking the dot product (an unnormalized cosine similarity) between the text representation and the output embeddings
Applying the final linear layer is equivalent to evaluating the dot product between text representation h and each of the output word embeddings.
and note that output embeddings have same dimension as input embedding of the context... does it make sense that context (which is generally composed of many words) has the same dimensionality as tokens (typically single words)
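A numpy sketch of the claim above: multiplying the text representation h by the output embedding matrix is just taking its dot product with each output embedding; tokens whose output embeddings point in a similar direction get larger probability. Shapes and names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_model, vocab_size = 64, 1000
h = np.random.randn(d_model)                  # representation of the context
E_out = np.random.randn(vocab_size, d_model)  # output embeddings, one row per vocabulary token

logits = E_out @ h                            # logit for token w = dot(E_out[w], h)
probs = softmax(logits)                       # distribution over the next token
```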
then
how do we know that this softmax calculation of probability is sensible? probably need to look at a general ML lecture for this
linear layer
through training, this linear layer maps the hidden state to V scores, one per vocabulary element. do we specify each class as being a particular element in the vocabulary? (yes: output position i corresponds to token i by construction; training then shapes the weights so the right token gets the highest score)
the classes are vocabulary tokens.
i.e. next vocabulary token in the sequence
coefficients λi can be picked by cross-validation on the development set
are these constant across the language model? or do they somehow vary based on context (in the basic interpolation scheme they are single constants tuned on held-out data, though context-dependent variants exist)
almost the same as the way we earlier estimated the probability to pick a green ball from a basket
i.e. almost 'just frequency based'
Alternatively, you can apply greedy decoding: at each step,
don't really get the difference between these two options (sampling draws the next token at random from the predicted distribution; greedy decoding always takes the single most probable token)
a language model
i.e. a way of determining the probability of a word given context
We can not reliably estimate sentence probabilities if we treat them as atomic units
why? (because almost no full sentence occurs often enough in a corpus to estimate its probability from counts; the counts are far too sparse)
very (very!) important
otherwise overfit?
extracts a different feature
how do you ensure that a different feature is extracted each time? (you don't enforce it explicitly; different random initializations and the training signal usually push different filters to specialize on different patterns)
The only difference from SVMs in classical approaches (on top of bag-of-words and bag-of-ngrams) is the choice of a kernel: here the RBF kernel is better
don't know what this means
Maximum Likelihood Estimate (MLE) of model parameters
go back over probability course to remind what an MLE is
A good neural network will learn to represent input texts in such a way that text vectors will point in the direction of the corresponding class vectors.
but it seems like there's a degree of freedom in this, i.e. what you need is the vector representation of the input text together with the vector representations of the classes, w, to give the right probabilities as an output
Logistic Regression
is the thing that makes it logistic the softmax curve? (essentially: the logistic/sigmoid function for two classes, of which softmax is the multiclass generalization)
softmax
need softmax because if any probability goes to zero, you can never get it back
linear layer
just a non-square matrix multiplication, I think
which maximize the probability of the training data:
w comes in below in the calculation of the probabilities, as defined above
Maximum Likelihood Estimate (MLE) of the parameters.
i.e. pick parameters such that they maximise the joint probability?
feature representation of the input text
e.g. if it was BOW, \(f_i\) would be the number of occurrences of word i
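A sketch of the BOW feature extractor from the note above (\(f_i\) = count of word i), fed into a linear layer and softmax. The toy vocabulary, classes, and random weights are stand-ins:

```python
from collections import Counter
import numpy as np

vocab = ["good", "bad", "movie", "plot"]           # toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}

def bow_features(text):
    counts = Counter(text.lower().split())
    f = np.zeros(len(vocab))
    for w, c in counts.items():
        if w in word2id:
            f[word2id[w]] = c                      # f_i = number of occurrences of word i
    return f

W = np.random.randn(2, len(vocab))                 # 2 classes (e.g. positive / negative)
b = np.zeros(2)
logits = W @ bow_features("good movie good plot") + b
probs = np.exp(logits - logits.max()); probs /= probs.sum()
```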
in Naive Bayes, they had to
why?
defined how to use the features ourselves
i.e. we just used Bayes' rule
feature extractor
presumably you could also use embeddings of words as features and it might perform better
Discriminative models are interested only in the conditional probability p(y|x), i.e. they learn only the border between classes.
think about why this is true, i.e. why just learning the conditional probability is equivalent to just finding boundaries between classes
Text Classification
relevant to NER and entity classification
Detect Words that Changed Their Usage
interesting (related to disambiguating entities). an obvious approach: look at the nearest neighbours of the same word in the two corpora and compare them — either count-based (how many of the nearest neighbours are shared) or via cosine similarity (e.g. the average cosine similarity between the word's neighbours across corpora); see the sketch below
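A sketch of the approach described in the note: take the same word in two separately trained embedding spaces with a shared vocabulary, find its nearest neighbours in each, and compare the neighbour sets by simple overlap. The embedding matrices here are random stand-ins:

```python
import numpy as np

def nearest_neighbours(emb, idx, k=10):
    # cosine similarity of word `idx` against all words in its own space
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[idx]
    return set(np.argsort(-sims)[1:k + 1])         # drop the word itself

vocab_size, dim = 5000, 100
emb_old = np.random.randn(vocab_size, dim)         # stand-in: embeddings trained on corpus A
emb_new = np.random.randn(vocab_size, dim)         # stand-in: embeddings trained on corpus B

word_id, k = 42, 10
overlap = len(nearest_neighbours(emb_old, word_id, k) &
              nearest_neighbours(emb_new, word_id, k)) / k
# low overlap suggests the word's usage changed between the corpora
```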
Effect of Window Size
interesting to combine two different window sizes in some way
Repeat the derivations (loss and the gradients) for the case with one vector for each
at a glance, it looks like it would be the same except for a non-linearity in the sum for \(u_w\) when w is the central word. come back to this
distributed representations
why called this? (because a word's meaning is spread — distributed — across many dimensions of a dense vector, rather than stored in a single symbolic slot)
How the loss function and the gradients change for the CBOW model?
come back to this
sigmoid
why is the sigmoid used here but not in the full softmax objective? (negative sampling turns the problem into independent binary decisions — "is this a real context pair or a sampled one?" — and the sigmoid is the natural output for a binary decision)
context word
what does \(w \in Voc\) mean below? (the sum runs over every word w in the vocabulary) and why does it disappear in fig 4 below?
updates change
presumably, the updates should be greater each time to compensate for the fact that each word is updated fewer times
negative
why negative samples? (they are pushed away: the objective decreases their dot product with the centre word, so the model gets a contrastive signal without updating the whole vocabulary at every step)
two vectors
why? (mostly for optimization convenience: with separate centre and context vectors, a word never has to have a high dot product with itself when it appears in its own context window)
θ are all variables to be optimized
I thought Lena said above that the learned parameters are just the word vectors? (they are: here θ is exactly the collection of all centre and context vectors)
objective function
minimising the loss function is the same as maximising the likelihood. I.e. find the parameters \(\theta\) that will maximise the likelihood of the observation
compute probabilities of context words
how are these probabilities computed? clearly, that will determine what the suitable vectors are.
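A numpy sketch of how these probabilities are usually defined in Word2Vec: the probability of a context word o given a centre word c is a softmax over the dot products of the centre vector with every context vector in the vocabulary. Sizes and matrices are toy stand-ins:

```python
import numpy as np

vocab_size, dim = 1000, 50
V = np.random.randn(vocab_size, dim)   # centre-word vectors
U = np.random.randn(vocab_size, dim)   # context-word vectors (the "two vectors" per word)

def p_context_given_centre(centre_id):
    scores = U @ V[centre_id]          # dot product with every context vector
    scores -= scores.max()
    e = np.exp(scores)
    return e / e.sum()                 # P(o | c) = softmax over the vocabulary

probs = p_context_given_centre(7)
```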
The objective forces word vectors to "know" contexts a word can appear in: the vectors are trained to predict possible contexts of the corresponding words
? — i.e. a word's vector ends up encoding which contexts the word typically appears in, because the training objective literally asks it to predict those contexts
term-document
is this the matrix we saw above? terms on one axis, documents (rather than context windows) on the other
documents
how is document defined? is it a sentence?
timestep
? — i.e. one position in the sequence (one token step)
vector of size
presumably, this V is very big, right? e.g. 100000s of words. how do you determine what the tokens should be? i.e. what the domain of the output is
linear layer
what exactly is this? (a learned affine map: multiply by a weight matrix and add a bias, \(y = Wx + b\))
transduction
?
Word vectors and similarity
come back to this
mean squared error (MSE)
seems analogous to the variance
θ
the problem is that this is in practice inaccessible!
choose \(\hat{\Theta}\) to be the sample mean
we need to choose a function of the sample that will yield an estimate of some desired characteristic of the population
\(n!\, f_X(x_1) f_X(x_2) \cdots f_X(x_n)\)
presumably, this follows from the independence of the random variables (i.e. it's just the product of all the probabilities, with an \(n!\) to account for indistinguishability:
we don't care whether it's the first sample or the 15th that turns out to be the smallest, we only care what the \(i\)-th largest value is)
\(f_{X_{(i)}}(x)\)
the probability density of the \(i^{th}\) value (ordered from smallest to largest) at x
\(X_{(1)}\)
random variable that gives the smallest value in a sample of n items
\(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\)
transforming the variable X to standard normal
Properties of the sample mean
follow from the above?
\(\lim_{n\to\infty} P\left(|\overline{X}-\mu|\geq \epsilon\right)=0\)
if you take an arbitrarily large sample, the probability of discrepancy between the sample mean and the true mean µ gets arbitrarily small
\(\mathrm{Var}(\overline{X})=\frac{\sigma^2}{n}\)
Think about the proof for this
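The proof the note asks about is short: it uses independence of the \(X_i\) and the scaling property \(\mathrm{Var}(aX)=a^2\mathrm{Var}(X)\):
\[
\mathrm{Var}(\overline{X})
= \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i)
= \frac{n\sigma^2}{n^2}
= \frac{\sigma^2}{n}.
\]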
\(M_n(X)\) to indicate the distribution of the \(X_i\)'s.
i.e. it's a function of X
\(M_n\)
n is the number of samples used
\(F_{X_1}(x)=F_{X_2}(x)=\dots=F_{X_n}(x)=F_X(x)\); \(EX_i=EX=\mu<\infty\); \(0 < \mathrm{Var}(X_i)=\mathrm{Var}(X)=\sigma^2<\infty\)
don't 3 and 4 follow from 2?
In general, \(X_i\) is the height of the \(i\)th person that is chosen uniformly and independently from the population.
what does uniformly and independently really mean?
you may say that frequentist (classical) inference deals with estimating non-random quantities, while Bayesian inference deals with estimating random variables.
this seems like a bad definition. What does it mean for a quantity to be random? The bit that is sent is not truly random. They seem to be different ways of conceiving of randomness.
At the receiver, \(X\), which is a noisy version of \(\Theta\), is received. The receiver has to recover \(\Theta\) from \(X\)
i.e. there's noise at X in addition to the randomness of \(\Theta\)
the unknown quantity \(\Theta\) is assumed to be a random variable, and we assume that we have some initial guess about the distribution of \(\Theta\).
i.e. the outcome of the presidential election/percentage of people that will vote for A is undetermined, i.e. it is itself a random variable with some distribution
it depends on our random sample
is Y our random sample?
There is an unknown quantity that we would like to estimate. We get some data. From the data, we estimate the desired quantity
for example: the outcome of the presidential election. Get some polling data. From this data, estimate the most likely outcome.
drawing conclusions from data that are prone to random variation.
i.e. what is the truth/true state, given some data about the truth that imperfectly (randomly) reflects the true state
randomness comes from the noise
seems like randomness is any source of discrepancy between the 'truth' of something and the record of 'the truth' that we receive
randomness
what exactly does randomness mean in general?
Problem
skipped, again because not too worried about integrals at this stage
Problem
skipped
Problem
tricky
\(Gamma(1,\lambda) = Exponential(\lambda)\)
Gamma is to Exponential as Pascal is to Geometric (which makes sense, since we said that exponential is the continuous limit of Geometric)
\(\sigma_Y^2=a^2\sigma_X^2\).
note that this follows from the general equation for variance transformation
actually, it doesn't follow that these distributions are necessarily normal
\(\mu_Y=a\mu_X+b\)
follows from linearity of expectation
1−Φ(x)
probability that X>x, i.e. goes to zero for large x as expected
\(\Phi(0)=\frac{1}{2}\);
if F(x) = 1/2, does that make x the mode or median or mean? – median!
This integral does not have a closed form solution.
I've never understood exactly what that means (it means the result cannot be written as a finite combination of elementary functions — polynomials, exponentials, logs, trig; that's why the special function \(\Phi\) is defined and tabulated instead)
we are integrating the standard normal PDF from \(-\infty\) to \(\infty\).
i.e. \(\int P(x) dx = 1\)
is the Central Limit Theorem (CLT) that we will discuss later in the book
read this – chapter 7.1.2
\(p=\Delta \lambda\)
thus \(\lambda\) is the probability of success per unit time
\(\mathrm{Var}(X)=EX^2-(EX)^2=\frac{2}{\lambda^2}-\frac{1}{\lambda^2}=\frac{1}{\lambda^2}\).
the bigger lambda, the more sharply peaked the distribution -> the smaller the variance
\(f_X(x)=\lambda e^{-\lambda x}u(x)\).
doesn't grow arbitrarily large at negative x, because the unit step \(u(x)\) zeroes the density there (where \(e^{-\lambda x}\) alone would blow up)
Problem
error
\(\frac{f_X(x_1)}{|g'(x_1)|}\)
i.e. if g(x) is steep at \(x_1\), the transformed variable spends 'less time' near the corresponding y, so it makes sense that the PDF of Y should be reduced there.
This is not a problem, since P(X=0)=0
not just because it is a finite point, but also because it's a stationary point
where \(x_1, x_2, \dots, x_n\) are real solutions to \(g(x)=y\).
for instance, in the graph above, consider a horizontal line at \(y = 3\).
Then there is one solution of \(g(x) = 3\) in each of the 3 pieces — a, b, c from left to right — so \(f_Y(3)\) is the sum of the three corresponding contributions \(\frac{f_X(x_i)}{|g'(x_i)|}\).
The Method of Transformations
didn't fully get; worth looking over if it proves useful
\(\frac{f_X(x_1)}{g'(x_1)}\)
where \(y = g(x_1)\)
To find the PDF of \(Y\), we differentiate: \(f_Y(y) = \frac{d}{dy} F_X(x_1)\)
this seems a dodgy proof; why are we differentiating wrt to the variable that parametrises our function
where \(g(x_1)=y\)
unique point by monotonicity
\(X < g^{-1}(y)\)
the region of X that corresponds to the above region of Y
g(X)≤y
the region of Y that's less than y
Note that since \(g\) is strictly increasing, its inverse function \(g^{-1}\) is well defined.
this is not the case for \(x^2\) on all of the reals, for instance: it is not strictly increasing there, so its inverse is not well defined everywhere (you can't recover the sign of x), which is why \(\sqrt{x}\) only inverts it on \(x \geq 0\)
\(\frac{f_X(x_1)}{g'(x_1)}\)
the PDF of X at \(x_1\), divided by the derivative of g at \(x_1\), where \(x_1\) is the point with \(y = g(x_1)\)
\(g(x)\) is a strictly increasing function
surely this isn't generally satisfied, e.g. for a uniform distribution over some interval, the CDF is not strictly increasing over all of the real line (though note that g here is the transformation applied to X, not the CDF)
It is usually more straightforward to start from the CDF and then to find the PDF by taking the derivative of the CDF.
why is this generally true?
might not exactly show all possible values of X
why not?
More generally, for a set A
the probability that X lands in the set A is the integral of the PDF of X over A
absolutely continuous
meaning? (roughly: the CDF can be written as the integral of a density, so it has no jumps and no singular part)
the CDF is a continuous function, i.e., it does not have any jumps.
is this still true if the PDF is only defined in certain regions of the real line?
Problem
not super happy with this one
This is because of symmetry: the \(i\)th marble is no more likely to be chosen than any other marble.
but if the marble chosen is blue, doesn't this change the probability of picking a blue marble?
PMFs of DISTRIBUTIONS
x is the value taken by the random variable in question.
Bernoulli(p): indicator variable, which is 1 if the outcome occurs (e.g. heads, with probability p) and 0 otherwise.
Geometric(x, p) = probability that you need to repeat the Bernoulli trial x times to get the first success.
Binomial(x, n, p) = probability that in n Bernoulli trials you get exactly x successes.
Pascal(x, m, p) = probability that you need to repeat a Bernoulli trial x times to get m successes. This is a generalisation of the Geometric dist., where Pascal(x, 1, p) = Geometric(x, p).
Hypergeometric(x, b, r, k) = probability that you get x blue marbles when drawing k marbles (without replacement) from a bag of b blue and r red marbles.
Poisson(x, λ) = probability that you get x events in an interval in which the average number of events is λ.
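A sketch implementing a few of these PMFs directly from their formulas (standard library only), as a sanity check on the summaries above; function names are my own:

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    # probability of x successes in n Bernoulli(p) trials
    return comb(n, x) * p**x * (1 - p)**(n - x)

def geometric_pmf(x, p):
    # probability that the first success occurs on trial x (x = 1, 2, ...)
    return (1 - p)**(x - 1) * p

def pascal_pmf(x, m, p):
    # probability that the m-th success occurs on trial x (x = m, m+1, ...)
    return comb(x - 1, m - 1) * p**m * (1 - p)**(x - m)

def poisson_pmf(x, lam):
    # probability of x events when the average number of events is lam
    return exp(-lam) * lam**x / factorial(x)

assert abs(pascal_pmf(4, 1, 0.5) - geometric_pmf(4, 0.5)) < 1e-12  # Pascal(1, p) = Geometric(p)
```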
Note that for a fixed \(k\), we have
is it kosher to split up the limit like that? i.e. is lim(ab) = lim(a)·lim(b)? (yes, when both individual limits exist and are finite)
\(P_Y(0)+P_Y(1)+P_Y(2)+P_Y(3)\)
ok because the events k=0, k=1, k=2, etc. are mutually exclusive, so the probability of their union is just the sum of the probabilities of their occurrence
why can't you renormalise the distribution by just saying it's the same as getting at least one email every 10/3 minutes?
A random variable X
probability that X (number of customers that enter the shop) = k in the time when, on average, it should be λ
parameter λ
the average number of events per interval (the most likely value, the mode, is \(\lfloor\lambda\rfloor\), which is close to but not the same as the mean)
\(Poisson(5)\) random variable.
why is the value below always equal in probability to the mean (is that true?)
choose \(x\) blue marbles and \(k-x\) red marbles is \({b \choose x} {r \choose k-x}\)
why exactly
k
probability that I have to throw the coin exactly k times to observe m heads
Pascal random variable with parameters \(m\) and \(p\)
k is number of times you throw the coin, m is the number of times you see heads that you want to know the probability of, p is the probability of heads
P(B)
binomial
\(X=X_1+X_2+\dots+X_n\)
note that if \(X \sim P\) and \(Y \sim Q\), the distribution of \(X+Y\) is not simply "\(P+Q\)" in general (summing random variables convolves their distributions; it does not add the PMFs)
We usually define \(q=1-p\), so we can write \(P_X(k)=pq^{k-1}\), for \(k=1,2,3,\dots\). To say that a random variable has geometric distribution with parameter \(p\), we write \(X \sim Geometric(p)\). More formally, we have the following definition
Q: Is there a way to define a geometric dist if the underlying trials can have more than 2 outcomes? E.g. would you use this if throwing a die, for the number of times you had to throw until you got a 5? (yes: define "success" as rolling a 5, so p = 1/6, and the number of throws until the first 5 is Geometric(1/6); the other outcomes are lumped together as "failure")
the indicator random variable \(I_A\) for an event A
so each event would have to have its own indicator random variable; if S = {a,b,c,d} then X_i = {1 if i, 0 otherwise}, for i in S.
A Bernoulli random variable is a random variable that can only take two possible values, usually 0 and 1.
for instance, if your random experiment is throwing a coin once, then S is {H, T} and X:S->R is {0,1}.
Question: does the notation f:S->R mean that each element in S maps to one element in R? I think yes: but in general, you may have multiple elements mapping to the same values in the co-domain.
some specific distributions that are used over and over in practice,
for instance, I wrote down the distribution P(y) = p(1-p)^y for the probability of throwing a coin y times before getting a head, when the probability of getting a head is p.
\(E[X^2]-\mu_X^2\).
nothing disappears to get the nicer form of the variance: expanding the square gives \(-2\mu_X E[X] = -2\mu_X^2\), which combines with the \(+\mu_X^2\) term to leave \(-\mu_X^2\) (see below)
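Spelling out the expansion referred to above:
\[
\mathrm{Var}(X) = E\big[(X-\mu_X)^2\big]
= E[X^2] - 2\mu_X E[X] + \mu_X^2
= E[X^2] - 2\mu_X^2 + \mu_X^2
= E[X^2] - \mu_X^2.
\]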
If we already know the PMF of \(X\), to find the PMF of \(Y=g(X)\), we can write \(P_Y(y)=P(Y=y)=P(g(X)=y)=\sum_{x:g(x)=y} P_X(x)\)
intuitive: to find the probability that Y = y, add up the probabilities of all the x's.
for any set \(A \subset R_X\)
pretty sure this is also true for normal probabilities, why is it being stated explicitly here?
The phrase distribution function is usually reserved exclusively for the cumulative distribution function CDF
why?
The function
just gives the probabilities of each of the countable outcomes for X
For a discrete random variable \(X\), we are interested in knowing the probabilities of \(X=x_k\).
Essentially, a RV lets us take things in the real world (events, such as a sequence of heads or tails) and turn them into a new object that behaves in the same way; i.e. the output of the function X: S -> RealNos can be thought of as a kind of new sample space, and we can assign probabilities to it.
cron jobs
chronological/scheduled jobs
As a website owner, it’s your responsibility to keep your WordPress site, theme, and plugins updated to the latest versions.
site administration
not conditionally independent given \(C\).
given the additional info that C happened, we can no longer treat A and B as independent; i.e. knowing C creates a dependence between A and B.
One important lesson here is that, generally speaking, conditional independence neither implies (nor is it implied by) independence.
Just because some events are conditionally independent, doesn't mean they're unconditionally independent, and v.v. This makes sense because in fact there are always conditions on events, if only implicitly, so if conditional independence doesn't imply unconditional independence, the converse shouldn't hold either.
\(P(A|B,C)=P(A|C)\).
common sense generalisation of P(A|B) = P(A) for A independent of B.
P(A)P(B|A)P(C|A,B).
the probability of \(A \cap B \cap C\) is just (the probability that C happened given that \(A \cap B\) happened) times \(P(A \cap B)\), and \(P(A \cap B) = P(A)P(B|A)\)
P(A)P(B|A
P(A,B)
rewritten
the key idea is basically that the sample space is reduced to C
range of a function is always a subset
why not just have Range(f) = codomain(f)
Functions
paused here