40 Matching Annotations
  1. Oct 2019
    1. $H=\{\mathbf{x} \mid \mathbf{w}^\top\mathbf{x}+b=0\}$

      What does this mean?

      H = {all points $\mathbf{x}$ such that $\mathbf{w}^\top\mathbf{x} + b = 0$}; that is the equation of the hyperplane.
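
      A minimal numeric sketch of this (the values of w, b and the points are made up for illustration): $\mathbf{w}^\top\mathbf{x}+b$ is zero exactly on the hyperplane, and its sign tells you which side a point falls on.

      ```python
      import numpy as np

      # Hypothetical 2D example: w = (1, 2), b = -3 defines the line x1 + 2*x2 - 3 = 0.
      w = np.array([1.0, 2.0])
      b = -3.0

      points = {
          "on H":          np.array([1.0, 1.0]),  # 1 + 2 - 3 = 0
          "positive side": np.array([3.0, 2.0]),  # 3 + 4 - 3 = 4
          "negative side": np.array([0.0, 0.0]),  # 0 + 0 - 3 = -3
      }
      for name, x in points.items():
          print(name, w @ x + b)  # zero on H, sign gives the side otherwise
      ```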

    2. SVM

      An amazing and simple video on SVM: https://www.youtube.com/watch?v=1NxnPkZM9bc

    3. The only difference is that we have the hinge-loss instead of the logistic loss.

      What are hinge loss and logistic loss?
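
      For my own reference, the standard definitions (with labels $y \in \{-1,+1\}$ and score $\mathbf{w}^\top\mathbf{x}+b$):

      ```latex
      \ell_{\text{hinge}}(\mathbf{x},y) = \max\bigl(0,\, 1 - y(\mathbf{w}^\top\mathbf{x}+b)\bigr)
      \qquad
      \ell_{\text{logistic}}(\mathbf{x},y) = \log\bigl(1 + e^{-y(\mathbf{w}^\top\mathbf{x}+b)}\bigr)
      ```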

    4. If the data is low dimensional it is often the case that there is no separating hyperplane between the two classes.

      Why?

    5. The slack variable $\xi_i$ allows the input $\mathbf{x}_i$ to be closer to the hyperplane

      How?

    1. $M\gamma \le \vec{w}\cdot\vec{w}^*$

      $\vec{w}_{\text{new}}\cdot\vec{w}^*$ = value after M updates; $\vec{w}_{\text{old}}\cdot\vec{w}^*$ = value before the M updates. Each update adds at least $\gamma$, so $\vec{w}_{\text{new}}\cdot\vec{w}^* \ge \vec{w}_{\text{old}}\cdot\vec{w}^* + M\gamma$, and starting from $\vec{w} = \vec{0}$ this gives $M\gamma \le \vec{w}_{\text{new}}\cdot\vec{w}^*$.
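
      Sketch of the single-update step behind that bound, using the margin definition $\gamma = \min_i y_i(\vec{x}_i\cdot\vec{w}^*)$ (standard perceptron convergence argument):

      ```latex
      \vec{w}_{\text{new}}\cdot\vec{w}^*
        = (\vec{w}_{\text{old}} + y\,\vec{x})\cdot\vec{w}^*
        = \vec{w}_{\text{old}}\cdot\vec{w}^* + y\,(\vec{x}\cdot\vec{w}^*)
        \;\ge\; \vec{w}_{\text{old}}\cdot\vec{w}^* + \gamma
      ```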

  2. Sep 2019
    1. $y^2=1$

      Because $y \in \{-1, +1\}$, $y^2 = 1$ in both cases.

    2. $\vec{w}^*$ lies on the unit sphere

      What does this mean?

    3. $\vec{w}\cdot\vec{x_i}$

      If one were to take the dot product of a unit vector A and a second vector B of any non-zero length, the result is the length of vector B projected in the direction of vector A
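
      A quick numeric check of this (vectors made up): with a unit vector A, the dot product $A \cdot B$ equals $\|B\|\cos(\text{angle between them})$, i.e. the length of B's projection onto A.

      ```python
      import numpy as np

      a = np.array([1.0, 0.0])   # unit vector A (along the x-axis)
      b = np.array([3.0, 4.0])   # second vector B, length 5

      proj_len = np.dot(a, b)    # length of B projected onto A -> 3.0
      cos_angle = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
      print(proj_len, np.linalg.norm(b) * cos_angle)   # both print 3.0
      ```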

    4. Quiz#1: Can you draw a visualization of a Perceptron update? Quiz#2: How often can a Perceptron misclassify a point $\vec{x}$ repeatedly?
    1. $D$ (sequence of heads and tails)

      D is the sequence (i.e., the observed y's). $\theta$ is P(H).

    2. E
    3. $\theta$ as a random variable

      Let P(H) be a variable. P(D) is a constant, as the data has already occurred.

    4. derivative and equating it to zero

      At maxima and minima, the derivatives are always zero

    5. We can now solve for $\theta$ by taking the derivative and equating it to zero.

      Why?
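
      Worked version of that step for the coin example (standard MLE derivation, with $n_H$ heads and $n_T$ tails): the log-likelihood is maximized where its derivative is zero.

      ```latex
      P(D \mid \theta) \propto \theta^{n_H}(1-\theta)^{n_T}
      \;\Rightarrow\;
      \frac{\partial}{\partial\theta}\bigl[n_H\log\theta + n_T\log(1-\theta)\bigr]
        = \frac{n_H}{\theta} - \frac{n_T}{1-\theta} = 0
      \;\Rightarrow\;
      \hat{\theta}_{\mathrm{MLE}} = \frac{n_H}{n_H+n_T}
      ```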

    6. Posterior Predictive Distribution

      Doubtful about this. Refer video: https://www.youtube.com/watch?v=R9NQY2Hyl14

    7. Now, we can use the Beta distribution to model $P(\theta)$: $P(\theta)=\frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$
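
      Combining this prior with the coin likelihood above gives a Beta posterior (standard conjugacy result; $n_H$, $n_T$ as before), so the MAP estimate works out to:

      ```latex
      P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)
        \propto \theta^{n_H+\alpha-1}(1-\theta)^{n_T+\beta-1}
      \;\Rightarrow\;
      \hat{\theta}_{\mathrm{MAP}} = \frac{n_H+\alpha-1}{n_H+n_T+\alpha+\beta-2}
      ```
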
    8. $\mathcal{H}$

      $\mathcal{H}$ is the hypothesis class (i.e., the set of all possible classifiers $h(\cdot)$).

    9. MLE Principle:

      This is very important

    10. X

      What does P(X, Y) mean?

    1. $=\operatorname{argmin}_{\mathbf{w}}\ \frac{1}{n}\sum_{i=1}^n(\mathbf{x}_i^\top\mathbf{w}-y_i)^2+\lambda\|\mathbf{w}\|_2^2, \quad \lambda=\frac{\sigma^2}{n\tau^2}$

      This means we minimize both the loss and the magnitude of w, so the weights for the noisy (high-variance) features in x are shrunk toward zero (with the L2 penalty they get small but generally not exactly zero). https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a
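
      A minimal NumPy sketch of this objective on synthetic data: setting the gradient of the loss above to zero gives the closed form $\mathbf{w} = (X^\top X + n\lambda I)^{-1} X^\top \mathbf{y}$ (sizes and $\lambda$ below are made up).

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n, d, lam = 100, 5, 0.1                      # made-up sizes and regularization strength
      X = rng.normal(size=(n, d))
      w_true = rng.normal(size=d)
      y = X @ w_true + 0.1 * rng.normal(size=n)    # noisy linear data

      # Minimizer of (1/n)*||Xw - y||^2 + lam*||w||^2
      w_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
      print(w_ridge)                               # weights shrunk toward zero, not exactly zero
      ```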

    2. P(w)

      P(w): w is treated as a random variable with a Gaussian (prior) distribution.

    3. $\frac{1}{\sqrt{2\pi\sigma^2}}$

      This is the normalization constant from the formula for the Gaussian distribution.

    4. w = vector of weights [w1, w2, w3, w4, w5]; w^T = transpose of w. $\mathbf{w}^\top\mathbf{x}$ is a dot (inner) product, not a cross product, so it should be a scalar.

    5. $\operatorname{argmin}_{\mathbf{w}}\ \frac{1}{n}$

      Where did the n in the denominator come from?

    6. Linear Regression

      Need to revise this again. A lot of doubts.

    1. This gives you a good estimate of the validation error (even with standard deviation)

      Why?
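
      My understanding, sketched with sklearn on synthetic data: k-fold cross validation yields k validation errors, and their mean and standard deviation form the estimate (cross_val_score reports accuracy, so error is 1 minus it).

      ```python
      from sklearn.datasets import make_classification
      from sklearn.model_selection import cross_val_score
      from sklearn.tree import DecisionTreeClassifier

      X, y = make_classification(n_samples=500, n_features=10, random_state=0)

      # 10-fold CV: each fold is held out once, giving 10 validation errors
      errors = 1 - cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
      print(errors.mean(), errors.std())   # estimate of the validation error and its spread
      ```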

    1. Regression Trees

      I don't get these.

    2. $O(n \log n)$

      How?

    3. Decision trees are myopic

      Doubtful.

    4. Quiz: Why don't we stop if no split can improve impurity? Example: XOR

      I don't get this :(
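
      My attempt at the XOR example with a tiny made-up dataset: no single split reduces impurity (every child stays 50/50), yet a depth-2 tree separates the classes perfectly, so "no immediate improvement" would be a bad stopping rule.

      ```python
      import numpy as np

      X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
      y = np.array([0, 1, 1, 0])                    # label = x1 XOR x2

      for feature in (0, 1):
          for value in (0, 1):
              child = y[X[:, feature] == value]
              print(feature, value, child.mean())   # always 0.5: both children stay 50/50
      ```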

    5. $-\sum_k p_k \log(p_k)$

      This is the formula for entropy.
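
      Small sketch computing this quantity from class proportions (example values are my own; log base 2 here):

      ```python
      import numpy as np

      def entropy(p):
          """Entropy -sum_k p_k * log(p_k) of a vector of class proportions."""
          p = np.asarray(p, dtype=float)
          p = p[p > 0]                      # treat 0 * log(0) as 0
          return -np.sum(p * np.log2(p))

      print(entropy([0.5, 0.5]))            # 1.0 bit: maximally impure
      print(entropy([1.0, 0.0]))            # 0.0: pure leaf
      print(entropy([0.9, 0.1]))            # ~0.47
      ```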

    6. $KL$-Divergence

      What is KL-Divergence?
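
      Standard definition (not spelled out in these notes), for two distributions $p$ and $q$ over the same classes; it is non-negative and zero exactly when $p = q$:

      ```latex
      KL(p \,\|\, q) = \sum_k p_k \log\frac{p_k}{q_k} \;\ge\; 0
      ```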

    1. Rescue to the curse:

      Dimensionality reduction may give better-behaved (lower-dimensional) data.

    2. $\epsilon_{NN}$

      Doubtful about this. Shouldn't it be $P(y|x_t)P(y|x_{NN}) + P(y|x_t)P(y|x_{NN})$?

    3. How does $k$ affect the classifier? What happens if $k=n$? What if $k=1$?

      As per my project, the accuracy changes with $k$. As $k \to n$, the accuracy drops. (Refer to the project 1 report.)
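
      Rough sketch of how I'd check this with sklearn on synthetic data (not the project data): accuracy varies with $k$, and at $k = n$ the classifier just predicts the majority class.

      ```python
      from sklearn.datasets import make_classification
      from sklearn.model_selection import train_test_split
      from sklearn.neighbors import KNeighborsClassifier

      X, y = make_classification(n_samples=400, n_features=10, random_state=0)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

      for k in (1, 5, 25, 100, len(X_tr)):          # k = n amounts to a majority vote over all points
          knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
          print(k, knn.score(X_te, y_te))
      ```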

    4. −1

      This is the case when Y is neither 1 nor 0. This seems like a typing mistake here.

    1. Generalization: $\epsilon=\mathbb{E}_{(x,y)\sim P}\left[\ell(x,y|h^*(\cdot))\right]$,

      What is this? Doubtful. What is E here?
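
      My current understanding: E here is the expectation (average) of the loss over draws $(x, y)$ from the data distribution $P$, so $\epsilon$ is the expected loss of $h^*$ on unseen data; in practice it is approximated by the average loss on a large held-out test set.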

    2. i.i.d.

      independent and identically distributed data points

    3. $\mathcal{C}=\mathbb{R}$.

      What is $\mathbb{R}$ here? Do the data set and the label set have the same space?
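
      My guess: $\mathbb{R}$ is the set of real numbers, so $\mathcal{C}=\mathbb{R}$ means the labels are real-valued (regression). The feature space and the label space are different spaces in general, even if both happen to be real-valued.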