$\mathcal{H} = \{\mathbf{x} \mid \mathbf{w}^\top\mathbf{x} + b = 0\}$
What does this mean?
$\mathcal{H}$ = {all points $\mathbf{x}$ such that $\mathbf{w}^\top\mathbf{x} + b = 0$}; that is the equation of the hyperplane.
SVM
An amazing and simple video on SVM: https://www.youtube.com/watch?v=1NxnPkZM9bc
The only difference is that we have the hinge-loss instead of the logistic loss.
What are hinge loss and logistic loss?
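For reference, the standard definitions (with labels $y_i \in \{-1,+1\}$; these formulas are the usual forms, not quoted from the notes):

$\ell_{\text{hinge}}(\mathbf{x}_i, y_i) = \max\big(0,\; 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$
$\ell_{\text{logistic}}(\mathbf{x}_i, y_i) = \log\big(1 + e^{-y_i(\mathbf{w}^\top\mathbf{x}_i + b)}\big)$

The hinge loss is exactly zero once a point is on the correct side with margin at least 1, while the logistic loss is always positive but decays smoothly.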
If the data is low dimensional it is often the case that there is no separating hyperplane between the two classes.
Why ??
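A small sketch of one standard counterexample (my own illustration, not from the notes): the XOR pattern in 2D has no separating line, but one extra feature makes it separable.

```python
# Minimal sketch: XOR is not linearly separable in 2D, but adding the
# product feature x1*x2 makes it separable by the hyperplane x3 = 0.
import numpy as np

X = np.array([[+1, +1], [-1, -1], [+1, -1], [-1, +1]], dtype=float)
y = np.array([+1, +1, -1, -1])   # same-sign corners vs. mixed-sign corners

X3 = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])  # append x1*x2 as a 3rd feature
w, b = np.array([0.0, 0.0, 1.0]), 0.0                    # hyperplane x3 = 0
print(np.sign(X3 @ w + b) == y)  # all True: perfectly separated in 3D
```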
The slack variable $\xi_i$ allows the input $\mathbf{x}_i$ to be closer to the hyperplane
How ?
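One way to see it (standard soft-margin formulation, not quoted from the notes): the constraint with slack is

$y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0.$

With $\xi_i = 0$ the point must lie on or outside the margin; with $0 < \xi_i \le 1$ it may sit inside the margin; with $\xi_i > 1$ it may even be misclassified. The objective pays a penalty proportional to $\xi_i$, so large slacks are discouraged.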
$M\gamma \le \vec{w}\cdot\vec{w}^*$
Let $(\vec{w}\cdot\vec{w}^*)_{\text{new}}$ be the value after the $M$ updates and $(\vec{w}\cdot\vec{w}^*)_{\text{old}}$ the value before them. Each update adds at least $\gamma$, so $(\vec{w}\cdot\vec{w}^*)_{\text{new}} \ge (\vec{w}\cdot\vec{w}^*)_{\text{old}} + M\gamma$; starting from $\vec{w}=\vec{0}$, this gives $M\gamma \le (\vec{w}\cdot\vec{w}^*)_{\text{new}}$.
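The single-update step behind this (standard convergence argument, assuming margin $\gamma$, i.e. $y(\vec{x}\cdot\vec{w}^*) \ge \gamma$ for every training point):

$\vec{w}_{\text{new}}\cdot\vec{w}^* = (\vec{w} + y\vec{x})\cdot\vec{w}^* = \vec{w}\cdot\vec{w}^* + y(\vec{x}\cdot\vec{w}^*) \ge \vec{w}\cdot\vec{w}^* + \gamma.$

Each update therefore adds at least $\gamma$ to $\vec{w}\cdot\vec{w}^*$, so after $M$ updates (starting from $\vec{w}=\vec{0}$) we get $\vec{w}\cdot\vec{w}^* \ge M\gamma$.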
$y^2 = 1$
as $y \in \{-1, +1\}$
$\vec{w}^*$ lies on the unit sphere
What does this mean? (That $\|\vec{w}^*\| = 1$, i.e., $\vec{w}^*$ is a unit vector.)
$\vec{w}\cdot\vec{x}_i$
If one were to take the dot product of a unit vector A and a second vector B of any non-zero length, the result is the length of vector B projected in the direction of vector A
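In symbols (standard identity): $\vec{a}\cdot\vec{b} = \|\vec{a}\|\,\|\vec{b}\|\cos\theta$, so if $\|\vec{a}\| = 1$ then $\vec{a}\cdot\vec{b} = \|\vec{b}\|\cos\theta$, the length of $\vec{b}$'s projection onto the direction of $\vec{a}$.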
Quiz#1: Can you draw a visualization of a Perceptron update? Quiz#2: How often can a Perceptron misclassify a point $\vec{x}$ repeatedly?
Doubts: 1) http://www.nbertagnolli.com/jekyll/update/2015/08/27/Perceptron_Vis.html
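A minimal perceptron sketch for the update rule in Quiz#1 (my own illustration, assuming labels in $\{-1,+1\}$ and a bias absorbed into $\mathbf{w}$ via a constant feature):

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Perceptron training: on each mistake, update w <- w + y * x."""
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # misclassified (or exactly on the hyperplane)
                w += y_i * x_i         # the perceptron update
                mistakes += 1
        if mistakes == 0:              # a full pass with no mistakes: converged
            return w
    return w
```

Note that the same point can trigger updates on several passes; the convergence bound only limits the total number of updates.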
$D$ (sequence of heads and tails)
$D$ is the sequence, i.e., the observed outcomes $y$; $\theta$ is $P(H)$.
E
$\theta$ as a random variable
Let $P(H)$ be a variable; $P(D)$ is a constant, as the data has already occurred.
derivative and equating it to zero
At maxima and minima, the derivatives are always zero
We can now solve for $\theta$ by taking the derivative and equating it to zero.
why ?
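Worked out for the coin example (standard MLE steps, with $n_H$ heads and $n_T$ tails; up to an additive constant that does not depend on $\theta$):

$\log P(D\mid\theta) = n_H\log\theta + n_T\log(1-\theta)$
$\frac{\partial}{\partial\theta}\log P(D\mid\theta) = \frac{n_H}{\theta} - \frac{n_T}{1-\theta} = 0 \;\Rightarrow\; \hat{\theta}_{MLE} = \frac{n_H}{n_H + n_T}.$

Setting the derivative to zero finds the stationary point, which here is the maximum of the log-likelihood.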
Posterior Predictive Distribution
Doubtful about this. Refer video: https://www.youtube.com/watch?v=R9NQY2Hyl14
Now, we can use the Beta distribution to model $P(\theta)$: $P(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$
Important shit! https://www.youtube.com/watch?v=v1uUgTcInQk
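What the Beta prior buys us (standard conjugacy facts, not quoted from the notes): with $n_H$ heads and $n_T$ tails,

$P(\theta\mid D) \propto P(D\mid\theta)\,P(\theta) \propto \theta^{n_H+\alpha-1}(1-\theta)^{n_T+\beta-1},$

i.e. the posterior is again a Beta distribution, $\text{Beta}(n_H+\alpha,\, n_T+\beta)$. The posterior predictive probability of heads is its mean:

$P(\text{heads}\mid D) = \int_0^1 \theta\,P(\theta\mid D)\,d\theta = \frac{n_H+\alpha}{n_H+n_T+\alpha+\beta}.$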
$\mathcal{H}$
$\mathcal{H}$ is the hypothesis class (i.e., the set of all possible classifiers $h(\cdot)$).
MLE Principle:
This is very important
X
What does P(X,Y) mean ?
$= \operatorname{argmin}_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i^\top\mathbf{w} - y_i)^2 + \lambda\|\mathbf{w}\|_2^2, \qquad \lambda = \frac{\sigma^2}{n\tau^2}$
This means we minimize the loss and the magnitude of w, so the weights for the noisy (high-variance) features in x are shrunk toward zero. https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a
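A minimal closed-form sketch of the minimizer above (my own code, not from the notes; the $\frac{1}{n}$ convention in the formula puts an extra factor of $n$ next to $\lambda$):

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize (1/n) * sum_i (x_i^T w - y_i)^2 + lam * ||w||_2^2 in closed form."""
    n, d = X.shape
    # Setting the gradient to zero gives (X^T X + n * lam * I) w = X^T y.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
```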
P(w)
P(w): w is treated as a random variable with a Gaussian (prior) distribution.
$\frac{1}{\sqrt{2\pi\sigma^2}}$
This is the normalization constant of the Gaussian distribution.
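For completeness, the full univariate Gaussian density this constant belongs to:

$\mathcal{N}(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$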
⊤
w = vector of weights [w1, w2, w3, w4, w5]; w^T = transpose of w; w^T * x is a scalar (a dot/inner product, not a cross product).
$\operatorname{argmin}_{\mathbf{w}} \frac{1}{n}$
Where did the $n$ in the denominator come from? (It averages the squared error over the $n$ training points; dividing by $n$ does not change the minimizer.)
Linear Regression
Need to revise this again. A lot of doubts.
This gives you a good estimate of the validation error (even with standard deviation)
why ??
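A minimal sketch, assuming this line refers to k-fold cross-validation (my own code; train_and_eval is a placeholder callback that trains on the training split and returns a validation error):

```python
import numpy as np

def kfold_errors(X, y, train_and_eval, k=5, seed=0):
    """Return mean and standard deviation of the k per-fold validation errors."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_eval(X[train], y[train], X[val], y[val]))
    return np.mean(errors), np.std(errors)
```

Each fold yields one error estimate, which is why both a mean and a standard deviation of the validation error are available.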
Regression Trees
I don't get these.
$O(n \log n)$
How? (Presumably because finding the best split of a feature requires sorting its $n$ values, which costs $O(n \log n)$.)
Decision trees are myopic
Doubtful.
Quiz: Why don't we stop if no split can improve impurity? Example: XOR
I don't get this :(
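A worked version of the XOR example (standard illustration, with four points, two per class): the root has entropy $-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$ bit. Splitting on $x_1$ alone sends one point of each class to each child, so both children are still 50/50 and still have entropy 1 bit: the split gains nothing, and the same holds for splitting on $x_2$ alone. Yet splitting on $x_1$ and then on $x_2$ yields four pure leaves. So a rule of "stop when no single split improves impurity" would quit at the root and never find the perfect two-level tree.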
$-\sum_k p_k \log(p_k)$
This is the formula for entropy.
KL-Divergence
What is KL-Divergence?
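The standard definition (for discrete distributions $p$ and $q$ over the same classes):

$KL(p\,\|\,q) = \sum_k p_k \log\frac{p_k}{q_k},$

which is always $\ge 0$ and equals $0$ exactly when $p = q$; it measures how far the distribution $q$ is from $p$.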
Rescue to the curse:
Dimensionality reduction may give better-behaved data.
$\epsilon_{NN}$
Doubtful about this. Shouldn't it be $P(y|x_t)\,(1 - P(y|x_{NN})) + (1 - P(y|x_t))\,P(y|x_{NN})$?
How does $k$ affect the classifier? What happens if $k=n$? What if $k=1$?
As per my project, the accuracy changes with $k$: as $k \to n$, the accuracy drops. (Refer to the project 1 report.)
−1
This means when Y is not 1 or 0. This seems like a typing mistake here.
Generalization: $\epsilon = \mathbb{E}_{(x,y)\sim P}\left[\ell(x,y|h^*(\cdot))\right]$
What is this? Doubt. What is $\mathbb{E}$ here? (The expectation over draws of $(x,y)$ from the data distribution $P$.)
i.i.d.
independent and identically distributed data points
$\mathcal{C}=\mathbb{R}$.
What is $\mathbb{R}$ here? Do the data set and the label set have the same space? ($\mathbb{R}$ is the set of real numbers, i.e., the label space for regression.)