 Oct 2019

www.cs.cornell.edu

H = {x | wᵀx + b = 0}
what does this mean?
H = {all points x such that wᵀx + b = 0}; that is the equation of the hyperplane.

SVM
An amazing and simple video on SVM: https://www.youtube.com/watch?v=1NxnPkZM9bc

The only difference is that we have the hinge loss instead of the logistic loss.
What are hinge loss and logistic loss??
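A minimal sketch to answer my own question (the function names and the sample score are mine): both losses penalize a prediction score that disagrees with the label y ∈ {−1, +1}, but hinge is exactly zero past margin 1 while logistic is smooth and never exactly zero.

```python
import math

def hinge_loss(y, score):
    # Hinge loss (used by SVMs): zero once the margin y*score reaches 1,
    # then grows linearly as the margin shrinks.
    return max(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    # Logistic loss (used by logistic regression): smooth and always positive,
    # but decays quickly for confident correct predictions.
    return math.log(1.0 + math.exp(-y * score))

# A confidently correct prediction: hinge is exactly 0, logistic is small but > 0.
print(hinge_loss(+1, 2.5))     # 0.0
print(logistic_loss(+1, 2.5))  # ~0.079
```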

If the data is low-dimensional, it is often the case that there is no separating hyperplane between the two classes.
Why ??

The slack variable ξᵢ allows the input xᵢ to be closer to the hyperplane
How ?



Mγ ≤ w⃗ ⋅ w⃗*
w⃗·w⃗* (new) = w⃗·w⃗* after M updates; w⃗·w⃗* (old) = w⃗·w⃗* before M updates. Each update grows the dot product by at least γ, so w⃗·w⃗* (new) ≥ w⃗·w⃗* (old) + Mγ, and starting from w⃗ = 0 this gives Mγ ≤ w⃗·w⃗* (new).

 Sep 2019


y² = 1
since y ∈ {−1, +1}

w⃗* lies on the unit sphere
What does this mean ??

w⃗ ⋅ x⃗ᵢ
If one were to take the dot product of a unit vector A and a second vector B of any nonzero length, the result is the length of vector B projected onto the direction of vector A.
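A tiny check of the projection fact above (vectors a and b are my toy example): for a unit vector a, the dot product a·b equals the length of b's projection onto a's direction.

```python
# a is a unit vector along the x-axis; b is an arbitrary vector of length 5.
a = [1.0, 0.0]
b = [3.0, 4.0]

# Dot product of a and b.
dot = sum(ai * bi for ai, bi in zip(a, b))
print(dot)  # 3.0 — exactly the x-component of b, i.e. its projection onto a
```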

Quiz #1: Can you draw a visualization of a Perceptron update? Quiz #2: How often can a Perceptron misclassify a point x⃗ repeatedly?
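A toy Perceptron loop to make the update concrete (the data and the no-bias setup are my assumptions; the data is linearly separable). On a mistake (y·(w⃗·x⃗) ≤ 0) the update w⃗ ← w⃗ + y·x⃗ rotates w⃗ toward classifying x⃗ correctly; the same point can be misclassified again on later passes, but the bound Mγ ≤ w⃗·w⃗* limits the total number of mistakes M.

```python
# Four separable 2-D points with labels in {-1, +1}.
data = [([1.0, 1.0], +1), ([2.0, 0.5], +1),
        ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]

w = [0.0, 0.0]      # no bias term for simplicity
mistakes = 0
converged = False
while not converged:
    converged = True
    for x, y in data:
        # A mistake is y * (w . x) <= 0 (note: <=, so w = 0 always updates).
        if y * (w[0] * x[0] + w[1] * x[1]) <= 0:
            w = [w[0] + y * x[0], w[1] + y * x[1]]  # the Perceptron update
            mistakes += 1
            converged = False
print(w, mistakes)
```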



D (sequence of heads and tails)
D is the sequence (i.e., the observed outcomes y); θ is P(H).

E

θ as a random variable
Let P(H) be a variable. P(D) is a constant as it has already occurred.

derivative and equating it to zero
At interior maxima and minima of a differentiable function, the derivative is zero.

We can now solve for θ by taking the derivative and equating it to zero.
why ?
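Why the derivative trick works here: the log-likelihood of n_H heads and n_T tails, log P(D|θ) = n_H log θ + n_T log(1−θ), is concave in θ, so its single stationary point is the maximum. Setting n_H/θ − n_T/(1−θ) = 0 gives the closed form below (the flip counts are a made-up example).

```python
# Hypothetical coin-flip data: 7 heads, 3 tails.
n_heads, n_tails = 7, 3

# MLE of theta = P(H): the stationary point of the log-likelihood.
theta_mle = n_heads / (n_heads + n_tails)
print(theta_mle)  # 0.7
```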

Posterior Predictive Distribution
Doubtful about this. Refer video: https://www.youtube.com/watch?v=R9NQY2Hyl14

Now, we can use the Beta distribution to model P(θ): P(θ) = θ^(α−1) (1−θ)^(β−1) / B(α, β)
Important! https://www.youtube.com/watch?v=v1uUgTcInQk
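A sketch of why the Beta prior is convenient (the hyperparameters α, β and the flip counts are my choices): the Beta is conjugate to the Bernoulli/Binomial likelihood, so after observing n_H heads and n_T tails the posterior is again a Beta, with the counts simply added to α and β.

```python
# Prior Beta(alpha, beta): acts like (alpha - 1) imagined heads and
# (beta - 1) imagined tails before seeing any data.
alpha, beta = 2, 2
n_heads, n_tails = 7, 3

# Conjugacy: posterior is Beta(alpha + n_H, beta + n_T).
post_alpha = alpha + n_heads   # 9
post_beta = beta + n_tails     # 5

# Posterior mean of theta — a smoothed version of the MLE n_H / (n_H + n_T).
post_mean = post_alpha / (post_alpha + post_beta)
print(post_alpha, post_beta, post_mean)  # 9 5 0.6428571428571429
```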

𝓗
𝓗 is the hypothesis class (i.e., the set of all possible classifiers h(·))

MLE Principle:
This is very important

X
What does P(X,Y) mean ?



= argmin_w (1/n) Σᵢ₌₁ⁿ (xᵢᵀw − yᵢ)² + λ‖w‖₂², where λ = σ²/(nτ²)
This means we minimize the loss plus the magnitude of w, so the weights for the noisy (high-variance) features in x are shrunk toward zero. https://towardsdatascience.com/regularizationinmachinelearning76441ddcf99a
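A one-dimensional sketch of the λ‖w‖² effect (the toy numbers are mine): minimizing (1/n) Σ (xᵢw − yᵢ)² + λw² has the closed form w = Σxᵢyᵢ / (Σxᵢ² + nλ), so a larger λ shrinks w toward zero. Note it shrinks, it does not set weights exactly to zero (exact zeros are the L1/lasso behavior).

```python
# Toy data where the unregularized fit is exactly w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def ridge_1d(xs, ys, lam):
    # Closed form from setting the derivative of the objective to zero:
    # w * (sum x^2 / n) + lam * w = (sum x*y) / n.
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

print(ridge_1d(xs, ys, 0.0))   # 2.0 — no regularization: ordinary least squares
print(ridge_1d(xs, ys, 10.0))  # shrunk well below 2, but not to zero
```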

P(w)
P(w): w is considered to be a random variable following a Gaussian distribution.

1/√(2πσ²)
This is the normalizing constant of the Gaussian distribution.

ᵀ
w = vector of weights [w1, w2, w3, w4, w5]; wᵀ = transpose of w; wᵀx is a scalar (a dot/inner product, not a cross product).

argmin_w (1/n)
Where did the n come from in the denominator?

Linear Regression
Need to revise this again. A lot of doubts.



This gives you a good estimate of the validation error (even with standard deviation)
why ??
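My understanding of why cross-validation gives a standard deviation too: each fold produces one validation-error estimate, so across k folds we get both a mean and a spread (the fold errors below are made-up numbers, not real results).

```python
import statistics

# One validation error per fold, k = 5 (hypothetical values).
fold_errors = [0.12, 0.15, 0.11, 0.14, 0.13]

# The mean is the validation-error estimate; the (sample) standard
# deviation quantifies how much it varies from fold to fold.
mean_err = statistics.mean(fold_errors)
std_err = statistics.stdev(fold_errors)
print(f"validation error ~ {mean_err:.3f} +/- {std_err:.3f}")
```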



Regression Trees
I don't get these.

O(n log n)
How?

Decision trees are myopic
Doubtful.

Quiz: Why don't we stop if no split can improve impurity? Example: XOR
I don't get this :(

−Σₖ pₖ log(pₖ)
This is the formula for entropy.
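A quick sketch of the entropy formula (the helper name and base-2 logs are my choices): entropy is highest for a uniform label distribution and zero for a pure one, which is why decision trees use it as an impurity measure.

```python
import math

def entropy(ps):
    # H(p) = -sum_k p_k * log2(p_k); terms with p_k = 0 contribute 0.
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit — maximally impure two-class split
print(entropy([1.0]))       # 0.0 — a pure leaf, nothing left to learn
```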

KL-Divergence
What is KL-Divergence?
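Answering my own question with a sketch (the function name and the toy distributions are mine): KL(p‖q) = Σ pₖ log(pₖ/qₖ) measures the expected extra "surprise" from using q when the true distribution is p; it is always ≥ 0, is 0 only when p = q, and is not symmetric.

```python
import math

def kl_divergence(ps, qs):
    # KL(p||q) = sum_k p_k * log(p_k / q_k); terms with p_k = 0 contribute 0.
    return sum(p * math.log(p / q) for p, q in zip(ps, qs) if p > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # 0.0 — identical distributions
print(kl_divergence(p, q))  # > 0, and different from KL(q||p): not symmetric
```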



Rescue to the curse:
Dimensionality reduction may give better-behaved data.

ε_NN
Doubtful about this. Shouldn't it be P(y|x_t)(1 − P(y|x_NN)) + (1 − P(y|x_t))P(y|x_NN)??

How does k affect the classifier? What happens if k = n? What if k = 1?
As per my project, the accuracy changes with k. As k approaches n, the accuracy drops. (Refer to Project 1 report.)
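A toy 1-D kNN to see the two extremes (the points, labels, and query are my made-up example): with k = n every query sees all points, so the prediction collapses to the global majority label, while k = 1 just copies the single nearest neighbor.

```python
from collections import Counter

# Five labeled points on a line: three 'a' near 0, two 'b' near 10.
points = [(0.0, 'a'), (1.0, 'a'), (2.0, 'a'), (9.0, 'b'), (10.0, 'b')]

def knn_predict(query, k):
    # Take the k points closest to the query, then majority-vote their labels.
    nearest = sorted(points, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict(9.5, k=1))            # 'b' — the single nearest neighbor wins
print(knn_predict(9.5, k=len(points)))  # 'a' — k = n gives the global majority
```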

−1
This is for the case where y is −1 instead of 0 or 1. It seems like a typing mistake here.



Generalization: ε = E_(x,y)∼P [ℓ(x, y | h*(·))]
What is this? Doubt. What is E here? (It should be the expectation: the average loss over pairs (x, y) drawn from the data distribution P.)
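A Monte Carlo sketch of what that expectation means (the toy distribution P and the classifier h below are entirely my own): the generalization error is the expected 0/1 loss over fresh draws (x, y) from P, which we can approximate by averaging over many simulated samples.

```python
import random

random.seed(0)

def sample_xy():
    # Toy distribution P: x uniform on [-1, 1], true label is the sign of x.
    x = random.uniform(-1.0, 1.0)
    y = 1 if x > 0 else -1
    return x, y

def h(x):
    # A deliberately imperfect classifier: its threshold is off by 0.1,
    # so it errs exactly when 0 < x <= 0.1 (true probability 0.05).
    return 1 if x > 0.1 else -1

# Approximate eps = E_{(x,y)~P}[ 0/1 loss of h ] by sampling.
n = 100_000
errors = 0
for _ in range(n):
    x, y = sample_xy()
    if h(x) != y:
        errors += 1
eps_hat = errors / n
print(eps_hat)  # close to the true generalization error 0.05
```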

i.i.d.
independent and identically distributed data points

𝓒 = ℝ
What is ℝ here? Do the data set and the label set have the same space?
