Big Data in Finance, Lecture 1

Basics of Machine Learning

y -> outcome (e.g. the position of a planet), x -> input (features)

ML GOAL: Predict $y$ as a function of $x$, given a training set $\{(x_i,y_i)\}$.

2 main ways:

  • Model based
  • Instance based

Model Based

  • Choose a model $y(x,\theta)$, where $\theta$ are the parameters of the model.
  • Estimate $\hat\theta$; given a new point $x^*$ not previously observed, predict $y^* = y(x^*,\hat\theta)$.
  • e.g. Regression

Instance Based

  • establish a similarity metric
  • when a new $x^*$ is given, compute the corresponding $y^*$ by averaging the $y_i$ whose $x_i$ are “close” to $x^*$
  • e.g. KNN

Supervised Learning

The training set $\{(x_i,y_i)\}$ is given; the $y_i$ are labels.

  • regression
  • KNN
  • Decision trees
  • SVM

Unsupervised Learning

Only $\{x_i\}$ is given; we want to find structure in $\{x_i\}$.

  • K-means
  • PCA
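
A quick sketch of these two tools on toy data, using scikit-learn (the blob data, cluster count, and component count are illustrative choices, not from the lecture):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# toy unlabeled data: three blobs in 5 dimensions (illustrative)
rng = np.random.default_rng(8)
centers = rng.normal(scale=5.0, size=(3, 5))
X = np.vstack([c + rng.normal(size=(50, 5)) for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # cluster structure
X2 = PCA(n_components=2).fit_transform(X)                                # low-dimensional structure
print(labels[:10], X2.shape)
```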

Batch Learning

train the model once on the full dataset

Online Learning

keep updating the model as new data arrives
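
One way to make the "keep updating" idea concrete: a minimal sketch of an online update for a linear model using stochastic gradient steps (SGD is just one example of an online scheme; the learning rate and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
beta_true = np.array([1.0, -0.5])
beta_hat = np.zeros(2)           # current model parameters
lr = 0.05                        # learning rate (illustrative)

# data arrives one observation at a time; update the model after each point
for _ in range(5000):
    x = rng.normal(size=2)
    y = x @ beta_true + 0.1 * rng.normal()
    error = y - x @ beta_hat                 # prediction error on the new point
    beta_hat += lr * error * x               # gradient step on the squared error

print(beta_hat)  # drifts toward beta_true as data keeps arriving
```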

Least Squares

$$
y(x,\beta) = x^T\beta, \qquad x \in \mathbb{R}^n, \qquad y_i \approx x_i^T\beta
$$

$$
\hat\beta = \arg\min_\beta \frac{1}{2}\sum_{i=1}^N |y_i - x_i^T\beta|^2 = \arg\min_\beta \frac{1}{2}\|y - X\beta\|^2
$$

$$
\Rightarrow \frac{\partial L(\beta)}{\partial\beta} = X^TX\beta - X^Ty = 0
\quad\Rightarrow\quad \hat\beta = (X^TX)^{-1}X^Ty
$$

where $X$ is the $N \times n$ design matrix with rows $x_i^T$ and $y = (y_1,\dots,y_N)^T$.
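
A minimal numpy sketch of this closed-form solution on synthetic data (the coefficients and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic training set: N observations, n features
N, n = 200, 3
X = rng.normal(size=(N, n))              # design matrix, rows are x_i^T
beta_true = np.array([1.0, -2.0, 0.5])   # "true" coefficients (illustrative)
y = X @ beta_true + 0.1 * rng.normal(size=N)

# least-squares estimate: beta_hat = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over forming the explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to beta_true
```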

KNN : $y(x) = \frac{1}{k}\sum_{x_i\in N_k(x)} y_i$, where $N_k(x)$ is the set of the $k$ training points closest to $x$.
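
A minimal sketch of this averaging rule, assuming Euclidean distance as the similarity metric (data and $k$ are illustrative):

```python
import numpy as np

def knn_predict(x_star, X_train, y_train, k=5):
    """Average the y_i of the k training points closest to x_star."""
    dists = np.linalg.norm(X_train - x_star, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of N_k(x_star)
    return y_train[nearest].mean()

# usage on toy data
rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(100, 2))
y_train = X_train[:, 0] ** 2 + 0.05 * rng.normal(size=100)
print(knn_predict(np.array([0.3, -0.2]), X_train, y_train, k=5))
```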

How KNN is related to the conditional expectation $E[Y \mid X = x]$:

  1. the expectation is replaced by an empirical mean
  2. conditioning on the point $x$ is relaxed to conditioning on the $k$ points closest to $x$

One can show that

$$
\lim_{k, N \to \infty} \frac{1}{k}\sum_{x_i\in N_k(x)} y_i = E[Y \mid X = x],
\qquad \text{provided } \frac{k}{N} \to 0.
$$

Model: $Y = f(X) + \epsilon$.

Curse of Dimensionality

Big data are usually defined in a very high-dimensional space. For $N$ points uniformly distributed in the unit ball in $\mathbb{R}^p$, one can show that the median distance $d$ from the origin to the closest data point satisfies

$$
P(\|x_i\| > d,\ \forall i) = (1 - d^p)^N = \frac{1}{2}
\quad\Rightarrow\quad
d(p, N) = \Big(1 - \big(\tfrac{1}{2}\big)^{1/N}\Big)^{1/p},
$$

so in high dimension the nearest data point ends up far from the origin, close to the boundary of the ball.
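
A quick numerical check of this formula (the choices of $p$ and $N$ are illustrative):

```python
import numpy as np

def median_nearest_distance(p, N):
    """Median distance from the origin to the closest of N uniform points in the unit p-ball."""
    return (1 - 0.5 ** (1 / N)) ** (1 / p)

# as p grows, the nearest neighbour drifts toward the boundary of the ball
for p in (1, 2, 10, 100):
    print(p, round(median_nearest_distance(p, N=500), 3))
```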

$\hat\beta$ is an unbiased estimator of the population coefficient $\beta = E[X^TX]^{-1}E[X^TY]$:

$$
y = X\beta + \epsilon
$$
$$
\hat\beta = (X^TX)^{-1}X^T(X\beta + \epsilon) = \beta + (X^TX)^{-1}X^T\epsilon
$$
$$
E_T[\hat\beta] = \beta \quad \text{(since } E[\epsilon \mid X] = 0\text{)},
$$

where $E_T$ denotes the expectation over the training set.
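
A small Monte Carlo check of the unbiasedness claim (the data-generating parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 50, 2
beta = np.array([2.0, -1.0])
X = rng.normal(size=(N, n))          # keep the design fixed across training sets

estimates = []
for _ in range(5000):                # draw many training sets y = X beta + eps
    eps = rng.normal(scale=0.5, size=N)
    y = X @ beta + eps
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.mean(estimates, axis=0))    # E_T[beta_hat] is close to beta = [2, -1]
```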

Bias-Variance Decomposition

$$
f(x) = E[Y \mid X = x]
$$
$\Rightarrow f$ minimizes the expected squared error $E[(Y - f(X))^2]$.
Let $y(x)$ be my model for $f(x)$. Then
$$
E[(Y - y(X))^2] = E[(Y - f(X) + f(X) - y(X))^2] = E[(Y - f(X))^2] + E[(f(X) - y(X))^2],
$$
since the cross term is zero: by the tower property of expectation, $E[(Y - f(X))(f(X) - y(X))] = E\big[(f(X) - y(X))\,E[Y - f(X) \mid X]\big] = 0$.
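
A small simulation of this decomposition (the true $f$, the noise level, and the candidate model $y$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

f = lambda x: np.sin(2 * np.pi * x)          # true regression function f(x) = E[Y | X = x]
y_model = lambda x: 2 * x - 1                # some (misspecified) model y(x)

X = rng.uniform(0, 1, size=200_000)
Y = f(X) + 0.3 * rng.normal(size=X.size)     # Y = f(X) + eps

total = np.mean((Y - y_model(X)) ** 2)
noise = np.mean((Y - f(X)) ** 2)             # irreducible error E[(Y - f(X))^2]
model = np.mean((f(X) - y_model(X)) ** 2)    # model error E[(f(X) - y(X))^2]
print(total, noise + model)                  # the two should agree closely
```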

Penalization Methods

Ridge Regression

$$
\hat\beta = \arg\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|^2
$$
$$
\hat\beta = (X^TX + \lambda I)^{-1}X^Ty
$$
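
A numpy sketch of the ridge closed form (the toy data and $\lambda$ values are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam * I)^{-1} X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# toy data (illustrative)
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -1.0, 0.0, 0.5]) + 0.2 * rng.normal(size=100)

print(ridge_fit(X, y, lam=0.0))   # lam = 0 recovers ordinary least squares
print(ridge_fit(X, y, lam=10.0))  # larger lam shrinks the coefficients toward zero
```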

LASSO Regression

$$
\hat\beta = \arg\min_\beta \|y - X\beta\|^2 + \lambda\sum_i|\beta_i|
$$

Unlike ridge, the LASSO has no closed-form solution; it is solved numerically (e.g. by coordinate descent) and tends to set some coefficients exactly to zero.
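
A minimal coordinate-descent sketch for this objective using soft-thresholding (a didactic illustration, not a production solver; the toy data and $\lambda$ are illustrative):

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for ||y - X b||^2 + lam * sum_j |b_j|."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # residual excluding feature j's current contribution
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / (X[:, j] @ X[:, j])
    return beta

# toy data with a sparse true coefficient vector (illustrative)
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
y = X @ np.array([3.0, 0, 0, -2.0, 0, 0, 0, 0]) + 0.1 * rng.normal(size=200)
print(np.round(lasso_cd(X, y, lam=20.0), 3))   # most entries come out exactly 0
```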

Cross Validation

  • fix $\lambda$
  • divide the training set $T$ into $K$ groups of equal size
  • $\forall$ group $j$: train on the remaining $K-1$ groups and compute $E^\lambda_j$ (the mean squared error on group $j$ under $\lambda$)
  • $CV(\lambda) = \frac{1}{K}\sum_j E^\lambda_j$
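
A minimal sketch of this K-fold procedure, applied here to the ridge estimator above (the fold count and the grid of $\lambda$ values are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def cv_error(X, y, lam, K=5):
    """K-fold cross-validation error CV(lambda) for ridge regression."""
    folds = np.array_split(np.random.default_rng(6).permutation(len(y)), K)
    errors = []
    for j in range(K):
        test = folds[j]
        train = np.concatenate([folds[i] for i in range(K) if i != j])
        beta = ridge_fit(X[train], y[train], lam)                 # fit on the other K-1 groups
        errors.append(np.mean((y[test] - X[test] @ beta) ** 2))   # E_j^lambda
    return np.mean(errors)                                        # CV(lambda)

# toy data and a grid of lambdas (illustrative)
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 10))
y = X @ np.concatenate([np.ones(3), np.zeros(7)]) + 0.5 * rng.normal(size=150)
for lam in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(lam, round(cv_error(X, y, lam), 4))
```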

A big $\lambda$ will give high bias (heavy shrinkage).
A small $\lambda$ will give high variance (little regularization).
Cross-validation picks the $\lambda$ that balances the two.
