Big Data in Finance, Lecture 1

Basics of Machine Learning

y -> outcome (e.g. the position of a planet), x -> input (features)

ML GOAL: Predict $y$ as a function of $x$, given a training set $\{(x_i,y_i)\}$.

2 main ways:

  • Model based
  • Instance based

Model Based

  • Choose a model $y(x,\theta)$, where $\theta$ are the parameters of the model.
  • Estimate $\hat\theta$; given a new point $x^*$ not previously observed, predict $y^* = y(x^*,\hat\theta)$.
  • e.g. Regression

Instance Based

  • establish a similarity metric
  • when a new $x^*$ is given, compute the corresponding $y^*$ by averaging the $y_i$ whose $x_i$ are “close” to $x^*$
  • e.g. KNN

Supervised Learning

The training set $\{(x_i,y_i)\}$ is given; the $y_i$ are labels.

  • regression
  • KNN
  • Decision trees
  • SVM

Unsupervised Learning

Only $\{x_i\}$ is given; we want to find structure in $\{x_i\}$.

  • K-means
  • PCA
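
A quick sketch of these two tools on toy data, using scikit-learn (the blob data, cluster count, and component count are illustrative choices, not from the lecture):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# toy unlabeled data: three blobs in 5 dimensions (illustrative)
rng = np.random.default_rng(8)
centers = rng.normal(scale=5.0, size=(3, 5))
X = np.vstack([c + rng.normal(size=(50, 5)) for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # cluster structure
X2 = PCA(n_components=2).fit_transform(X)                                # low-dimensional structure
print(labels[:10], X2.shape)
```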

Batch Learning

train the model once on the full dataset

Online Learning

keep updating the model as new data arrives
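
One way to make the "keep updating" idea concrete: a minimal sketch of an online update for a linear model using stochastic gradient steps (SGD is just one example of an online scheme; the learning rate and data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
beta_true = np.array([1.0, -0.5])
beta_hat = np.zeros(2)           # current model parameters
lr = 0.05                        # learning rate (illustrative)

# data arrives one observation at a time; update the model after each point
for _ in range(5000):
    x = rng.normal(size=2)
    y = x @ beta_true + 0.1 * rng.normal()
    error = y - x @ beta_hat                 # prediction error on the new point
    beta_hat += lr * error * x               # gradient step on the squared error

print(beta_hat)  # drifts toward beta_true as data keeps arriving
```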

Least Squares

$$
y(x,\beta) = x^T\beta, \qquad x \in \mathbb{R}^n, \qquad y_i \approx x_i^T\beta
$$

$$
\hat\beta = \arg\min_\beta \frac{1}{2}\sum_{i=1}^N |y_i - x_i^T\beta|^2 = \arg\min_\beta \frac{1}{2}\|y - X\beta\|^2
$$

$$
\Rightarrow \frac{\partial L(\beta)}{\partial\beta} = X^TX\beta - X^Ty = 0
\quad\Rightarrow\quad \hat\beta = (X^TX)^{-1}X^Ty
$$

where $X$ is the $N \times n$ design matrix with rows $x_i^T$ and $y = (y_1,\dots,y_N)^T$.
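
A minimal numpy sketch of this closed-form solution on synthetic data (the coefficients and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic training set: N observations, n features
N, n = 200, 3
X = rng.normal(size=(N, n))              # design matrix, rows are x_i^T
beta_true = np.array([1.0, -2.0, 0.5])   # "true" coefficients (illustrative)
y = X @ beta_true + 0.1 * rng.normal(size=N)

# least-squares estimate: beta_hat = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over forming the explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to beta_true
```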

KNN : $y(x) = \frac{1}{k}\sum_{x_i\in N_k(x)} y_i$, where $N_k(x)$ is the set of the $k$ training points closest to $x$.
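
A minimal sketch of this averaging rule, assuming Euclidean distance as the similarity metric (data and $k$ are illustrative):

```python
import numpy as np

def knn_predict(x_star, X_train, y_train, k=5):
    """Average the y_i of the k training points closest to x_star."""
    dists = np.linalg.norm(X_train - x_star, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of N_k(x_star)
    return y_train[nearest].mean()

# usage on toy data
rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(100, 2))
y_train = X_train[:, 0] ** 2 + 0.05 * rng.normal(size=100)
print(knn_predict(np.array([0.3, -0.2]), X_train, y_train, k=5))
```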

How KNN is related to the conditional expectation $E[Y \mid X = x]$:

  1. the expectation is replaced by an empirical mean
  2. conditioning on the point $x$ is relaxed to conditioning on the $k$ points closest to $x$

One can show that

$$
\lim_{k, N \to \infty} \frac{1}{k}\sum_{x_i\in N_k(x)} y_i = E[Y \mid X = x],
\qquad \text{provided } \frac{k}{N} \to 0.
$$

Model: $Y = f(X) + \epsilon$.

Curse of Dimensionality

Big data are usually defined in a very high-dimensional space. For $N$ points uniformly distributed in the unit ball in $\mathbb{R}^p$, one can show that the median distance $d$ from the origin to the closest data point satisfies

$$
P(\|x_i\| > d,\ \forall i) = (1 - d^p)^N = \frac{1}{2}
\quad\Rightarrow\quad
d(p, N) = \Big(1 - \big(\tfrac{1}{2}\big)^{1/N}\Big)^{1/p},
$$

so in high dimension the nearest data point ends up far from the origin, close to the boundary of the ball.
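
A quick numerical check of this formula (the choices of $p$ and $N$ are illustrative):

```python
import numpy as np

def median_nearest_distance(p, N):
    """Median distance from the origin to the closest of N uniform points in the unit p-ball."""
    return (1 - 0.5 ** (1 / N)) ** (1 / p)

# as p grows, the nearest neighbour drifts toward the boundary of the ball
for p in (1, 2, 10, 100):
    print(p, round(median_nearest_distance(p, N=500), 3))
```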

$\hat\beta$ is an unbiased estimator of the population coefficient $\beta = E[X^TX]^{-1}E[X^TY]$:

$$
y = X\beta + \epsilon
$$
$$
\hat\beta = (X^TX)^{-1}X^T(X\beta + \epsilon) = \beta + (X^TX)^{-1}X^T\epsilon
$$
$$
E_T[\hat\beta] = \beta \quad \text{(since } E[\epsilon \mid X] = 0\text{)},
$$

where $E_T$ denotes the expectation over the training set.
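
A small Monte Carlo check of the unbiasedness claim (the data-generating parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 50, 2
beta = np.array([2.0, -1.0])
X = rng.normal(size=(N, n))          # keep the design fixed across training sets

estimates = []
for _ in range(5000):                # draw many training sets y = X beta + eps
    eps = rng.normal(scale=0.5, size=N)
    y = X @ beta + eps
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.mean(estimates, axis=0))    # E_T[beta_hat] is close to beta = [2, -1]
```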

Bias-Variance Decomposition

$$
f(x) = E[Y \mid X = x]
$$
$\Rightarrow f$ minimizes the expected squared error $E[(Y - f(X))^2]$.
Let $y(x)$ be my model for $f(x)$. Then
$$
E[(Y - y(X))^2] = E[(Y - f(X) + f(X) - y(X))^2] = E[(Y - f(X))^2] + E[(f(X) - y(X))^2],
$$
since the cross term is zero: by the tower property of expectation, $E[(Y - f(X))(f(X) - y(X))] = E\big[(f(X) - y(X))\,E[Y - f(X) \mid X]\big] = 0$.
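
A small simulation of this decomposition (the true $f$, the noise level, and the candidate model $y$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

f = lambda x: np.sin(2 * np.pi * x)          # true regression function f(x) = E[Y | X = x]
y_model = lambda x: 2 * x - 1                # some (misspecified) model y(x)

X = rng.uniform(0, 1, size=200_000)
Y = f(X) + 0.3 * rng.normal(size=X.size)     # Y = f(X) + eps

total = np.mean((Y - y_model(X)) ** 2)
noise = np.mean((Y - f(X)) ** 2)             # irreducible error E[(Y - f(X))^2]
model = np.mean((f(X) - y_model(X)) ** 2)    # model error E[(f(X) - y(X))^2]
print(total, noise + model)                  # the two should agree closely
```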

Penalization Methods

Ridge Regression

$$
\hat\beta = \arg\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|^2
$$
$$
\hat\beta = (X^TX + \lambda I)^{-1}X^Ty
$$
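
A numpy sketch of the ridge closed form (the toy data and $\lambda$ values are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam * I)^{-1} X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# toy data (illustrative)
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -1.0, 0.0, 0.5]) + 0.2 * rng.normal(size=100)

print(ridge_fit(X, y, lam=0.0))   # lam = 0 recovers ordinary least squares
print(ridge_fit(X, y, lam=10.0))  # larger lam shrinks the coefficients toward zero
```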

LASSO Regression

$$
\hat\beta = \arg\min_\beta \|y - X\beta\|^2 + \lambda\sum_i|\beta_i|
$$

Unlike ridge, the LASSO has no closed-form solution; it is solved numerically (e.g. by coordinate descent) and tends to set some coefficients exactly to zero.
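
A minimal coordinate-descent sketch for this objective using soft-thresholding (a didactic illustration, not a production solver; the toy data and $\lambda$ are illustrative):

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for ||y - X b||^2 + lam * sum_j |b_j|."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # residual excluding feature j's current contribution
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / (X[:, j] @ X[:, j])
    return beta

# toy data with a sparse true coefficient vector (illustrative)
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))
y = X @ np.array([3.0, 0, 0, -2.0, 0, 0, 0, 0]) + 0.1 * rng.normal(size=200)
print(np.round(lasso_cd(X, y, lam=20.0), 3))   # most entries come out exactly 0
```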

Cross Validation

  • fix $\lambda$
  • divide the training set $T$ into $K$ groups of equal size
  • $\forall$ group $j$: train on the remaining $K-1$ groups and compute $E^\lambda_j$ (the mean squared error on group $j$ under $\lambda$)
  • $CV(\lambda) = \frac{1}{K}\sum_j E^\lambda_j$
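
A minimal sketch of this K-fold procedure, applied here to the ridge estimator above (the fold count and the grid of $\lambda$ values are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def cv_error(X, y, lam, K=5):
    """K-fold cross-validation error CV(lambda) for ridge regression."""
    folds = np.array_split(np.random.default_rng(6).permutation(len(y)), K)
    errors = []
    for j in range(K):
        test = folds[j]
        train = np.concatenate([folds[i] for i in range(K) if i != j])
        beta = ridge_fit(X[train], y[train], lam)                 # fit on the other K-1 groups
        errors.append(np.mean((y[test] - X[test] @ beta) ** 2))   # E_j^lambda
    return np.mean(errors)                                        # CV(lambda)

# toy data and a grid of lambdas (illustrative)
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 10))
y = X @ np.concatenate([np.ones(3), np.zeros(7)]) + 0.5 * rng.normal(size=150)
for lam in (0.01, 0.1, 1.0, 10.0, 100.0):
    print(lam, round(cv_error(X, y, lam), 4))
```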

A big $\lambda$ will give high bias (heavy shrinkage).
A small $\lambda$ will give high variance (little regularization).
Cross-validation picks the $\lambda$ that balances the two.
