M data observations, each in $\mathbb{R}^N$, i.e. $x^m \in \mathbb{R}^N$, $m = 1, \ldots, M$.

We know that the observations come from two types of points: A-points and B-points.

Both types are normally distributed, with different variance-covariance matrices.

We have to classify each point as type A or type B, i.e. $t'_m \in \{A, B\}$.

Minimize the weighted misclassification probability:

$$\min \; \alpha^A\,\mathbb{P}(t' = A \mid t = B) + \alpha^B\,\mathbb{P}(t' = B \mid t = A)$$

So we want to find two half-spaces separated by a hyperplane: one side is type A and the other is type B. I.e., find $v \in \mathbb{R}^N$ and $c \in \mathbb{R}$ such that

  • $v^T x^m < c \implies t'_m = A$
  • $v^T x^m > c \implies t'_m = B$
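A minimal sketch of this decision rule (the helper name and the numbers below are made up for illustration):

```python
import numpy as np

def classify(x, v, c):
    """Assign a point to A or B by which side of the
    hyperplane v^T x = c it falls on (hypothetical helper)."""
    return "A" if v @ x < c else "B"

v = np.array([1.0, -1.0])
c = 0.0
print(classify(np.array([0.0, 2.0]), v, c))  # v^T x = -2 < c = 0, so "A"
```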

For the 1-D case, the probability of misclassifying an A-point is $1 - \Phi\left(\frac{c-\mu_A}{\sigma_A}\right)$, where $\Phi$ is the standard normal CDF.

Similarly, the probability of misclassifying a B-point is $\Phi\left(\frac{c-\mu_B}{\sigma_B}\right)$.

We want to choose $c$ to minimize the (weighted) sum of these probabilities.
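As a sketch, the equal-weight 1-D objective can be evaluated with the standard normal CDF (here via `math.erf`; all numbers are illustrative):

```python
from math import erf, sqrt

def Phi(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def misclassification(c, mu_A, sigma_A, mu_B, sigma_B):
    """Equal-weight objective P(t'=B | t=A) + P(t'=A | t=B),
    with rule t'=A iff x < c (assumes mu_A < mu_B)."""
    p_A_as_B = 1.0 - Phi((c - mu_A) / sigma_A)  # A-point lands right of c
    p_B_as_A = Phi((c - mu_B) / sigma_B)        # B-point lands left of c
    return p_A_as_B + p_B_as_A

# Symmetric classes: midpoint threshold, each error equals 1 - Phi(1)
print(misclassification(1.0, 0.0, 1.0, 2.0, 1.0))  # ≈ 0.3173
```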

Proof

Without loss of generality, take $v = 1$. The threshold is then

$c = \frac{\sigma^2_B}{\sigma^2_A + \sigma^2_B} \mu_A + \frac{\sigma^2_A}{\sigma^2_A + \sigma^2_B} \mu_B$
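A small sketch of this threshold formula (the numbers are illustrative):

```python
def threshold(mu_A, sigma_A, mu_B, sigma_B):
    """Variance-weighted threshold from the formula above."""
    w_A = sigma_B**2 / (sigma_A**2 + sigma_B**2)
    w_B = sigma_A**2 / (sigma_A**2 + sigma_B**2)
    return w_A * mu_A + w_B * mu_B

# Equal variances -> midpoint of the means
print(threshold(0.0, 1.0, 4.0, 1.0))  # -> 2.0
```

Note that the weight on $\mu_A$ involves $\sigma_B^2$ and vice versa, so the threshold sits closer to the mean of the lower-variance class.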

For the multivariate case, the separating direction is

$v = (\Sigma_A+\Sigma_B)^{-1} (\mu_B - \mu_A)$
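A sketch of computing this direction with NumPy (the class statistics below are made up):

```python
import numpy as np

# Hypothetical class statistics
mu_A = np.array([0.0, 0.0])
mu_B = np.array([3.0, 1.0])
Sigma_A = np.array([[1.0, 0.2], [0.2, 1.0]])
Sigma_B = np.array([[2.0, 0.0], [0.0, 0.5]])

# Normal direction of the separating hyperplane:
# solve (Sigma_A + Sigma_B) v = (mu_B - mu_A)
v = np.linalg.solve(Sigma_A + Sigma_B, mu_B - mu_A)

# The two class means fall on opposite sides of some threshold c
print(v @ mu_A, v @ mu_B)
```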

For linear discriminant analysis (LDA) we generally assume $\Sigma_A = \Sigma_B$. For this case we also take $\alpha^A = \alpha^B = 1$, i.e. the same weight on A- and B-misclassifications.

We generally go for Quadratic Discriminant Analysis (QDA) when $\Sigma_A \neq \Sigma_B$; the decision boundary is then quadratic rather than linear.
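A minimal QDA sketch, assuming equal class priors: classify by comparing the Gaussian log-densities with class-specific covariances (the shared normalizing constant is dropped; all statistics below are illustrative):

```python
import numpy as np

def qda_log_density(x, mu, Sigma):
    """Log of the multivariate normal density, up to a shared constant."""
    d = x - mu
    return -0.5 * (np.log(np.linalg.det(Sigma)) + d @ np.linalg.solve(Sigma, d))

def qda_classify(x, mu_A, Sigma_A, mu_B, Sigma_B):
    # Pick the class whose Gaussian assigns higher density to x
    if qda_log_density(x, mu_A, Sigma_A) >= qda_log_density(x, mu_B, Sigma_B):
        return "A"
    return "B"

x = np.array([0.0, 0.0])
print(qda_classify(x, np.zeros(2), np.eye(2), np.array([4.0, 0.0]), 2 * np.eye(2)))
```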

Nonlinear Classification Problem

For nonlinear data, LDA is not a suitable method, as we cannot find a single hyperplane that separates the classes.

Here, machine learning comes into play! In the linear case, we handle more complex problems by increasing the dimensionality. Machine learning instead does a simple thing over and over again.

Let's consider a nonlinear approximation $$ (Ax + b)_+.$$ Here the nonlinear operation is $(x)_+ = \max(x, 0)$, the ReLU operator, applied elementwise to the vector; we then sum the components by multiplying with the all-ones vector, $F(x) = \mathbf{1}^T (Ax + b)_+$. These functions are not so nice: they are piecewise linear and continuous, but not continuously differentiable.
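A sketch of this building block in NumPy (the matrix and vectors are chosen arbitrarily):

```python
import numpy as np

def relu_feature(x, A, b):
    """F(x) = 1^T (A x + b)_+ : elementwise ReLU, then sum of components."""
    return np.maximum(A @ x + b, 0.0).sum()

# Small illustrative example; F is piecewise linear and continuous in x
A = np.array([[1.0, -1.0], [0.5, 2.0]])
b = np.array([0.0, -1.0])
print(relu_feature(np.array([1.0, 1.0]), A, b))  # -> 1.5
```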

For different $A$ and $b$, we will get very different level sets. So, by iterating this operation again and again, we can generate rich continuous functions! Piecewise linearity and continuity are kept, regardless of $N$. Is it continuously differentiable? No: the ReLU kinks remain.
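Iterating the layer, as a sketch (random parameters, just for illustration — each pass applies the same simple operation, and the composition stays continuous and piecewise linear):

```python
import numpy as np

def layer(x, A, b):
    # One piecewise-linear, continuous building block
    return np.maximum(A @ x + b, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
params = [(rng.normal(size=(3, 3)), rng.normal(size=3)) for _ in range(4)]

# Compose four layers: the level sets become far more complicated
# than a single hyperplane's, yet continuity is preserved.
y = x
for A, b in params:
    y = layer(y, A, b)
print(y)
```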

Basically, this is what neural network optimization does: it fits the $A$ and $b$ of each layer to the data!