$L_p$ Optimization and Sparsity
When working with high-dimensional data, we often want to find a low-dimensional representation, to classify data points, and to predict on out-of-sample data points.
Some methods I have written blog posts about before are
- OLS
- SVD
- PCA
- LDA
Now we will talk about $L_p$ optimization! ($p \geq 1$)
$L_2$ is nice because we get the Hilbert space structure! For $p \geq 1$, every $L_p$ is a normed space; they differ only in the distance function we choose. It is like choosing which pair of glasses to put on: each norm decides which type of distance gets more weight, and each pair of glasses is sensitive to a different kind of error.
In the figure below (from Strang's book), we minimize $\|x\|_p$ subject to a linear constraint, here the line $3x + 4y = 1$. For the 2-norm, we grow circles (the level sets) from the origin until one first touches the line; that touching point is the minimizer. The point is not the same for the $L_1$ norm. $L_1$ is much less interested in balancing the $x$ and $y$ coordinates than $L_2$: it only looks at the sum of their absolute values. Intuitively, it puts all the weight on the $y$-coordinate, because its coefficient, 4, is the larger one, so the $L_1$ diamond first touches the line on the $y$-axis. This leads to a sparse solution.
$L_\infty$ is the complete opposite of $L_1$: it fully balances the two coordinates, so the solution lands where $x = y$.
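As a quick sanity check of this picture, here is a minimal sketch (assuming the constraint line is $3x + 4y = 1$, which matches the coefficient 4 above and the $1/7$ value that comes up later) that parameterizes the line and grid-searches for the point of smallest $p$-norm:

```python
import numpy as np

# Points on the constraint line 3x + 4y = 1, parameterized by their x-coordinate.
xs = np.linspace(-0.5, 0.5, 20001)
ys = (1 - 3 * xs) / 4
points = np.stack([xs, ys], axis=1)

for p in [1, 2, np.inf]:
    norms = np.linalg.norm(points, ord=p, axis=1)
    x_star, y_star = points[np.argmin(norms)]
    print(f"p = {p}: minimizer ~ ({x_star:.4f}, {y_star:.4f})")

# Expected (up to grid resolution):
#   p = 1   -> (0, 1/4)      sparse: all the weight on y
#   p = 2   -> (3/25, 4/25)  balanced according to the coefficients
#   p = inf -> (1/7, 1/7)    fully balanced, x = y
```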
**Why L1 norm?**
Let's say we have noisy data with some extreme outliers, and we want to fit a line by minimizing the residuals in some norm:
$$ \min_c \| y - cx \|_p $$
The $L_2$ norm wants to balance the error across all the points: the squared error of the outlier contributes heavily, so the fitted line gets tilted towards the outlier. If we use $L_\infty$, the correction towards the outlier is even stronger, since only the largest residual matters.
The $L_1$ norm, in contrast, does not care much about the outlier. It chooses to stay close to the majority of the points at the cost of being far from the outlier, as sketched below.
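A minimal sketch of this effect, assuming synthetic data with one planted outlier (the $L_1$ fit uses a generic derivative-free optimizer rather than a dedicated solver):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data: y ~ 2x plus small noise, with one extreme outlier.
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(scale=0.3, size=x.size)
y[-1] += 40  # plant an outlier

# L2 fit: closed-form least-squares slope (no intercept).
c_l2 = (x @ y) / (x @ x)

# L1 fit: minimize the sum of absolute residuals numerically.
c_l1 = minimize(lambda c: np.sum(np.abs(y - c * x)), x0=1.0,
                method="Nelder-Mead").x[0]

print(f"true slope 2.0, L2 slope {c_l2:.3f}, L1 slope {c_l1:.3f}")
# The L2 slope gets pulled towards the outlier; the L1 slope stays near 2.
```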
The $L_1$ problem is a bit trickier to solve, though there has been a lot of research on it in recent years. Since it is still a convex optimization problem, it is not that difficult either.
The $L_1$ solution is sparse, which is useful when you have an extremely large dataset (maybe a billion-dimensional space): you want to put an extra cost on having many non-zero parameters, and $L_1$ will find a sparse structure (see the sketch below).
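As a small illustration, here is a sketch assuming scikit-learn is available (the regularization strength `alpha=0.1` and the data sizes are arbitrary choices): an $L_1$-penalized regression zeroes out most coefficients when only a few features carry signal.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# 200 samples, 50 features, but only 3 features carry signal.
X = rng.normal(size=(200, 50))
true_coef = np.zeros(50)
true_coef[[3, 17, 42]] = [5.0, -3.0, 2.0]
y = X @ true_coef + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("non-zero OLS coefficients:  ", np.sum(np.abs(ols.coef_) > 1e-8))
print("non-zero Lasso coefficients:", np.sum(np.abs(lasso.coef_) > 1e-8))
# OLS keeps all 50 coefficients non-zero; the Lasso keeps only a handful.
```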
We can take this even further. If we take $0 < p < 1$, the optimization problem is still valid, but $\|a - b\|_p$ is no longer a distance function, because the triangle inequality fails!
$$ \|(1,1)\|_p = 2^{1/p} > 2 = \|(1,0)\|_p + \|(0,1)\|_p \quad \text{for } 0 < p < 1 $$
This means the familiar property that "the shortest distance between two points is a straight line" breaks down.
So if we want to go from $(0,0)$ to $(1,1)$, we won't travel in a straight line: it is cheaper to go to $(0,1)$ first and then to $(1,1)$. This is a very exotic space. Mathematically, we have to move to topological vector spaces, which are more general than the familiar normed spaces.
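A quick numeric check, using $p = 1/2$:

```python
import numpy as np

def lp(v, p):
    """The L_p 'norm' (sum |v_i|^p)^(1/p); not a true norm for p < 1."""
    return np.sum(np.abs(v) ** p) ** (1 / p)

p = 0.5
print(lp(np.array([1, 1]), p))                            # 2^(1/p) = 4.0
print(lp(np.array([1, 0]), p) + lp(np.array([0, 1]), p))  # 1 + 1   = 2.0
# The direct path (0,0) -> (1,1) is "longer" than the detour via (0,1).
```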
**Why are we interested?**
The smaller $p$ gets, the sparser the solution becomes: even small non-zero entries pay a huge penalty. In the limiting case, the level sets become very degenerate, i.e. the only thing that matters is having as few non-zero coefficients as possible.
In the $L_0$ case, it becomes exactly the problem of minimizing the number of non-zero components in the estimate, and the level sets are the axes. But historically this optimization has been NP-hard: there is no steepest descent you can run on such an extremely non-convex objective.
So it turns out that $L_1$ optimization gives a very good approximation to $L_0$ optimization, as the sketch below illustrates.
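Here is a minimal sketch of that idea: a toy basis-pursuit problem, $\min_x \|x\|_1$ subject to $Ax = b$, rewritten as a linear program via the split $x = u - v$ with $u, v \geq 0$ (the problem sizes here are arbitrary):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)

# Underdetermined system: 30 equations, 100 unknowns, true x has 4 non-zeros.
m, n = 30, 100
A = rng.normal(size=(m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, size=4, replace=False)] = rng.normal(size=4) * 5
b = A @ x_true

# Basis pursuit: min ||x||_1  s.t.  Ax = b, as an LP with x = u - v, u, v >= 0.
c = np.ones(2 * n)                # objective: sum(u) + sum(v) = ||x||_1
A_eq = np.hstack([A, -A])         # A(u - v) = b
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))

x_hat = res.x[:n] - res.x[n:]
print("non-zeros in recovered x:", np.sum(np.abs(x_hat) > 1e-6))
print("max recovery error:      ", np.max(np.abs(x_hat - x_true)))
# With enough measurements relative to the sparsity, the L1 minimizer
# typically coincides with the sparse x_true.
```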
The figure below shows how the norms vary with $x$ along the constraint line. For $L_1$ the minimum is at $x = 0$, and it keeps shifting to the right as $p$ increases, up to $x = 1/7$ for $L_\infty$. For $p < 1$, the penalty shoots up as soon as $x$ becomes non-zero.
Basis pursuit is pure $L_1$ minimization; LASSO minimizes a joint objective, an $L_2$ data-fit term plus an $L_1$ penalty. Since both problems are convex, we can solve them efficiently.
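For concreteness (using the usual notation $Ax = b$ for the linear model and $\lambda$ for the penalty weight), the two standard formulations are
$$ \text{Basis pursuit:} \quad \min_x \|x\|_1 \ \text{ subject to } \ Ax = b $$
$$ \text{LASSO:} \quad \min_x \|Ax - b\|_2^2 + \lambda \|x\|_1 $$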