Theorem: Law of Iterated Expectations

If $E(|y|) < \infty$, then for any random variable $\textbf{x}$: $$E(E(y|\textbf{x})) = E(y)$$
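
A quick Monte Carlo check (a sketch; the toy model $y = 2 + 3x + $ noise and the use of NumPy are illustrative choices, not from the notes):

```python
# Monte Carlo illustration of the Law of Iterated Expectations: E(E(y|x)) = E(y).
# Assumed toy model: x ~ N(0,1), y | x ~ N(2 + 3x, 1), so E(y|x) = 2 + 3x and E(y) = 2.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)

cef = 2 + 3 * x                # E(y|x), known analytically in this toy model
print(np.mean(cef))            # ~ 2.0, i.e. E(E(y|x))
print(np.mean(y))              # ~ 2.0, i.e. E(y)
```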

General Law of Iterated Expectations

If $E(|y|) < \infty$, then for any random variables $\textbf{x}_1$, $\textbf{x}_2$: $$ E(E(y|\textbf{x}_1, \textbf{x}_2)|\textbf{x}_1) = E(y|\textbf{x}_1)$$
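
To see the general law concretely, here is a small sketch with binary regressors, where conditional expectations are just group means (the binary design and the coefficients are illustrative assumptions):

```python
# General LIE check: E( E(y|x1,x2) | x1 ) should equal E(y|x1).
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x1 = rng.integers(0, 2, size=n)
x2 = rng.integers(0, 2, size=n)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)

# E(y|x1,x2): group means over the four (x1, x2) cells
inner = np.zeros(n)
for a in (0, 1):
    for b in (0, 1):
        cell = (x1 == a) & (x2 == b)
        inner[cell] = y[cell].mean()

# Average the inner expectation within each value of x1 and compare with E(y|x1)
for a in (0, 1):
    print(inner[x1 == a].mean(), y[x1 == a].mean())   # the two numbers nearly coincide
```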

Conditioning Theorem

If $E(|g(\textbf{x})y|) < \infty$, then

$$ E(g(\textbf{x})y|\textbf{x}) = g(\textbf{x}) E(y|\textbf{x})$$

and

$$ E(g(\textbf{x})y) = E(g(\textbf{x}) E(y|\textbf{x}))$$
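
A numerical sketch of the second identity, with an arbitrary choice $g(x) = x^2$ and the same toy model as above:

```python
# Conditioning Theorem check: E(g(x) y) = E(g(x) E(y|x)) for g(x) = x**2.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)     # E(y|x) = 2 + 3x in this toy model

g = x ** 2
print(np.mean(g * y))                  # E(g(x) y)
print(np.mean(g * (2 + 3 * x)))        # E(g(x) E(y|x)) -- nearly equal
```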

CEF Error

The CEF error $e$ is the difference between $y$ and the CEF $m(\textbf{x})$: $e = y - m(\textbf{x})$

$E(e|\textbf{x}) = E(y-m(\textbf{x}) | \textbf{x}) = E(y|\textbf{x}) - m(\textbf{x}) = 0$

The Law of Iterated Expectations shows more: $E(e) = E(E(e|\textbf{x})) = E(0) = 0$

For any function $h(\textbf{x})$ with $E(|h(\textbf{x})e|) < \infty$: $E(h(\textbf{x})e) = 0$
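
A quick check of this orthogonality property, using the same illustrative toy model and a few arbitrary functions $h(\textbf{x})$:

```python
# The CEF error e = y - m(x) is orthogonal to any function h(x) of the regressors.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)
e = y - (2 + 3 * x)                    # CEF error for the toy model m(x) = 2 + 3x

for h in (x, x ** 2, np.sin(x)):       # arbitrary functions of x
    print(np.mean(h * e))              # all approximately 0
```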

A predictor $g(\textbf{x})$ is the CEF if and only if its error $e_g = y - g(\textbf{x})$ satisfies $E(e_g|\textbf{x}) = 0$

Example (Intercept Model): here $m(\textbf{x})$ is the constant $E(y) = \mu$

Variance of the CEF Error

If we did not observe $\textbf{x}$, the best prediction of $y$ would be the constant $E(y)$. Observing $\textbf{x}$ gives us much more information: $m(\textbf{x})$ is a function of $\textbf{x}$, so we can predict how $y$ behaves for different values of $\textbf{x}$.

How can we measure how much extra information $\textbf{x}$ provides? By computing the variance of the error: low variance means $\textbf{x}$ carries a lot of information, high variance means it carries little. The error variance measures the variation in $y$ that is not explained by the conditional mean $E(y|\textbf{x})$.

$Var(e) = E((e - E(e))^2) = E(e^2)$. The error variance depends on which regressors we condition on. For example, consider two models:

$y = E(y|x_1) + e_1$

$y = E(y|x_1, x_2) + e_2$

$\implies$ in general $\sigma_1^2 = Var(e_1) \neq Var(e_2) = \sigma_2^2$

Theorem: $Var(y) \geq Var(y - E(y|x_1)) \geq Var(y - E(y|x_1, x_2))$ (more information $\implies$ weakly smaller error variance)
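
A sketch of this ordering, with an assumed additive model $y = x_1 + x_2 + $ noise so that both conditional means are known exactly:

```python
# More conditioning information weakly lowers the prediction-error variance.
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + rng.normal(size=n)

print(np.var(y))                       # ~ 3 : Var(y)
print(np.var(y - x1))                  # ~ 2 : Var(e1), since E(y|x1) = x1
print(np.var(y - (x1 + x2)))           # ~ 1 : Var(e2), since E(y|x1,x2) = x1 + x2
```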

Example

  • Suppose $z = (x, y)'$ is jointly normal with zero mean $\mu = (0, 0)'$ and covariance matrix
$$ \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} $$

The CEF of y given x is $E(y|x) = m(x) = \rho x $

The variance of CEF error is $Var(e) = 1 - \rho^2$
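
These two facts are easy to verify by simulation (the value $\rho = 0.6$ is an arbitrary illustrative choice):

```python
# Bivariate normal example: E(y|x) = rho * x and Var(e) = 1 - rho**2.
import numpy as np

rng = np.random.default_rng(5)
rho = 0.6
n = 1_000_000
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
x, y = z[:, 0], z[:, 1]

e = y - rho * x                        # CEF error, since m(x) = rho * x
print(np.mean(e), np.var(e))           # ~ 0 and ~ 1 - rho**2 = 0.64
```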

Best Predictor

The best predictor $g(\textbf{x})$ is the one that minimizes the MSE $= E((y - g(\textbf{x}))^2)$

The CEF $m(x)$, regardless of the joint distribution of $(y, \textbf{x})$, minimizes the MSE!
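
A small sketch comparing the MSE of the CEF with two alternative predictors, under the same illustrative toy model:

```python
# The CEF m(x) attains the smallest mean squared prediction error.
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)     # toy model with m(x) = 2 + 3x

predictors = {
    "CEF m(x) = 2 + 3x": 2 + 3 * x,
    "misspecified 2 + 2.5x": 2 + 2.5 * x,
    "constant E(y) = 2": np.full(n, 2.0),
}
for name, g in predictors.items():
    print(name, np.mean((y - g) ** 2)) # the CEF gives the smallest MSE (~ 1)
```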

The conditional variance of $y$ given $\textbf{x}$ is $$ \sigma^2(y|\textbf{x}) = \sigma^2(\textbf{x}) = Var(y|\textbf{x}) = E((y - E(y|\textbf{x}))^2|\textbf{x}) = E(e^2|\textbf{x})$$

For the above example, if the correlation $\rho$ is 1, the conditional variance of $y$ given $x$ is $1 - \rho^2 = 0$!

The conditional variance measures how much variation is left in $y$ after we condition on $\textbf{x}$.

The unconditional variance of the error is the average of the conditional variance: $\sigma^2 = E(e^2) = E(E(e^2|\textbf{x})) = E(\sigma^2(\textbf{x}))$

Any multivariate rv $z = (y, x)$ can be decomposed as: $$ y = m(x) + \sigma(x) \epsilon $$

Where $m(\textbf{x}) = E(y|\textbf{x})$, $\epsilon = \frac{e}{\sigma(\textbf{x})}$, $E(\epsilon|\textbf{x}) = 0$, $Var(\epsilon|\textbf{x}) = 1$

In practice, the conditional variance is often ignored.

If $\sigma(\textbf{x}) = \sigma$ is constant, the error is homoskedastic. If the volatility of the error is not constant, it is heteroskedastic.
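
A sketch of the decomposition $y = m(\textbf{x}) + \sigma(\textbf{x})\epsilon$ with a heteroskedastic error; the choice $\sigma(x) = e^{x/2}$ is purely illustrative:

```python
# Heteroskedasticity: the error variance changes with x, and the unconditional
# error variance equals the average conditional variance E(sigma(x)^2).
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
x = rng.normal(size=n)
eps = rng.normal(size=n)               # E(eps|x) = 0, Var(eps|x) = 1
m = 2 + 3 * x                          # CEF (toy model)
sigma = np.exp(x / 2)                  # conditional standard deviation, depends on x
y = m + sigma * eps

e = y - m                              # CEF error
print(np.var(e[x < 0]), np.var(e[x >= 0]))   # clearly different: heteroskedastic
print(np.var(e), np.mean(sigma ** 2))        # unconditional variance = E(sigma(x)^2)
```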

CEF Derivative

How does the CEF vary with a small change in $\textbf{x}$? The marginal effect of $x_1$ is

$$ \Delta_1 m(\textbf{x}) = \frac{\partial}{\partial x_1} m(x_1, ..., x_k)$$

Note that this derivative does not measure the change in $y$ itself, but the change in the conditional expectation of $y$.
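
A tiny numerical illustration with an assumed nonlinear CEF $m(x_1, x_2) = x_1^2 + x_2$, where the marginal effect of $x_1$ is $2x_1$:

```python
# The marginal effect of x1 is the partial derivative of the CEF, not of y itself.
def m(x1, x2):
    return x1 ** 2 + x2                # assumed CEF, purely illustrative

x1, x2, h = 1.5, 0.7, 1e-6
numeric = (m(x1 + h, x2) - m(x1 - h, x2)) / (2 * h)   # central finite difference
print(numeric, 2 * x1)                 # both ~ 3.0: marginal effect of x1 at x1 = 1.5
```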

Summary: $m(\textbf{x})$ minimizes the MSE; $E(e|\textbf{x}) = 0$; $E(e) = 0$; $y = m(\textbf{x}) + \sigma(\textbf{x}) \epsilon$

The CEF can be non-linear. So the next step is to understand how to obtain a linear regression, i.e. the best linear predictor (BLP).

For the CEF, the error satisfies $E(e|\textbf{x}) = 0$, which implies $E(e|x_j) = 0$ for each regressor $x_j$

$E(e|\textbf{x}) = 0$ also implies $E(\textbf{x}e) = 0$, which is the condition satisfied by the BLP error

So the CEF is more powerful than the BLP, and both are examples of moment estimators.