"Advanced Data Analysis from an Elementary Point of View" by Cosma Shalizi
(This page summarizes my notes about this great course. The whole course is available online, so you will probably prefer to refer to it directly.)
1.1 Statistics, Data Analysis, Regression
1.2 Guessing the Value of a Random Variable
Use the mean squared error to measure how badly we do when guessing the value of Y with a constant a:
MSE(a) = E[(Y − a)²]
MSE(a) = (E[Y − a])² + V[Y − a]
MSE(a) = (E[Y] − a)² + V[Y]
The first term vanishes when a = E[Y]: the expected value is the best constant guess under squared error.
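A quick numerical check of this decomposition (a minimal sketch; the distribution, with E[Y] = 3 and V[Y] = 4, and the grid of candidate guesses are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=100_000)  # sample from Y, E[Y]=3, V[Y]=4

def mse(a):
    """Empirical mean squared error of the constant guess a."""
    return np.mean((y - a) ** 2)

# Scan candidate guesses: the minimum sits near a = E[Y] = 3,
# with minimal MSE near V[Y] = 4.
grid = np.linspace(0, 6, 601)
best = grid[np.argmin([mse(a) for a in grid])]
print(best, mse(best))
```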
1.2.1 Estimating the Expected Value
If the (yi) are iid, the law of large numbers says that the sample mean converges to E[Y], and the central limit theorem indicates how fast the convergence is (the squared error is about V[Y] / n).
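The V[Y] / n rate is easy to see by simulation (a sketch with a made-up distribution, V[Y] = 4):

```python
import numpy as np

rng = np.random.default_rng(1)
var_y = 4.0  # V[Y] of the simulated distribution below

# For each sample size n, average the squared error of the sample mean
# over many replications; it shrinks like V[Y] / n.
errs = {}
for n in (10, 100, 1000):
    means = rng.normal(0.0, np.sqrt(var_y), size=(5000, n)).mean(axis=1)
    errs[n] = np.mean(means ** 2)
print(errs)  # roughly {10: 0.4, 100: 0.04, 1000: 0.004}
```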
1.3 The Regression Function
Use X (predictor, independent variable, covariate, or input) to predict Y (dependent variable, output, or response). How badly are we doing when using f(X) to predict Y?
MSE(f(X)) = E[(Y − f(X))²]
Use the law of total expectation (E[U] = E[E[U | V]]):
MSE(f(X)) = E[E[(Y − f(X))² | X]]
MSE(f(X)) = E[V[Y | X] + (E[Y − f(X) | X])²]
Regression function: r(x) = E[Y | X = x]. Only the second term depends on f, and it is minimized by taking f(x) = E[Y | X = x], so the regression function is the MSE-optimal predictor.
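A simulated illustration (the model y = x² + noise and the rival predictors are made up): the regression function r(x) = x² achieves a smaller empirical MSE than other candidates, and its MSE is about the noise variance E[V[Y | X]] = 0.25.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=200_000)
y = x ** 2 + rng.normal(0, 0.5, size=x.size)  # here r(x) = E[Y | X = x] = x^2

def emp_mse(f):
    """Empirical MSE of the predictor f(X)."""
    return np.mean((y - f(x)) ** 2)

mse_r = emp_mse(lambda t: t ** 2)                          # regression function
mse_abs = emp_mse(lambda t: np.abs(t))                     # a rival predictor
mse_const = emp_mse(lambda t: np.full_like(t, y.mean()))   # best constant guess
print(mse_r, mse_abs, mse_const)  # mse_r is the smallest, close to 0.25
```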
1.3.1 Some Disclaimers
Usually we write Y = r(X) + η(X), i.e. η (a noise variable with mean 0 and variance σ²(x)) depends on X in general: nothing guarantees that the noise is additive, identically distributed, or independent of X.
1.4 Estimating the Regression Function
Use conditional sample means: r̂(x) = average of the y_i for which x_i = x.
This works only when X is discrete: for a continuous X, few (or no) observations fall exactly at any given x.
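The conditional-sample-means estimator in a couple of lines (a sketch; the discrete model with true r(x) = 2x is made up):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.integers(0, 5, size=10_000)          # discrete predictor
y = 2.0 * x + rng.normal(0, 1, size=x.size)  # true r(x) = 2x

# Estimate r(x) by the sample mean of the y_i observed at each value of x.
r_hat = {v: y[x == v].mean() for v in np.unique(x)}
print(r_hat)  # values close to 0, 2, 4, 6, 8
```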
1.4.1 The Bias-Variance Tradeoff
In fact, we have analyzed MSE(r̂_n(x)), where r̂_n is a random regression function estimated using n random pairs (x_i, y_i).
Even if our method is unbiased (E[r̂_n(x)] = r(x), no approximation bias), we can still have a lot of variance in our estimates (V[r̂_n(x)] large).
A method is consistent (for r) when both the approximation bias and the estimation variance go to 0 as we get more and more data.
1.4.2 The Bias-Variance Trade-Off in Action
1.4.3 Ordinary Least Squares Linear Regression as Smoothing
Assume X is one-dimensional and both X and Y are centered. Choose to approximate r(x) by α + βx. We need to find the values a and b of α and β minimizing the MSE.
MSE(α, β) = E[(Y − α − βX)²]
MSE(α, β) = E[E[(Y − α − βX)² | X]]
MSE(α, β) = E[V[Y | X] + (E[Y − α − βX | X])²]
MSE(α, β) = E[V[Y | X]] + E[(E[Y − α − βX | X])²]
Minimizing gives b = Cov[X, Y] / V[X] and a = E[Y] − b E[X] = 0 (the variables are centered). Now, estimate a and b from the data (replacing population values by sample values, or equivalently minimizing the residual sum of squares): b̂ = (Σ_i x_i y_i) / (Σ_i x_i²).
Least-squares linear regression is thus a smoothing of the data: the prediction at x is r̂(x) = b̂ x = Σ_i y_i (x_i x) / (n s_X²), where s_X² = (1/n) Σ_i x_i² is the sample variance of X.
Indeed, the prediction is a weighted average of the observed values yi, where the weights are proportional to how far xi is from the center of the data, relative to the variance, and proportional to the magnitude of x.
Note that the weight of a data point depends on how far it is from the center of all the data, not how far it is from the point at which we are trying to predict.
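The algebraic identity above is easy to verify numerically (a sketch; the simulated data and the evaluation point x₀ = 0.7 are made up): the usual slope-times-x prediction coincides with the weighted average of the y_i.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0, 1, size=500)
x = x - x.mean()                             # center X
y = 1.5 * x + rng.normal(0, 1, size=x.size)
y = y - y.mean()                             # center Y

b_hat = np.sum(x * y) / np.sum(x ** 2)       # least-squares slope

x0 = 0.7
pred_ols = b_hat * x0
# The same prediction written as a weighted average of the observed y_i,
# with weights x_i * x0 / (n * s_X^2) = x_i * x0 / sum(x_i^2):
weights = x * x0 / np.sum(x ** 2)
pred_smooth = np.sum(weights * y)
print(pred_ols, pred_smooth)  # identical up to floating-point rounding
```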
1.5 Linear Smoothers
A linear smoother predicts with a weighted combination of the observations: r̂(x) = Σ_i ŵ(x_i, x) y_i.
Ordinary linear regression (centered, one-dimensional case): ŵ(x_i, x) = x_i x / (n s_X²).
1.5.1 k-Nearest-Neighbor Regression
ŵ(x_i, x) = 1/k if x_i is one of the k nearest neighbors of x, 0 otherwise.
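In code, this is just "average the y_i of the k closest training points" (a minimal sketch with made-up data):

```python
import numpy as np

def knn_predict(x0, x, y, k):
    """k-nearest-neighbor regression: average the y_i of the k
    training points whose x_i are closest to x0."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
y = np.array([0.0, 2.0, 4.0, 6.0, 20.0])
print(knn_predict(1.4, x, y, k=2))  # neighbors at x=1 and x=2 -> (2+4)/2 = 3.0
```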
1.5.2 Kernel Smoothers
For instance use ŵ(x_i, x) = K((x_i − x) / h) / Σ_j K((x_j − x) / h), where h is the bandwidth, so that observations near x receive almost all of the weight.
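A sketch of such a kernel smoother, taking K to be a Gaussian kernel (my choice here; the sine-curve data, bandwidth, and evaluation point are made up):

```python
import numpy as np

def kernel_smooth(x0, x, y, h):
    """Kernel-weighted average of the y_i at x0, Gaussian kernel, bandwidth h."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=2000)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)
print(kernel_smooth(0.25, x, y, h=0.05))  # near sin(pi/2) = 1
```

Smaller h means less bias but more variance, which is the bias-variance trade-off from section 1.4.1 again.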
What minimizes the mean absolute error?
MAE(a) = E[ | Y − a | ]
Using Leibniz's rule for differentiation under the integral sign:
dMAE/da = P(Y ≤ a) − P(Y > a)
Setting the derivative to zero gives P(Y ≤ a) = P(Y > a) = 1/2: the median minimizes the MAE.
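A numerical check on a skewed distribution, where the mean and the median differ (an exponential sample and a made-up grid of guesses; the exponential's median is ln 2 ≈ 0.693 while its mean is 1):

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.exponential(scale=1.0, size=100_000)  # skewed: mean != median

def mae(a):
    """Empirical mean absolute error of the constant guess a."""
    return np.mean(np.abs(y - a))

grid = np.linspace(0, 3, 3001)
best = grid[np.argmin([mae(a) for a in grid])]
print(best, np.median(y), y.mean())  # best matches the median, not the mean
```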