Let’s first review the rules for expectation, variance, and covariance so I won’t have to re-derive them throughout my notes:
First, we define the term “expectation.” The expectation is basically the average value of a random variable, which we can calculate both continuously and discretely:
E[X] = ∫ x p(x) dx (continuous), E[X] = Σₓ x p(x) (discrete)
where p(x) is the probability density function in the continuous case, or the probability mass function (the probability that X takes the value x) in the discrete case.
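To make this concrete, here’s a minimal numpy sketch; the fair die and the uniform distribution are just examples I picked, not anything from the lecture:

```python
import numpy as np

# Discrete case: a fair six-sided die, E[X] = Σₓ x p(x).
x = np.arange(1, 7)        # possible values 1..6
p = np.full(6, 1 / 6)      # each value has probability 1/6
print(np.sum(x * p))       # 3.5

# Continuous case, approximated by Monte Carlo sampling:
# for X ~ Uniform(0, 1), E[X] = ∫ x p(x) dx = 0.5.
samples = np.random.default_rng(0).uniform(0, 1, size=100_000)
print(samples.mean())      # ≈ 0.5
```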
We can also define variance and covariance.
Variance averages the squared distance between X and its mean.
Var[X]=E[(X−E[X])²]
Covariance measures how X and Y move together: a positive covariance means they tend to rise together, while a negative covariance means one tends to fall as the other rises. It’s important to note that covariance is scale-dependent, so its magnitude can’t be compared across multiple different situations!
Cov[X,Y]=E[(X−E[X])(Y−E[Y])]
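Here’s a small sketch of both definitions computed directly from samples; the particular distributions and the factor of 2 are just assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=100_000)
y = 2 * x + rng.normal(0, 1, size=100_000)   # Y moves with X, plus noise

# Var[X] = E[(X − E[X])²], straight from the definition:
print(np.mean((x - x.mean()) ** 2))              # ≈ 1

# Cov[X, Y] = E[(X − E[X])(Y − E[Y])]:
print(np.mean((x - x.mean()) * (y - y.mean())))  # ≈ 2
print(np.cov(x, y, ddof=0)[0, 1])                # same thing via numpy
```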
Here are some important algebra rules (there’s a quick numerical spot-check of a few of them after this list). Although not too relevant in this lecture, they’ll definitely come up more later:
Linearity of expectations: E[aX+bY]=aE[X]+bE[Y]
Variance identity: Var[X]=E[X2]−(E[X])2
Covariance identity: Cov[X,Y]=E[XY]−E[X]E[Y]
Covariance symmetry: Cov[X,Y]=Cov[Y,X]
Variance is covariance with itself: Cov[X,X]=Var[X]
Variance is not linear: Var[aX+b]=a²Var[X]
Covariance is not linear: Cov[aX+b, Y]=aCov[X,Y] (the constant shift b drops out)
Variance of a sum: Var[X+Y]=Var[X]+Var[Y]+2Cov[X,Y]
Law of total expectation (Note: This looks fancy but it’s actually really intuitive; if you average the conditional averages of X over every outcome of Y, you just get the overall average of X): E[X]=E[E[X∣Y]]
Independence implies zero covariance: If X and Y are independent, then Cov[X,Y]=0, but the reverse is not true!
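Here’s that quick numerical spot-check of a few of these rules; the exponential distributions and the constants a, b are arbitrary choices on my part:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(2.0, size=1_000_000)
y = rng.exponential(2.0, size=1_000_000)   # generated independently of x
a, b = 3.0, 5.0

# Var[aX + b] = a² Var[X]  (the +b shift drops out)
print(np.var(a * x + b), a ** 2 * np.var(x))

# Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y]
cov_xy = np.cov(x, y, ddof=0)[0, 1]
print(np.var(x + y), np.var(x) + np.var(y) + 2 * cov_xy)

# Independence implies Cov[X, Y] = 0 (only approximately here,
# since we're estimating from a finite sample):
print(cov_xy)
```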
Law of large numbers (if you repeat something many times, the average result gets closer to the mean). More formally: given independent, identically distributed random variables X₁, X₂, X₃, …, Xₙ with expected value E[X], as n→∞:
(1/n) Σᵢ₌₁ⁿ Xᵢ → E[X]
Central limit theorem (when you average many independent variables, the distribution of the average is approximately normal, like a bell curve): Let X₁, X₂, …, Xₙ be independent, identically distributed random variables with mean μ and variance σ². As n→∞:
(X̄ₙ − E[X]) / √(Var[X]/n) ≈ N(0, 1)
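Here’s a little simulation of both theorems using die rolls (my choice of example; any i.i.d. variable would do):

```python
import numpy as np

rng = np.random.default_rng(2)

# Law of large numbers: the running average of die rolls drifts toward 3.5.
rolls = rng.integers(1, 7, size=100_000)
running_mean = np.cumsum(rolls) / np.arange(1, rolls.size + 1)
print(running_mean[[9, 999, 99_999]])   # closes in on 3.5 as n grows

# Central limit theorem: standardized sample means look like N(0, 1),
# even though a single die roll is nothing like a bell curve.
n, trials = 1_000, 10_000
means = rng.integers(1, 7, size=(trials, n)).mean(axis=1)
z = (means - 3.5) / np.sqrt(np.var(rolls) / n)
print(z.mean(), z.std())                # ≈ 0 and ≈ 1
```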
Statistic: A statistic is a function of the data and the data alone.
Estimator: A statistic that guesses at the parameter or a function of it, written as θ̂
Best prediction given random Y
We can write the squared error for a given prediction as (Y−m)², where Y is a random, real value and m is our prediction. The MSE, or mean squared error, is the expected value of this: MSE(m)=E[(Y−m)²].
We can also introduce the following formulas for any random variable Z:
Var(Z)=E[(Z−E[Z])²]
and:
E[Z²]=(E[Z])²+Var(Z)
Combining these two formulas results in something called the bias-variance decomposition, which we can apply to our formula for the MSE:
E[(Y−m)²]=(E[Y−m])²+Var(Y−m)
We can also rewrite: E[Y−m]=E[Y]−m, by linearity of expectation (m is just a constant)!
Also: Var(Y−m)=Var(Y), because subtracting a constant does not change variance.
So, we can rewrite the MSE as:
MSE(m)=(E[Y]−m)²+Var(Y)
Since Var(Y) doesn’t depend on our prediction m, we’re really just trying to minimize the (E[Y]−m)² term. We can turn to calculus: take the derivative and set it equal to zero to find the minimum of our function.
We can use the chain rule:
d/dm (E[Y]−m)² = −2(E[Y]−m)
Now we set it equal to zero:
−2(E[Y]−m)=0⟹E[Y]=m
This means that the best single number prediction of a random variable under squared error loss is just the mean! (This is decently intuitive, but it’s cool to go out and derive it yourself).
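We can also verify this empirically with a brute-force search; the skewed gamma distribution here is just an arbitrary test case:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.gamma(shape=2.0, scale=3.0, size=100_000)   # a skewed Y with E[Y] = 6

# Brute force: try a grid of candidate predictions m, measure each one's MSE.
candidates = np.linspace(0, 12, 1_001)
mse = np.array([np.mean((y - m) ** 2) for m in candidates])

print(candidates[np.argmin(mse)])   # ≈ 6, i.e. ≈ E[Y]
print(y.mean())                     # the sample mean agrees
```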
Predicting one random variable from another
Let’s say we observe X and want to predict Y. If X=x, we predict m(x).
The law of total expectation states that:
E[(Y−m(X))²]=E[E[(Y−m(X))²∣X]]
Now we restrict m(X)=β₀+β₁X so that we can find the optimal linear predictor (just for now)! So:
MSE(β₀,β₁)=E[(Y−(β₀+β₁X))²]
We can multiply this out and distribute the expectation:
MSE(β₀,β₁)=E[Y²]−2β₀E[Y]−2β₁E[XY]+β₀²+2β₀β₁E[X]+β₁²E[X²]
Let’s find the values of β₀ and β₁ that minimize this function. We can use the same derivative trick we did earlier to find a minimum. We need partial derivatives this time since we’re dealing with two variables, β₀ and β₁. Starting with β₀, we drop all terms that don’t contain β₀, since their partial derivatives are zero:
∂MSE/∂β₀ = ∂/∂β₀ (β₀² − 2β₀E[Y] + 2β₀β₁E[X])
I’m going to skip over some routine calculus steps here. This is equal to:
2β₀ − 2E[Y] + 2β₁E[X]
Setting this equal to 0, we can then solve and get β₀=E[Y]−β₁E[X].
Now we take the partial derivative with respect to β₁, set it equal to 0, and with some careful algebra get E[XY]=β₀E[X]+β₁E[X²].
Substituting β₀=E[Y]−β₁E[X] into this gives E[XY]−E[X]E[Y]=β₁(E[X²]−(E[X])²), which by our covariance and variance identities means:
β₁ = Cov[X,Y] / Var[X], and therefore β₀ = E[Y] − β₁E[X]
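Here’s a sketch that computes these plug-in coefficients from simulated data and checks them against ordinary least squares; the true slope of 1.5 and intercept of 4 are made-up test values:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(10, 2, size=100_000)
y = 1.5 * x + 4 + rng.normal(0, 3, size=100_000)   # noisy linear relationship

# Plug-in versions of the optimal coefficients we just derived:
beta1 = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)                # ≈ 4 and ≈ 1.5

# Least squares lands on the same line:
print(np.polyfit(x, y, deg=1))     # [slope, intercept] ≈ [1.5, 4]
```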
Here are some important notes about the equations we just derived:
The optimal intercept β₀ makes the line go through the mean Y value at the mean X value. That is, β₀+β₁E[X]=E[Y], so the line passes through the point (E[X], E[Y]). We can easily prove this by substituting E[X] for X in our equations.
We should sanity check our full equation by making sure that the units “balance” on both sides of the equation.
Only the variance and the covariance impact the β₁ term, meaning the actual expectations don’t matter! β₁ will stay the same regardless of whether you shift X or Y by a constant (there’s a quick numerical check of this after these notes).
We must also keep in mind some important morals (although a bit self-explanatory): 1. This linear approximation might not even be good! 2. We made no assumptions about the noise or fluctuations around the line.
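As promised, a quick numerical check of that shift-invariance claim (the distributions and constants are again arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0, 1, size=100_000)
y = 2 * x + 7 + rng.normal(0, 1, size=100_000)

def slope(x, y):
    # β₁ = Cov[X, Y] / Var[X]
    return np.cov(x, y, ddof=0)[0, 1] / np.var(x)

# Shifting X and/or Y by constants leaves the slope untouched...
print(slope(x, y), slope(x + 100, y - 3))   # both ≈ 2

# ...but rescaling does change it, which is the same reason covariance
# can't be compared across differently-scaled problems.
print(slope(10 * x, y))                     # ≈ 0.2
```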