Related Post: Statistical Implication of Parameter Estimation and Standard Error
This time, I will cover the "Linear Regression Model". I know there are a bunch of great resources on Linear Regression itself, so I will focus more on some of the significant implications of Linear Regression, and some of the theoretical background behind methods we might have used without a deeper understanding. Some of the questions answered will be:
- What does Linear Regression imply in terms of conditional prediction? (Why does it work better than mean estimation?)
- What are the differences between just showing a high correlation and doing a linear regression?
- What impacts the accuracy of a linear regression prediction (or its standard error)?
- Why do we use the log-scale? Is it just because the numbers are so large?
Last time, we identified that:
- "Modeling" implies finding some kind of function that can describe and predict "real-world patterns".
- Since it is impossible to get the precise value of the "true pattern", we use an "estimator" (e.g. the sample mean as a mean estimator, to estimate the mean of the population data).
- The "Standard Error" tells us about the confidence interval of the "true pattern" - e.g. even if we somehow identified the "true mean" of all elephants, if the standard deviation is too large, the mean would not help predict future values "with confidence".
- So, knowing the standard deviation is important, but since we cannot figure out the "population standard deviation (std of the true pattern)", we utilize the sample standard deviation: the "Standard Error". A smaller standard error implies a narrower confidence interval - which means we can use the estimator to predict future values with "more confidence".
This time, let's look at one of the most widely used and important models - linear regression.
Linear regression model - What is the "correct" way to use it?
I think linear regression is one of the most widely used modeling methods, and many people are familiar with its concepts, its formulation, or how to implement it in Python or R code. However, it is more important to use it with a better understanding. For any kind of data that "seems to have a linear correlation", if we put it into a linear regression model, it gives some prediction. However, to what extent can we trust this outcome? What are the factors that impact the accuracy of a prediction based on linear regression?
In one of the previous posts, on MLE (maximum likelihood estimation), we looked into linear regression in terms of the Bayesian approach.
Recap: We can interpret "Likelihood" as: given the observation or data, how likely is it that the data came from a certain model? For instance, when we toss a coin 10 times and see heads 2 times, it is more reasonable to think the probability of getting heads is lower than 50%. Here, if we say \(\theta\) is the probability of getting heads, \(L(\theta) = \theta^x (1-\theta)^{n-x}\) where \(n = 10\) and \(x = 2\). The likelihood is maximized when \(\theta\) is around \(x/n = 0.2\).
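To see this numerically, here is a minimal sketch (with the same illustrative numbers) that evaluates the likelihood on a grid of \(\theta\) values and finds where it peaks:

```python
import numpy as np

# Coin-toss likelihood: n = 10 tosses, x = 2 heads (illustrative numbers).
n, x = 10, 2
thetas = np.linspace(0.01, 0.99, 99)
likelihood = thetas**x * (1 - thetas)**(n - x)

# The grid point that maximizes the likelihood is close to x / n = 0.2.
print(round(thetas[np.argmax(likelihood)], 2))
```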
As we can see from the illustration below, it is more likely that the data shown were generated from "true line" model A, rather than model B. The MLE approach tries to find a model that maximizes the likelihood - and we found out that it is the same as the linear regression model with regularization.
This time, let's try to understand linear regression itself.
Simple Linear Regression - It's All About Conditional Prediction
I love using easy examples, so let's keep using the example of elephants. I think many people are used to illustrations like the one above. Linear regression models the pattern between two (or more) variables. However, it is also important to understand that it is "expanding the prediction of Y into a conditional prediction given X."
Let's remove the X axis, projecting the data onto the Y axis. Now we only have sample data on "weight", so the best predictor for the weight of "all elephants in the world" would be the sample mean - this is exactly the same as the mean estimation we did in the previous post. However, when the additional information "Age" is given (X), we can intuitively see that our model could be more precise. Why? Because elephants tend to have different weights based on their age, but the mean prediction does not reflect this information. If we split up the group of elephants by age range, we can see that their "group mean" moves.
So, what is the conclusion here?
- Linear regression is a conditional prediction, predicting the value of Y given that X=x, where x is the value of the data point.
- Also, we can recognize that "linear regression" (or conditional prediction) is better than a simple mean prediction when Y moves as X moves - which means X and Y are somehow "correlated". (The sketch below illustrates this with toy data.)
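Here is a minimal sketch, using hypothetical elephant numbers, comparing the single unconditional mean with age-group means that already track the trend:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical elephant data (ages in years, weights in kg) - purely illustrative numbers.
age = rng.uniform(1, 40, size=200)
weight = 120 * age + 500 + rng.normal(0, 400, size=200)

# Unconditional prediction: one number for every elephant.
mean_pred = weight.mean()

# Conditional prediction: group means over age bins already move with age.
bins = np.digitize(age, [10, 20, 30])
group_means = [weight[bins == b].mean() for b in range(4)]

print("overall mean:", round(mean_pred))
print("age-group means:", [round(m) for m in group_means])
```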
Quick Review on Correlation and Covariance
Let's go back to statistics class for a moment. "Covariance" measures how the random variables X and Y move together (increase, or decrease, together): \[\mathrm{Cov}(X, Y) = \frac{\sum_i^n (X_i - \bar{X})(Y_i - \bar{Y})}{n}\] Have you thought about the implication behind this formulation? Covariance is high when, for each data point, whenever the value of x is far from its mean, the value of y is also far from its mean. If the deviation from the mean is large for one variable but small for the other, the contributions cancel out and the covariance stays small.
Look at the illustrations below. The two cases have the same means of X and Y, but only in the first case are the two variables correlated. We can see that in the first case, as X deviates from its mean, Y also deviates from its mean. However, in the second case, for many data points the Y value does not deviate from its mean while X deviates from its mean.
\[\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}\]
For the sample correlation:
\[\mathrm{Corr}(X, Y)_{sample} = \frac{s_{XY}}{s_X s_Y}\]
Correlation is basically computed by dividing the covariance by each random variable's standard deviation. By doing this, we standardize the covariance to a value between -1 and 1. Correlation does represent the linear relationship between random variables in some sense, but there is one problem: correlation is unitless, so we cannot use it to predict future values. For instance, suppose the age and weight of elephants show a high correlation of 0.8. What does this number 0.8 imply? It only shows "how strong the correlation is" as a relative metric, but does not tell us anything about "how to predict weight, given age". However, the correlation itself is related to the linear regression coefficient.
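As a quick sanity check, here is a minimal sketch (hypothetical elephant numbers again) that computes the sample covariance and correlation directly from the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical elephant sample (illustrative numbers only).
age = rng.uniform(1, 40, size=200)
weight = 120 * age + 500 + rng.normal(0, 400, size=200)

# Sample covariance and correlation, written out from the formulas above.
s_xy = np.sum((age - age.mean()) * (weight - weight.mean())) / (len(age) - 1)
corr = s_xy / (age.std(ddof=1) * weight.std(ddof=1))

print(round(s_xy, 1))                             # covariance, in units of years * kg
print(round(corr, 3))                             # correlation, unitless, between -1 and 1
print(round(np.corrcoef(age, weight)[0, 1], 3))   # should match the manual value
```

Note that the covariance carries units (years times kilograms here), while the correlation is unitless.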
Linear regression and correlation
\[Y = b_0 + b_1 X\]
Now, let's look at the linear regression model in detail. We are fitting a line to the dataset by representing Y as a linear function of X. \(b_0\) and \(b_1\) are the two parameters that decide how to draw the line. Based on the given data, what we are doing here is the same as what we did for the mean estimation:
- Assuming there is a linear "true line" that generated the data.
- Taking the given data as a sample, we are "estimating" the "true line" equation.
- If our estimated line seems to be appropriately accurate (which means it shows a small standard error), we can use this estimated line to predict unseen future values.
The values we are estimating are \(b_0\) and \(b_1\). For our given dataset, the best estimator should have the least squared error, as we also mentioned in this post. I will skip how the Least Square solution is derived, but the Least Square Solution is known as: \[b_1 = \frac{\sum_i^N (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i^N (X_i - \bar{X})^2} = \frac{s_{XY}}{s_X^2}\] \[b_0 = \bar{Y} - b_1 \bar{X}\]
Doesn't this look familiar? The slope is the sample correlation multiplied by \(s_Y\) over \(s_X\): \[b_1 = \mathrm{Corr}(X, Y)_{sample} \frac{s_Y}{s_X}\] This implies that:
- The regression coefficient (slope) is the correlation adjusted (scaled) into the units of Y: divided by the std of X and multiplied by the std of Y.
- In linear regression, \(b_1 = 0\) means the two variables are not related, so we cannot predict Y with X. If the correlation between X and Y is 0, \(b_1\) is also 0. This explains the intuition that a linear regression model is only valid if the variables are correlated. (The sketch below verifies this relationship numerically.)
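Here is a minimal sketch (same hypothetical elephant numbers) computing the least squares slope by hand and checking that it equals the sample correlation rescaled by \(s_Y / s_X\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical elephant sample (illustrative numbers only).
age = rng.uniform(1, 40, size=200)
weight = 120 * age + 500 + rng.normal(0, 400, size=200)

# Least-squares slope and intercept, written out from the formulas above.
b1 = np.sum((age - age.mean()) * (weight - weight.mean())) / np.sum((age - age.mean()) ** 2)
b0 = weight.mean() - b1 * age.mean()

# The slope is the sample correlation rescaled into units of Y per unit of X.
corr = np.corrcoef(age, weight)[0, 1]
b1_from_corr = corr * weight.std(ddof=1) / age.std(ddof=1)

print(round(b1, 2), round(b1_from_corr, 2))   # the two slopes should match
print(np.polyfit(age, weight, deg=1))         # cross-check: [slope, intercept] from NumPy
```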
Two more insights on the residual e, the error \(\epsilon\), and the Least Square solution.
Least Square Solution
The Least Square Solution is known as: \[b_1 = \frac{\sum_i^N (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i^N (X_i - \bar{X})^2} = \frac{s_{XY}}{s_X^2}\]
And: \[b_0 = \bar{Y} - b_1 \bar{X}\]
There is one more implication behind this. If we take the \(b_0\) formula and plug it into \(Y = b_0 + b_1 X\), we get \(Y = \bar{Y} + b_1(X - \bar{X})\), which shows that the Least Square line must pass through \((\bar{X}, \bar{Y})\).
More about the residual and the error
Now, let's dive deep into \(\epsilon\) and \(e\).
In the illustration above, let's say that the "blue line" is the "ideal true line" that describes the linear relationship between the age and weight of all elephants in the world. We are trying to "predict" this line from the given data. Here is a thing we need to keep in mind: the "true line" itself also has an "irreducible error", and we define this as \(\epsilon\).
Even if we assume that the "true line" is the ideal line, it has error - unless all data points are located exactly on the true line. We call this "irreducible error", because even if we estimate the true line precisely (which won't be possible unless we have infinitely many data points), we cannot reduce this error. This is why we formulate the "true line" as
\(Y = \beta_0 + \beta_1 X + \epsilon\), and our fitted relationship as \(Y = b_0 + b_1 X + e = \hat{Y} + e\), where \(e\) is the residual. Here are the important insights about the residuals of the "least squares estimator":
- The mean of the residuals should be 0.
- The correlation between \(e\) and X should be 0. (\(\mathrm{Corr}(e, X) = 0\))
- The correlation between \(\hat{Y}\) and \(e\) should be zero. (Since \(\hat{Y}\) depends only on X, if the second property holds, this one also holds.)
The first one seems more intuitive: if the regression model has the "least squared error", its errors cancel out and sum to zero. How about the second one? It implies that "the error should be consistent over the entire range of X". The sketch below checks these properties numerically.
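A minimal sketch (hypothetical elephant numbers) that fits the line and verifies the residual properties:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical elephant sample (illustrative numbers only).
age = rng.uniform(1, 40, size=200)
weight = 120 * age + 500 + rng.normal(0, 400, size=200)

# Fit the least-squares line and compute residuals e = Y - Y_hat.
b1, b0 = np.polyfit(age, weight, deg=1)
y_hat = b0 + b1 * age
e = weight - y_hat

print(round(e.mean(), 6))                       # ~0: residuals average out to zero
print(round(np.corrcoef(e, age)[0, 1], 6))      # ~0: residuals uncorrelated with X
print(round(np.corrcoef(e, y_hat)[0, 1], 6))    # ~0: residuals uncorrelated with fitted values
```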
Decomposition of error term - SSE, SSR, TSS, and R Square
In the last post, we discussed that "variability" is important in measuring the predictability (or accuracy) of a model (estimation). This is why we focus on the variance, or the standard error. So, let's look at how the "variability of Y" can be decomposed.
We know that:
- \(\sum_{i=1}^N e_i = 0\)
- \(\mathrm{Corr}(e, X) = 0\)
- \(\mathrm{Corr}(\hat{Y}, e) = 0\)
Therefore, we can write: \[\mathrm{Var}(Y) = \mathrm{Var}(\hat{Y} + e) = \mathrm{Var}(\hat{Y}) + \mathrm{Var}(e) + 2\mathrm{Cov}(\hat{Y}, e) = \mathrm{Var}(\hat{Y}) + \mathrm{Var}(e)\] Here, the "variability of Y" is decomposed into the variability of \(\hat{Y}\) and the variability of \(e\). We discussed that the error term is related to the "irreducible error". Therefore, only \(\mathrm{Var}(\hat{Y})\) is the variability related to the regression model we built - the part we can explain with our regression model. So, we would hope that the variability related to our regression is much larger than the irreducible error - because that would mean our model explains a large part of the variance in Y, and our model fits better.
This is what "R Square" is all about.
Why is R Square not an "accuracy measure", but a measure of "fit"?
It is easy to become too obsessed with R Square and misinterpret it as an "accuracy measure". However, if we look up the definition, R Square is defined as the coefficient of determination. This is the definition from Wikipedia: "R Square is the proportion of the variation in the dependent variable that is predictable from the independent variable(s)."
We have seen that \(\mathrm{Var}(Y) = \mathrm{Var}(\hat{Y}) + \mathrm{Var}(e)\). Since the sample variance of a random variable is \(\frac{1}{n-1}\sum (x_i - \bar{x})^2\), we can rewrite the decomposition as: \[\sum_{i=1}^N (Y_i - \bar{Y})^2 = \sum_{i=1}^N (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^N e_i^2\] These are the definitions of TSS, SSR, and SSE: \[TSS = SSR + SSE\]
\[R^2 = \frac{SSR}{TSS} = 1 - \frac{SSE}{TSS}\]
Now, do you see the intention and motivation behind R Square?
- R Square does not answer the question: "How accurate is the model?"
- It answers: "How much of the variability is explained by the model?"
This is about "goodness of fit", not about the accuracy of prediction.
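Here is a minimal sketch (hypothetical elephant numbers) that builds TSS, SSR, and SSE from their definitions and confirms the two equivalent forms of \(R^2\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical elephant sample (illustrative numbers only).
age = rng.uniform(1, 40, size=200)
weight = 120 * age + 500 + rng.normal(0, 400, size=200)

# Fit the line and build the three sums of squares from their definitions.
b1, b0 = np.polyfit(age, weight, deg=1)
y_hat = b0 + b1 * age

tss = np.sum((weight - weight.mean()) ** 2)   # total variability of Y
ssr = np.sum((y_hat - weight.mean()) ** 2)    # variability explained by the model
sse = np.sum((weight - y_hat) ** 2)           # residual (unexplained) variability

print(round(tss, 1), round(ssr + sse, 1))            # TSS = SSR + SSE
print(round(ssr / tss, 3), round(1 - sse / tss, 3))  # two equivalent forms of R^2
```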
Then, how do we evaluate accuracy? - Always Standard Error.
So, how do we evaluate the model in terms of the accuracy of prediction? We did a very similar thing for the mean estimation in the previous post - the Standard Error. The only difference is that this time we have three parameters:
- \(\sigma_e\): the standard error of the regression. We need to estimate how large the "standard deviation" of the residuals is.
- \(\beta_0\) and \(\beta_1\): we need to evaluate how our estimated parameters are distributed, and estimate their standard deviations through the standard errors of the estimates.
This is why I covered the concept of standard error in detail, with the simple example of mean estimation, in the previous post. Whatever we are estimating, what matters is the "standard deviation" - the higher the standard deviation, the wider the "range (interval)" in which unseen future data is likely to be located. For a 95% confidence interval, roughly 95% of the data is located within the estimated point \(\pm 2\sigma\), so a larger sigma causes a wider interval.
I won't go deep into the calculation details, but each of the standard errors can be derived as follows:
- \(\sigma_e\): we use the sample standard deviation instead: \[\hat{\sigma}_\epsilon^2 = s^2 = \frac{1}{N-2}\sum_{i=1}^N e_i^2 = \frac{SSE}{N-2}\]
- \(\sigma_{\beta_1}\): we use the sample standard deviation here too: \[\sigma_{\beta_1} = \sqrt{\frac{\sigma_\epsilon^2}{(N-1)s_X^2}} \;\rightarrow\; \sqrt{\widehat{\mathrm{Var}}(b_1)} = s_{b_1} = \sqrt{\frac{s^2}{(N-1)s_X^2}}\]
Don't worry about the computation - when we do the regression analysis with Python or R, it will compute and report all these numbers. However, the thing we need to know is: "the standard error indicates how credible each of the estimated parameters is."
- A lower SE for the slope \(b_1\) implies that the "regression coefficient" between the two variables is credible.
- A lower SE of the regression means that, if we make a prediction based on the linear model, it is highly likely that the actual value will be close enough to our predicted value! (The sketch below computes both standard errors by hand.)
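A minimal sketch (hypothetical elephant numbers) computing the standard error of the regression and the standard error of the slope from the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical elephant sample (illustrative numbers only).
age = rng.uniform(1, 40, size=200)
weight = 120 * age + 500 + rng.normal(0, 400, size=200)
N = len(age)

# Fit the line and compute residuals.
b1, b0 = np.polyfit(age, weight, deg=1)
e = weight - (b0 + b1 * age)

# Standard error of the regression: s^2 = SSE / (N - 2).
s = np.sqrt(np.sum(e ** 2) / (N - 2))

# Standard error of the slope: s_b1 = sqrt(s^2 / ((N - 1) * s_X^2)).
se_b1 = np.sqrt(s ** 2 / ((N - 1) * age.var(ddof=1)))

print(round(s, 1))      # spread of the data around the fitted line
print(round(se_b1, 3))  # uncertainty of the estimated slope
```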
One More Thing - What matters for a low SE, and why does log-scale work?
For the standard error \(\hat{\sigma}_\epsilon^2\), there is a similar implication to the one we saw in the mean estimation problem: as the sample size gets larger, the standard error diminishes. How about the standard error of the slope coefficient? \[\sqrt{\frac{s_e^2}{(N-1)s_X^2}}\]
- Numerator: a smaller sample standard deviation of the "errors" makes the standard error of the slope smaller. (This is not the standard deviation of the sample data, but the standard deviation of the sample "errors" - \(\hat{\sigma}_\epsilon^2\).)
This is one of the reasons why log-scaling works. Since log-scaling compresses the scale of the data, it helps reduce variability when the values are too spread out. Log-scaling mitigates skewness and transforms the data to fit linear assumptions better.
- Denominator:
  - A larger sample makes the standard error of the slope smaller.
  - A larger standard deviation of X makes the standard error of the slope smaller.
Everything above seems intuitive, except for one thing - the last one. Isn't it better to have a "smaller" variance in the data?
However, a larger variance in X implies having more "varied" data. If X covers a wider range of values, we have more information about how Y changes with X. The sketch below illustrates both denominator effects.
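Here is a minimal sketch with purely hypothetical simulated data, showing how the standard error of the slope shrinks when the sample gets larger or when X covers a wider range:

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_se(x_spread, n):
    # Fit a line to simulated data and return the standard error of the slope.
    x = rng.uniform(0, x_spread, size=n)
    y = 2.0 * x + rng.normal(0, 5.0, size=n)   # hypothetical linear data with noise
    b1, b0 = np.polyfit(x, y, deg=1)
    e = y - (b0 + b1 * x)
    s2 = np.sum(e ** 2) / (n - 2)
    return np.sqrt(s2 / ((n - 1) * x.var(ddof=1)))

print(round(slope_se(x_spread=10, n=50), 3))    # baseline
print(round(slope_se(x_spread=10, n=500), 3))   # more data -> smaller SE of the slope
print(round(slope_se(x_spread=100, n=50), 3))   # wider X range -> smaller SE of the slope
```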
Wait, then what is the accuracy of a "prediction" with this model like?
Now we know how to evaluate our model, and what kinds of factors impact these accuracies. However, the standard errors we have looked at are about "errors" within the model - how far do the data in general deviate from the fitted line? How far is our fitted line from the actual "true line"?
Then, how do we measure the accuracy of a "point estimate", if we are estimating an unseen future Y value given some X value?
Sampling Errors
When making a prediction, we need to consider the Sampling Error. This is because our model only took a "sample" of data, not the entire population (which is impossible) - so our estimators of \(\beta_0, \beta_1\) use sample values, and we need to make an adjustment for this. We can decompose the prediction error into a "sampling error" and the irreducible error (from the variability of Y, which is unrelated to X): \[e_f = \epsilon_f - Error_{sampling} = Y_f - b_0 - b_1 X_f\] which can be decomposed as: \[= \epsilon_f + (\beta_0 + \beta_1 X_f) - (b_0 + b_1 X_f)\] So, here the sampling error can be represented as: \[(b_0 - \beta_0) + (b_1 - \beta_1) X_f\] which is the error between the "estimated model" and "true line" parameters. Let's leave the computation to the computer. The standard error of the point prediction is computed as the following: \[S_{pred} = s\sqrt{1 + \frac{1}{N} + \frac{(X_f - \bar{X})^2}{(N-1)s_X^2}}\] Here, \(s\) is the standard error of the regression: \(\hat{\sigma}_\epsilon^2 = s^2 = \frac{1}{N-2}\sum_{i=1}^N e_i^2 = \frac{SSE}{N-2}\).
So, how is the standard error decomposed?
- The \(1\) under the square root (scaled by \(s\), the variation of \(\epsilon\)) is the SE of the regression, which is unrelated to X.
- \(\frac{1}{N} + \frac{(X_f - \bar{X})^2}{(N-1)s_X^2}\) is the part that is related to X - the sampling error.
Here are 3 implications, which are similar to all the other standard errors we have looked into:
- A larger N matters.
- A larger \(s_X\) matters (more variability in the sample data X).
- As \(X_f\) gets farther from \(\bar{X}\), \(S_{pred}\) gets larger.
Only the last one is a new intuition. Drawing an illustration of the plot actually helps in understanding why more intuitively:
Look at the green line and the two points X1 and X2. X2 is far from the mean of X, and we can see how the point estimate at X2 is farther from the true value compared to X1! (Remember, the Least Square estimator must pass through the mean of X and Y.) The sketch below evaluates \(S_{pred}\) at both kinds of points.
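A minimal sketch (hypothetical elephant numbers) computing \(S_{pred}\) near the mean of X and far from it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical elephant sample (illustrative numbers only).
age = rng.uniform(1, 40, size=200)
weight = 120 * age + 500 + rng.normal(0, 400, size=200)
N = len(age)

# Fit the line and compute the standard error of the regression.
b1, b0 = np.polyfit(age, weight, deg=1)
e = weight - (b0 + b1 * age)
s = np.sqrt(np.sum(e ** 2) / (N - 2))

def s_pred(x_f):
    # Standard error of a point prediction at a new value x_f.
    return s * np.sqrt(1 + 1 / N + (x_f - age.mean()) ** 2 / ((N - 1) * age.var(ddof=1)))

# Predicting near the mean of X is more precise than predicting far from it.
print(round(s_pred(age.mean()), 1))   # X_f close to the mean -> smaller S_pred
print(round(s_pred(60.0), 1))         # X_f far from the observed range -> larger S_pred
```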
Now we have covered what determines a "more accurate" estimation.
Building on this baseline, I will try to cover Logistic Regression and Multivariate Regression in later posts.