

All About Linear Regression and Evaluation - [DS / ML Back to the Basics Series #2]

šŸ—’ļø Related Post: Statistical Implication of Parameter Estimation and Standard Error

This time, I will cover the ‘Linear Regression Model’. I know there are a bunch of great resources on linear regression itself, so I will focus more on some of the significant implications of linear regression, and some of the theoretical background behind methods we might have used without deeper understanding. Some of the questions answered are:

  • What does Linear Regression imply in terms of conditional prediction? (Why does it work better than mean estimation?)
  • What are the differences between simply showing a high correlation and doing a linear regression?
  • What impacts the accuracy of a linear regression prediction (or, its standard error)?
  • Why do we use the log scale? Is it just because the numbers are so large?

Last time, we identified:

  1. ‘Modeling’ implies finding some kind of function that can describe and predict ‘real world patterns’.
  2. Since it is impossible to get the precise value of the ‘true pattern’, we use an ‘estimator’ (e.g. the sample mean as a mean estimator, to estimate the mean of the population data).
  3. ‘Standard Error’ tells us about the confidence interval of the ‘true pattern’ - e.g. even if we somehow identified the ‘true mean’ of all elephants, if the standard deviation is too large, the mean would not help predict future values ‘with confidence’.
  4. So, knowing the standard deviation is important, but since we cannot figure out the ‘population standard deviation (std of the true pattern)’, we utilize the sample standard deviation: ‘Standard Error’. A smaller standard error implies a narrower confidence interval - which means we can use the estimator to predict future values with ‘more confidence’.

This time, let’s look at one of the most widely used and important models - linear regression.


Linear regression model - What is the ‘correct’ way to use it?

Linear regression is one of the most widely used modeling methods, and many people are familiar with its concepts, its formulation, or how to implement it in Python or R. However, it is more important to use it with a deeper understanding. For any kind of data that ‘seems to have a linear correlation’, if we put it into a linear regression model, it gives some prediction. However, to what extent can we trust this outcome? What are the factors that impact the accuracy of a prediction based on linear regression?


In a previous post on MLE (maximum likelihood estimation), we looked into linear regression from a Bayesian point of view.

Recap: we can interpret ‘likelihood’ as: given the observation or data, how likely is it that the data came from a certain model? For instance, when we toss a coin 10 times and see heads 2 times, it is more likely that the probability of getting heads is lower than 50%. Here, if we say θ is the probability of getting heads, \[ L(\theta) = \theta^x (1-\theta)^{n-x} \] where x = 2 and n = 10. The likelihood is maximized when θ is around 0.2.
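Here is a tiny illustrative sketch (not from the original post) that evaluates this likelihood on a grid of θ values and confirms the maximum sits near 0.2:

```python
import numpy as np

# Coin-toss likelihood: 2 heads out of 10 tosses
n, x = 10, 2
theta = np.linspace(0.001, 0.999, 999)          # candidate values of theta
likelihood = theta**x * (1 - theta)**(n - x)    # L(theta) = theta^x * (1 - theta)^(n - x)

print(theta[np.argmax(likelihood)])  # ~0.2, i.e. the maximum likelihood estimate x/n
```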

As we can see from the illustration below, it is more likely that the data were generated from ‘true line’ model A rather than model B. The MLE approach tries to find the model which maximizes the likelihood - and we found out that it is the same as the linear regression model with regularization.

[Illustration: data points compared against candidate ‘true line’ models A and B]

This time, let’s try to understand linear regression itself.

Simple Linear Regression - It’s All About Conditional Prediction

[Illustration: scatter plot of elephant age (X) vs. weight (Y) with a fitted line]

I love using easy examples, so let’s keep using the elephant example. Many people are used to illustrations like the one above. Linear regression models the pattern between two (or more) variables. However, it is also important to understand that it is ‘expanding the prediction of Y into a conditional prediction given X.’

[Illustration: the same data projected onto the Y axis only]

Let’s remove the X axis, projecting the data onto the Y axis. Now we only have sample statistics of ‘weight’, so the best predictor for the weight of ‘all elephants in the world’ would be the sample mean - this is exactly the same as the mean estimation we did in the previous post. However, when the additional information ‘Age’ (X) is given, we can intuitively see that our model could be more precise. Why? Because elephants tend to have different weights based on their age, but the mean prediction does not reflect this information. If we split the group of elephants by age range, we can see that their ‘group mean’ moves.

[Illustration: group means of weight shifting across age groups]
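To make the idea concrete, here is a minimal sketch (the elephant numbers are made up purely for illustration) comparing the single overall mean with the mean within each age group:

```python
import numpy as np
import pandas as pd

# Hypothetical elephant data: age (years) and weight (kg)
df = pd.DataFrame({
    "age":    [2, 3, 5, 8, 12, 15, 20, 25, 30, 35],
    "weight": [800, 950, 1400, 2100, 2900, 3300, 3900, 4300, 4600, 4800],
})

# One unconditional prediction: the sample mean of weight
print(df["weight"].mean())

# Conditional (group) means: the best guess shifts as the age group changes
print(df.groupby(pd.cut(df["age"], bins=[0, 10, 20, 40]))["weight"].mean())
```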

So, what is the conclusion here?

  • Linear regression is a conditional prediction, predicting the value of Y given that X = x, where x is the value of a data point.
  • Also, we can recognize that ‘linear regression’ (or conditional prediction) is better than simple mean prediction when Y moves as X moves - which means X and Y are somehow ‘correlated’.

Quick Review on Correlation and Covariance

Let’s go back to statistics class for a moment. ‘Covariance’ measures how the random variables X and Y move together (increase, or decrease, together): \[ Cov(X, Y) = \frac{\sum_{i}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n} \] Have you thought about the implication behind this formulation? Covariance is high when, for each data point, whenever the value of x is far from its mean, the value of y is also far from its mean. If the deviation from the mean is large for one variable but small for the other, the covariance gets canceled out.

Look at the illustrations below. The two cases have the same mean of X and Y, but only in the first case are the two variables correlated. We can see that in the first case, as X deviates from the mean, Y also deviates from the mean. However, in the second case, for many data points the Y value does not deviate from its mean while X deviates from its mean.

[Illustration: two scatter plots with the same means of X and Y - one correlated, one not]

\[ Corr(X, Y) = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} \]

For the sample correlation: \[ Corr(X, Y)_{sample} = \frac{s_{XY}}{s_X s_Y} \]

Correlation is basically computed by dividing the covariance by each random variable’s standard deviation. By doing this, we are standardizing the covariance to a value between -1 and 1. Correlation represents the linear relationship between random variables in some sense, but there is one problem: correlation is unitless, so we cannot use it to predict future values. For instance, suppose the age and weight of elephants show a high correlation of 0.8. What does this number 0.8 imply? It only shows ‘how strong the correlation is’ as a relative metric, but does not tell us anything about ‘how to predict weight, given age’. However, the correlation itself is related to the linear regression coefficient.
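As a quick numerical check of the formulas above (using the same hypothetical elephant numbers as earlier):

```python
import numpy as np

x = np.array([2, 3, 5, 8, 12, 15, 20, 25, 30, 35], dtype=float)                        # age
y = np.array([800, 950, 1400, 2100, 2900, 3300, 3900, 4300, 4600, 4800], dtype=float)  # weight

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # covariance (population form)
corr_xy = cov_xy / (x.std() * y.std())              # standardized to the range [-1, 1]

print(cov_xy)                   # depends on the units of x and y
print(corr_xy)                  # unitless
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in gives the same correlation
```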

Linear regression and correlation

\[ Y = b_0 + b_1 X \]


Now, let’s look at the linear regression model in detail. We are fitting a line on the dataset by representing Y as a linear function of X. b0 and b1 are the two parameters that decide how to draw the line. Based on the given data, what we are doing here is the same as what we did for the mean estimation:

  1. Assume there is a linear ‘true line’ that generated the data.
  2. Taking the given data as a sample, we ‘estimate’ the ‘true line’ equation.
  3. If our estimated line seems sufficiently accurate (which means it shows a small standard error), we can use this estimated line to predict unseen future values.

The values we are estimating are b0 and b1. From our given dataset, the best estimator should have the least squared error, as we also mentioned in this post. I will skip how the least squares solution is derived, but it is known to be: \[ b_1 = \frac{\sum_{i}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i}^{N} (X_i - \bar{X})^2} = \frac{s_{XY}}{s_X^2} \] \[ b_0 = \bar{Y} - b_1 \bar{X} \]

Doesn’t this look familiar? This is the sample correlation multiplied by s_Y over s_X: \[ b_1 = Corr(X, Y)_{sample} \frac{s_Y}{s_X} \] This implies that:

  • The regression coefficient (slope) rescales the correlation into the units of Y: it is divided by the standard deviation of X and multiplied by the standard deviation of Y.
  • In linear regression, b1 = 0 means the two variables are not related, so we cannot predict Y with X. If the correlation between X and Y is 0, it makes b1 0 as well. This explains the intuition that a linear regression model is only valid if the variables are correlated.
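A short sketch verifying this relationship numerically (same hypothetical elephant numbers):

```python
import numpy as np

x = np.array([2, 3, 5, 8, 12, 15, 20, 25, 30, 35], dtype=float)
y = np.array([800, 950, 1400, 2100, 2900, 3300, 3900, 4300, 4600, 4800], dtype=float)

# Least squares formulas
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Same slope obtained by rescaling the sample correlation into the units of Y
r = np.corrcoef(x, y)[0, 1]
b1_from_corr = r * y.std(ddof=1) / x.std(ddof=1)

print(b1, b1_from_corr)     # identical (up to floating point)
print(np.polyfit(x, y, 1))  # [slope, intercept] from NumPy's built-in fit
```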

Two more insights on the residual e, the error ε, and the least squares solution.

Least Square Solution

The least squares solution is known to be: \[ b_1 = \frac{\sum_{i}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i}^{N} (X_i - \bar{X})^2} = \frac{s_{XY}}{s_X^2} \]

And: \[ b_0 = \bar{Y} - b_1 \bar{X} \]

There is one more implication behind this. If we take the b0 formula and plug it into Y = b0 + b1X, we can see that this least squares line must pass through \((\bar{X}, \bar{Y})\).
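The check is a single substitution: evaluate the fitted line at \(X = \bar{X}\) and use the formula for b0.

\[ \hat{Y}\big|_{X=\bar{X}} = b_0 + b_1 \bar{X} = (\bar{Y} - b_1 \bar{X}) + b_1 \bar{X} = \bar{Y} \]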

[Illustration: fitted regression line passing through \((\bar{X}, \bar{Y})\)]

More about the residual and the error

Now, let’s dive deeper into ε and e.

[Illustration: data points scattered around the blue ‘true line’ relating age and weight]

In the illustration above, let’s say that the blue line is the ‘ideal true line’ describing the linear relationship between the age and weight of all elephants in the world. We are trying to ‘predict’ this line from the given data. Here is something we need to keep in mind: the ‘true line’ itself also has an ‘irreducible error’, and we define this as ε.

Even if we assume that the ‘true line’ is the ideal line, it has error - unless all data points are located exactly on the true line. We call this ‘irreducible error’, because even if we predicted the true line precisely (which won’t be possible unless we have infinite data points), we cannot reduce this error. This is why we formulate the ‘true line’ as

\[ Y = \beta_0 + \beta_1 X + \epsilon \] and the fitted relationship as \[ Y = b_0 + b_1 X + e \] where \(\hat{Y} = b_0 + b_1 X\) is our prediction and e is the residual. Here are three important properties of the residuals of the ‘least squares estimator’:

  1. Mean of residual should be 0.
  2. Correlation between e and X should be 0. (Corr(e,X)=0)
  3. Correlation between \(\hat{Y}\) and e should be zero. (Since \(\hat{Y}\) depends on X, if 2 is true, this is also true.)

The first one is the most intuitive: if the regression model minimizes squared error, its residuals cancel out and sum to zero. How about the second one? It implies that ‘the error should be consistent over the entire range of X’.
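These properties are easy to verify numerically - a sketch (same hypothetical data) fitting a line and checking the residuals:

```python
import numpy as np

x = np.array([2, 3, 5, 8, 12, 15, 20, 25, 30, 35], dtype=float)
y = np.array([800, 950, 1400, 2100, 2900, 3300, 3900, 4300, 4600, 4800], dtype=float)

b1, b0 = np.polyfit(x, y, 1)   # least squares fit
y_hat = b0 + b1 * x
e = y - y_hat                  # residuals

print(e.mean())                     # ~0  (property 1)
print(np.corrcoef(e, x)[0, 1])      # ~0  (property 2)
print(np.corrcoef(e, y_hat)[0, 1])  # ~0  (property 3)
```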

Decomposition of error term - SSE, SSR, TSS, and R Square

In the last post, we discussed that ‘variability’ is important for measuring the predictability (or accuracy) of a model (estimate). This is why we focus on variance, or standard error. So, let’s look at how the ‘variability of Y’ can be decomposed.

We know that:

  • \(\sum_{i=1}^{N} e_i = 0\)
  • \(corr(e, X) = 0\)
  • \(corr(\hat{Y}, e) = 0\)

Therefore, we can write \[ Var(Y) = Var(\hat{Y} + e) = Var(\hat{Y}) + Var(e) + 2Cov(\hat{Y}, e) = Var(\hat{Y}) + Var(e) \] Here, the ‘variability of Y’ is decomposed into the variability of \(\hat{Y}\) and the variability of e. We discussed that the error term is related to the ‘irreducible error’. Therefore only \(Var(\hat{Y})\) is the variability related to the regression model we build - the part we can explain with our regression model. So, we would hope that the variability related to our regression is much higher than the irreducible error - because that means our model explains a large part of the variance in Y, and fits better.

This is what ‘R Square’ is all about.

Why is R Square a measure of ‘fit’, not an ‘accuracy measure’?

It is easy to become too obsessed with R Square and misinterpret it as an ‘accuracy measure’. However, if we look up the definition, R Square is defined as the coefficient of determination. This is the definition from Wikipedia: R Square is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

We have seen that \(Var(Y) = Var(\hat{Y}) + Var(e)\). Since the sample variance of a random variable is \(\frac{1}{n-1}\sum (x_i - \bar{x})^2\), we can rewrite the decomposition as: \[ \sum_{i=1}^{N} (Y_i - \bar{Y})^2 = \sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{N} e_i^2 \] These are the definitions of TSS, SSR, and SSE: \[ TSS = SSR + SSE \]


\[R^2 = \frac{SSR}{TSS} = 1 - \frac{SSE}{TSS}\]

Now, do you see the intention and motivation behind R Square?

  • R Square does not answer the question ‘how accurate is the model?’
  • It answers: ‘How much of the variability is explained by the model?’

This is about ‘goodness of fit’, not about the accuracy of prediction.
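A short sketch making the decomposition concrete - computing TSS, SSR, and SSE from a fit (same hypothetical data) and confirming that R Square is the share of variability explained:

```python
import numpy as np

x = np.array([2, 3, 5, 8, 12, 15, 20, 25, 30, 35], dtype=float)
y = np.array([800, 950, 1400, 2100, 2900, 3300, 3900, 4300, 4600, 4800], dtype=float)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)      # total variability of Y
ssr = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the model
sse = np.sum((y - y_hat) ** 2)         # residual (unexplained) variability

print(tss, ssr + sse)            # TSS = SSR + SSE (up to floating point)
print(ssr / tss, 1 - sse / tss)  # both expressions give R^2
```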

Then, how do we evaluate accuracy? - Always Standard Error.

So, how do we evaluate the model in terms of the accuracy of prediction? We did a very similar thing for the mean estimation in the previous post - the standard error. The only difference is that this time we have three parameters:

  • \(\sigma_e\): the standard error of the regression. We need to estimate how large the ‘standard deviation’ of the residuals is.
  • \(\beta_0\) and \(\beta_1\): we need to evaluate how our estimated parameters are distributed, and estimate their standard deviation through the standard error of the estimates.

This is why I covered the concept of standard error in detail, with the simple example of mean estimation, in the previous post. Whatever we are estimating, what matters is the ‘standard deviation’ - the higher the standard deviation, the wider the ‘range (interval)’ in which unseen future data is likely to be located. For a 95% confidence interval, 95% of the data is located within the estimated point ±2σ, so a larger sigma causes a wider interval.

[Illustration: wider vs. narrower confidence intervals around an estimate]

I won’t go deep into the calculation details, but each of the standard errors can be derived as follows:

  • \(\sigma_e\): we use the sample standard deviation instead: \[ \hat{\sigma}_\epsilon^2 = s^2 = \frac{1}{N-2}\sum_{i=1}^{N} e_i^2 = \frac{SSE}{N-2} \]

  • \(\sigma_{\beta_1}\): we use the sample standard deviation here too: \[ \sigma_{b_1} = \sqrt{Var(b_1)} = \sqrt{\frac{\sigma_\epsilon^2}{(N-1)s_x^2}} \approx s_{b_1} = \sqrt{\frac{s^2}{(N-1)s_x^2}} \]

Don’t worry about the computation - when we do regression analysis with Python or R, it computes and reports all of these numbers. However, the thing we need to know is that ‘the standard error indicates how credible each of the estimated parameters is’.

  • A lower SE for the slope b1 implies that the ‘regression coefficient’ relating the two variables is credible.
  • A lower SE of the regression means that if we make a prediction based on the linear model, it is highly likely that the actual value will be close enough to our predicted value!
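In practice these standard errors are reported for us. A minimal sketch with statsmodels (the elephant numbers are still hypothetical):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([2, 3, 5, 8, 12, 15, 20, 25, 30, 35], dtype=float)
y = np.array([800, 950, 1400, 2100, 2900, 3300, 3900, 4300, 4600, 4800], dtype=float)

X = sm.add_constant(x)        # adds the intercept column
model = sm.OLS(y, X).fit()

print(model.bse)              # standard errors of b0 and b1
print(np.sqrt(model.scale))   # standard error of the regression, s
print(model.summary())        # full table: coefficients, SEs, R-squared, ...
```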

One More Thing - What matters for a low SE, and why does log-scaling work?

For the standard error \(\hat{\sigma}_\epsilon^2\), there is an implication similar to what we saw in the mean estimation problem: as the sample size gets larger, the standard error diminishes. How about the standard error of the slope coefficient? \[ s_{b_1} = \sqrt{\frac{s_e^2}{(N-1)s_x^2}} \]

  • Numerator: a smaller sample standard deviation of the ‘errors’ makes the standard error of the slope smaller. (This is not the standard deviation of the sample data, but the standard deviation of the sample ‘errors’ - \(\hat{\sigma}_\epsilon\).)

This is one of the reasons why log-scaling works. Since log-scaling compresses the scale of the data, it helps reduce variability when the values are spread too widely. Log-scaling mitigates skewness and transforms the data to fit linear assumptions better (a short sketch follows below).

  • Denominator:
    • Larger sample will make standard error of the slope smaller.
    • Larger standard deviation of x will make standard error of the slope smaller.

All of the above seems intuitive, except for one thing - the last one. Isn’t it better to have a ‘smaller’ variance in the data?

However, a larger variance in X implies having more ‘varied’ data. If X covers a wider range of values, we have more information.
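Here is the short sketch on log-scaling promised above, using a made-up right-skewed variable: after np.log, the relative spread and the skew shrink, which makes a linear fit better behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.lognormal(mean=10, sigma=1.0, size=1_000)   # heavily right-skewed data
log_values = np.log(values)

# Relative spread (std / mean) before and after log-scaling
print(values.std() / values.mean())          # large on the raw scale
print(log_values.std() / log_values.mean())  # much smaller after log-scaling

# Skew: on the raw scale the mean is pulled far above the median
print(np.mean(values) - np.median(values))
print(np.mean(log_values) - np.median(log_values))  # roughly symmetric after log
```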

Wait, then what does the accuracy of a ‘prediction’ with this model look like?

Now we know how to evaluate our model, and what kinds of factors impact these accuracies. However, the standard errors we have looked at are about ‘errors’ within the model - how far do the data in general deviate from the fitted line? How far is our predicted line from the actual ‘true line’?

Then, how do we measure the accuracy of a ‘point estimate’, if we are estimating an unseen future Y value given some X value?

Sampling Errors

When making a prediction, we need to consider sampling error. This is because our model only took a ‘sample’ of the data, not the entire population (which is impossible) - so our estimators of β0 and β1 use sample values, and we need to make an adjustment for this. We can decompose the prediction error into ‘sampling error’ and irreducible error (from the variability of Y, which is irrelevant to X): \[ e_f = \epsilon_f - Error_{sampling} = Y_f - b_0 - b_1 X_f \] which can be decomposed as: \[ = \epsilon_f + (\beta_0 + \beta_1 X_f) - (b_0 + b_1 X_f) \] So, here the sampling error can be represented as \[ (b_0 - \beta_0) + (b_1 - \beta_1) X_f \] which is the error between the ‘estimated model’ and ‘true line’ parameters. Let’s leave the computation to the computer. The standard error of the point prediction, \(s_{pred}\), is computed as the following: \[ s_{pred} = s\sqrt{1 + \frac{1}{N} + \frac{(X_f - \bar{X})^2}{(N-1)s_X^2}} \] Here, s is the standard error of the regression: \( s^2 = \hat{\sigma}_\epsilon^2 = \frac{1}{N-2}\sum_{i=1}^{N} e_i^2 = \frac{SSE}{N-2} \).

So, how is the standard error decomposed?

  • \(s \cdot 1\) (the variation of epsilon) is the SE of the regression, which is unrelated to X
  • \(\frac{1}{N} + \frac{(X_f - \bar{X})^2}{(N-1)s_X^2}\) is the part which is related to X

Here are three implications, similar to all the other standard errors we have looked into:

  1. Larger N matters
  2. A larger \(s_X\) matters (more variability in the sample data X)
  3. As \(X_f\) gets farther from \(\bar{X}\) → larger \(s_{pred}\)

Only the last one is a new intuition. Drawing an illustration of the plot actually helps to understand why more intuitively:

[Illustration: point predictions at X1 (near \(\bar{X}\)) and X2 (far from \(\bar{X}\)), with a larger gap from the true line at X2]

Look at the green lines at X1 and X2: X2 is far from the mean of X, and we can see how the point estimate at X2 is farther from the true value compared to X1! (Remember, the least squares estimator must pass through the mean of X and Y.)
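A quick numerical sketch of the same formula (still the hypothetical elephant numbers), showing how \(s_{pred}\) grows as \(X_f\) moves away from \(\bar{X}\):

```python
import numpy as np

x = np.array([2, 3, 5, 8, 12, 15, 20, 25, 30, 35], dtype=float)
y = np.array([800, 950, 1400, 2100, 2900, 3300, 3900, 4300, 4600, 4800], dtype=float)

N = len(x)
b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)
s = np.sqrt(np.sum(e ** 2) / (N - 2))   # standard error of the regression

def s_pred(x_f):
    """Standard error of a point prediction at x_f."""
    return s * np.sqrt(1 + 1 / N + (x_f - x.mean()) ** 2 / ((N - 1) * x.var(ddof=1)))

for x_f in [x.mean(), 20.0, 40.0, 60.0]:
    print(x_f, s_pred(x_f))   # grows as x_f moves away from the mean of x
```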


Now we have covered what determines a ‘more accurate’ estimation.

Based on these basics, I will try to deal with Logistic Regression and Multivariate Regression later.