Related Post: MLE - Maximum Likelihood Estimator
Image: A spoiler for today's topic
In this first post of the DS / ML Back to the Basics series, I would like to go over what it means when we do "modeling," and how we evaluate whether a model is good or bad. I will use the "mean estimator" to explain the statistical implications, and cover the Simple Linear Regression model in a follow-up post. "Mean estimation" may seem like a very basic concept, but I will focus on the statistical implications of estimation and the standard error.
What does it mean to do "modeling"?
When I was an undergraduate student, I learned about a variety of machine learning models that were transforming the field. I was trained to implement these models in Python and apply them to sample datasets. It felt like magic: you write code, input training data, train the model, and it predicts the result.
However, after working as a full-time product manager in the IT industry for three years, I discovered that in real-world business scenarios, simply applying these models rarely works as expected.
Through my experience in ML projects in an AI lab, collaborating with data scientists as a product manager, and studying statistical analysis in depth, I now understand that this happens when predictive models are used without understanding the implications behind them. Modeling is not about building a magical pipeline that produces perfect outputs just because the model is the latest and most advanced. It's about 1) effectively predicting real-world statistical patterns, 2) using a sample dataset that ideally reflects those patterns, and 3) minimizing possible errors.
Let's think of modeling as "estimation"
We talked a little bit about this in the previous post - MLE - Maximum Likelihood Estimator. We identify certain patterns from observed phenomena. While we cannot be certain where these patterns originate or what causes them, if we can estimate the patterns accurately, it may be possible to make valid predictions about unseen data.
Example of "mean" prediction.
We learned a lot about sample means and population means in statistics class, even back in high school. Let's think of the sample mean in terms of "estimation." For example, let's say we want to answer the question, "How heavy are elephants?" This might be a difficult question because individual elephants vary in weight. However, we definitely know that elephants are much heavier than dogs or cats.
The best way to describe "on average, how heavy are elephants?" would be to take the mean weight of all elephants. Since it's impossible to measure the entire population, we use a sample. Here, we can say the "sample mean" is an estimator for the "population mean," implying that the "population mean" is the actual (or ideal) real-world pattern describing the average weight of elephants, and we are estimating it through the "sample mean."
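A small R simulation makes this concrete. The elephant-weight numbers below are hypothetical, chosen only for illustration: we treat a simulated population as the "real world," then estimate its mean from a limited sample.

```r
set.seed(42)

# Hypothetical "population" of elephant weights in kg (illustrative numbers)
population <- rnorm(100000, mean = 4000, sd = 500)
mu <- mean(population)  # the population mean we normally cannot observe

# In practice, we only get a limited sample
sample_weights <- sample(population, 100)
y_bar <- mean(sample_weights)  # the sample mean, our estimator for mu

round(c(population_mean = mu, sample_mean = y_bar))
```

The sample mean lands close to the population mean, even though it only saw 100 of the 100,000 "elephants."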
Based on what we learned in statistics classes, assuming that weight is normally distributed, we can say:
- Random variable for the weight of elephants: $Y \sim N(\mu, \sigma^2)$
- Sample statistics: the sample mean $\bar{Y}$ and the sample standard deviation $s_Y$
Here, $\mu$ is the population mean (i.e., the actual average weight of elephants as determined by some real-world pattern that we cannot observe directly), while $\bar{Y}$ is an estimator for what we're interested in.
So, what we have done here is to estimate the statistic of a real-world pattern using a limited sample dataset. This is what we do in every "estimation" or modeling exercise. But we still have one question: can we just accept this estimator? How precise is it?
Evaluating estimation.
What should we examine to assess the precision of an estimation? Let's think of it this way: look at the two cases below. Both cases have the same mean; however, the data in the second case is much more spread out, meaning it has a higher variance around the mean.
This implies that if the standard deviation is high, data points are more likely to fall far from our estimated mean. (In a normal distribution, approximately 95% of the data falls within 2 standard deviations of the mean.) Therefore, if the standard deviation of our estimator is low, we can say our prediction is more efficient at predicting unseen future data!
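The two cases can be reproduced with a quick R sketch (the means and standard deviations are made up for illustration): both distributions share the same mean, but the spread determines how wide the "within 2 standard deviations" band is.

```r
set.seed(1)

# Two samples with the same mean but very different spread
narrow <- rnorm(10000, mean = 50, sd = 2)
wide   <- rnorm(10000, mean = 50, sd = 20)

c(mean_narrow = mean(narrow), mean_wide = mean(wide))  # both ~50

# Roughly 95% of each sample falls within mean +/- 2*sd,
# but that band has width 8 for the narrow case vs. 80 for the wide case
mean(abs(narrow - 50) <= 2 * 2)
mean(abs(wide - 50) <= 2 * 20)
```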
Then, what does the standard deviation of the mean estimator look like? - The true implication of the CLT
Let's go back to statistics class. Now it's time for the Central Limit Theorem to do its job. The Central Limit Theorem states that, if the sample size $n$ is large enough, $\bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ approximately (and exactly, for any $n$, if $Y$ itself is normally distributed). Here, we see that the variance (or standard deviation) of our mean estimator is 1) determined by the population variance, and 2) divided by the sample size. What are the implications of this?
- With a larger sample size, our estimator (the sample mean) will approach the population mean.
- The standard deviation (actually, the standard error - I'll explain this later) will also diminish, resulting in a more precise estimation.
- However, if the population standard deviation itself is large, meaning $\sigma$ is large, there will be limitations to the precision of mean estimation.
These implications are more apparent if we consider it this way: as our sample size approaches the population size, we are effectively sampling the entire population. Therefore, as the sample size increases, the sample mean approaches the population mean. This is why we need large amounts of "BIG" data.
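We can check the $\sigma/\sqrt{n}$ scaling with a short simulation (the numbers are arbitrary): draw many samples of size $n$, compute the mean of each, and compare the spread of those sample means with the CLT prediction.

```r
set.seed(7)
sigma <- 10
n <- 25

# Draw 5000 samples of size n and record each sample mean
sample_means <- replicate(5000, mean(rnorm(n, mean = 0, sd = sigma)))

# Empirical sd of the sample means vs. the CLT prediction sigma / sqrt(n)
c(empirical = sd(sample_means), theoretical = sigma / sqrt(n))
```

Both values come out close to 2: the spread of the mean estimator really is the population spread shrunk by $\sqrt{n}$.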
However, this doesn't mean our estimation will always be precise just because we use a large dataset. What if the "true pattern" of the real world has a huge variance? For example, what if what we're trying to estimate is not elephant weight, but the "income of people living in California"? California has a large population with a wide range of incomes, so even if we take the population mean, it may not adequately represent a "pattern." Therefore, for this type of prediction problem, using a sample mean predictor would not work well, even if we take a large sample from California.
Why does "dividing into segments" work better when estimating statistics?
So, what can we do in these kinds of cases? Product managers or analysts might suggest "segmenting the data." The reason this approach works is that by dividing into segments, we can reduce the population variance.
If we divide the "population" into different segments, e.g., by age group, it's more likely that people within the same age group will have a similar range of income. By defining segmented populations, we can reduce the variance of the population statistic. Although the maximum sample size is reduced (since the population is now divided into segments), this segmentation allows us to make better predictions by effectively narrowing down the population variance.
Let's try this with a data example. There's an older dataset that contains wage data along with demographic information, the ISLR Wage dataset, which we'll analyze using R.
install.packages("ISLR", repos = "http://cran.us.r-project.org")
library(ISLR)
# Load and view Wage dataset
data("Wage")
head(Wage)
This is what the data looks like. Let's take only the data from 2009 (the most recent year in this dataset), and divide it into age groups.
# Keep only the most recent year (2009)
df <- Wage[Wage$year == max(Wage$year), ]

# Bucket ages into four groups: <20, 20-30, 31-60, 61+
df$age_group <- ifelse(df$age < 20, "Immature",
                ifelse(df$age <= 30, "Twenties",
                ifelse(df$age <= 60, "Middle aged",
                                     "Senior")))
df <- df[, c('age_group', 'wage')]
head(df)
Now, let's compare the total variance with the variances for each age group!
hist(df$wage, main = paste("All age wage - variance:", round(var(df$wage), 2)),
xlab = "Wage", ylab = "Frequency")
# Immature
immature_age = df[df$age_group == "Immature", ]$wage
hist(immature_age, main = paste("Under 20 wage - variance:", round(var(immature_age), 2)),
xlab = "Wage", ylab = "Frequency")
# Twenties
twenties_age = df[df$age_group == "Twenties", ]$wage
hist(twenties_age, main = paste("20~30 wage - variance:", round(var(twenties_age), 2)),
     xlab = "Wage", ylab = "Frequency")
# Middle aged
middle_age = df[df$age_group == "Middle aged", ]$wage
hist(middle_age, main = paste("31~60 wage - variance:", round(var(middle_age), 2)),
     xlab = "Wage", ylab = "Frequency")
We can see that the variance for the total population was 1811, a relatively large value. However, by splitting into age groups, we could reduce the variance significantly for the under-20 and 20s age groups. If we use the entire population for mean estimation, the population standard deviation of wage is $\sigma_Y = \sqrt{1811} \approx 42.56$. In comparison, the standard deviation of the 20s age group segment is $\sigma_{Y,\text{20s}} \approx 29.32$.
In a normal distribution, 95% of the data is located within $\text{mean} \pm 2\sigma$. Therefore, if we use the mean as the estimator, the 95% interval in the first case extends about 85 above and below the mean, while in the second case it's much narrower, at around 59.
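As a sanity check, these half-widths follow directly from the variances reported in the histograms (1811 overall, and a standard deviation of about 29.32 for the 20s segment):

```r
# Standard deviations recovered from the variances reported above
sigma_all <- sqrt(1811)   # ~42.56, all ages
sigma_20s <- 29.32        # 20s segment

# Half-width of the ~95% interval: 2 standard deviations on each side
half_width_all <- 2 * sigma_all
half_width_20s <- 2 * sigma_20s

round(c(all_ages = half_width_all, twenties = half_width_20s), 1)
```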
Interestingly, the middle-aged group has a much larger variance than the total population. This suggests that we selected an inappropriate segment for mean prediction. This makes sense, as the 31 to 60 age range is quite broad for generalizing income patterns. To exaggerate, both myself two years from now and Elon Musk would fall into this group!
We may reduce the variance further by splitting this group more narrowly or by removing outliers when conducting estimation.
There's still one more problem - $\sigma$ is unknown. Here comes the Standard Error.
I believe many readers have heard of the "Standard Error." Now it's time to introduce what it is and why it's important. We just discussed why the "Standard Deviation" of our estimator (in our example, $\bar{Y}$, the sample mean of elephant weights) is relevant. According to the Central Limit Theorem, we know how to determine this value: $\bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$. However, the population standard deviation, $\sigma$, is an unknown value. Therefore, we use the sample standard deviation as an estimator for it: \(\sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}} \approx s_{\bar{Y}} = \frac{s_{Y}}{\sqrt{n}}\) Therefore, the "Standard Error" is an estimator of the "standard deviation of the value we are trying to estimate."
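In R, the standard error of the mean is simply the sample standard deviation divided by $\sqrt{n}$; a minimal sketch on simulated data (the numbers here are hypothetical):

```r
set.seed(3)
y <- rnorm(400, mean = 4000, sd = 500)  # a hypothetical sample of size 400

n <- length(y)
s_y <- sd(y)               # sample sd, our estimate of the unknown sigma
se_mean <- s_y / sqrt(n)   # standard error of the sample mean

c(sample_sd = s_y, standard_error = se_mean)
```

With $n = 400$, the standard error is about a twentieth of the sample standard deviation, exactly the $1/\sqrt{n}$ shrinkage from the formula.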
Remember:
- If we have a lower standard error, it means our prediction is likely more accurate.
- To achieve a smaller standard error, we need a larger N (sample size), but
- There are limitations if the âStandard Deviation (or variance) of the population itselfâ is large.
Also, since $n$ in the denominator of the standard error formula is square-rooted, as $n$ increases, the effect of reducing the standard error gradually diminishes.
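A quick consequence of that square root: to cut the standard error in half, you need four times the sample size, not twice.

```r
# Standard error as a function of sample sd and sample size
se <- function(s, n) s / sqrt(n)

se(40, 100)  # 4
se(40, 400)  # 2: quadrupling n only halves the standard error
```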
Gradually diminishing effect of increasing N
Let's plot this using the entire Wage dataset. The following code plots how the standard error of the mean wage changes as the sample size increases:
sample_sizes <- seq(10, 2910, by = 100)
standard_error <- function(s_size) {
  temp_data <- Wage[sample(nrow(Wage), s_size), ]
  sd(temp_data$wage) / sqrt(s_size)
}
standard_errors <- sapply(sample_sizes, standard_error)
plot(sample_sizes, standard_errors, main = 'Standard error of mean wage as sample size N increases',
     col = "blue", xlab = 'Sample size', ylab = 'Standard error')
lines(sample_sizes, standard_errors, col = "red", lwd = 2)
We can see that initially, the standard error decreases quite dramatically as we increase the sample size. However, after the sample size reaches about 500, the improvement slows down and becomes negligible. The pattern resembles the curve of $1/\sqrt{n}$.
Coming up next: let's look into the "Simple Linear Regression" model
Now we have a better understanding of the true implications of "estimation" and the "standard error." But I hear someone saying...
It's just taking an average for estimation! It's not REAL modeling!
However, the definition of "modeling" is building a function that takes input data and outputs an estimate of the target variable. So, the mean estimator could be defined as: \(\hat{y} = f(x) = \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i\)
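In code, this "model" is a function that ignores its input and always returns the training mean; a minimal sketch (the function names are mine, for illustration):

```r
# "Train" a mean-estimator model: the fitted f(x) returns the
# training mean for any input x
fit_mean_model <- function(train_y) {
  y_bar <- mean(train_y)
  function(x) y_bar  # prediction is constant, regardless of x
}

model <- fit_mean_model(c(10, 20, 30))
model(999)  # 20 - the same prediction for any input
```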
It could also be considered a type of model. BUT, I totally get what you mean. In the next post, I will use the example of Linear Regression to explore how Standard Error works in evaluating regression models and how we can properly utilize regression models for better prediction. (Remember, models are not magic boxes!)