Linear Regression Model
Terms
Dependent Variable
- Dependent variables are measured or tested and depend on independent variables.
Independent Variable
- Independent variables are manipulated and believed to have an effect on the dependent variable.
Statistically Significant Effect
- A statistically significant effect is unlikely to have occurred by chance alone, so it is likely to be real rather than the result of random variation.
Linear Regression

- A statistical model that describes the impact of a unit change in one variable (independent) on the value of a target variable (dependent), when the relationship between them is linear.
- TYPES:
  - 1 Independent Variable: Simple Linear Regression.
  - 2 or more Independent Variables: Multiple Linear Regression.
Simple Linear Regression
- The formula for simple linear regression is:
  $$Y = \beta_0 + \beta_1 X + \epsilon$$
- Where:
  - $Y$ is the dependent variable (the outcome you are trying to predict).
  - $X$ is the independent variable (the predictor).
  - $\beta_0$ is the y-intercept of the regression line.
  - $\beta_1$ is the slope of the regression line.
  - $\epsilon$ is the error term (the difference between the observed and predicted values).
Multiple Regression Formula
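- The formula for multiple regression is:
  $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon$$
Where:
- $Y$ is the dependent variable.
- $X_1, \dots, X_p$ are the independent variables.
- $\beta_0$ is the intercept.
- $\beta_1, \dots, \beta_p$ are the coefficients (slopes) of the independent variables.
- $\epsilon$ is the error term.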
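As a minimal sketch of how a multiple regression can be fitted, the example below simulates data with two predictors and recovers the coefficients with NumPy's least-squares solver. The sample size, the coefficient values in `beta_true`, and the noise level are made-up assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 500

# Two hypothetical predictors and made-up "true" coefficients (assumptions).
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, 2.0, -0.5])  # [intercept, coefficient of X1, coefficient of X2]
eps = rng.normal(scale=0.1, size=n)     # error term

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(n), X])
y = X_design @ beta_true + eps

# Least-squares estimate of all coefficients at once.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)
```

With low noise and a few hundred observations, `beta_hat` should land close to `beta_true`.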
Linear Regression Estimation
Ordinary Least Squares (OLS)
- Technique for estimating the unknown parameters of a linear regression model, which are then used to predict the response/dependent variable.
- OLS aims to find the best-fitting regression line by minimizing the sum of squared errors.
- The unobserved error terms are estimated by the residuals (observed minus fitted values), and OLS minimizes the sum of squared residuals.
- The OLS regression model for simple linear regression is given by:
  $$Y = \beta_0 + \beta_1 X + \epsilon$$
- where:
  - $Y$ is the dependent variable (the outcome you are trying to predict).
  - $X$ is the independent variable (the predictor).
  - $\beta_0$ is the y-intercept of the regression line.
  - $\beta_1$ is the slope of the regression line.
  - $\epsilon$ is the error term (the difference between the observed and predicted values).
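For simple linear regression, the estimates that minimize the sum of squared residuals have a well-known closed form: the slope is the sample covariance of X and Y divided by the sample variance of X, and the intercept makes the line pass through the point of means. The sketch below applies it to a tiny made-up data set; the function name `ols_simple` and the numbers are illustrative assumptions.

```python
import numpy as np

def ols_simple(x, y):
    """Closed-form OLS estimates (intercept, slope) for one predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: forces the line through (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Made-up observations (assumption, for illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b0, b1 = ols_simple(x, y)
residuals = y - (b0 + b1 * x)   # estimates of the error terms
sse = np.sum(residuals ** 2)    # the quantity OLS minimizes
print(b0, b1, sse)              # intercept ≈ 0.05, slope ≈ 1.99
```

Any other line through these points, for example one with a slightly different slope, yields a larger sum of squared residuals.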
Linear Regression Assumptions (5)
A1 - Linearity
- The model is linear in parameters, i.e. the relationship between the dependent variable and the independent variables is linear.
Solution
- Check by plotting residuals against fitted values. If the plot shows a systematic pattern, the relationship is not linear: the linearity assumption is violated and the estimates will be biased.
- If the assumption is violated, use more flexible models (e.g. tree-based models).
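A residuals-versus-fitted plot is normally inspected by eye. To keep the example free of plotting dependencies, the sketch below swaps in a rough numeric stand-in: it measures how much of the residual variance a quadratic in the fitted values can explain. The helper `curvature_r2` and the simulated data are hypothetical, not a standard test.

```python
import numpy as np

def fit_line(x, y):
    """Fit y = intercept + slope * x by least squares; return fitted values and residuals."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    return fitted, y - fitted

def curvature_r2(fitted, residuals):
    """Share of residual variance explained by a quadratic in the fitted values
    (hypothetical diagnostic: near 0 suggests no leftover curvature)."""
    coeffs = np.polyfit(fitted, residuals, deg=2)
    explained = np.polyval(coeffs, fitted)
    ss_res = np.sum((residuals - explained) ** 2)
    ss_tot = np.sum((residuals - residuals.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=300)
noise = rng.normal(scale=0.5, size=300)

y_linear = 1.0 + 2.0 * x + noise                 # linearity holds
y_curved = 1.0 + 2.0 * x + 1.5 * x ** 2 + noise  # linearity violated

r2_linear = curvature_r2(*fit_line(x, y_linear))
r2_curved = curvature_r2(*fit_line(x, y_curved))
print(r2_linear, r2_curved)  # small for the linear data, large for the curved data
```

When the straight-line model is adequate, the residuals carry no structure and the value stays near zero; a large value corresponds to the curved band one would see in the plot.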
A2 - Random Sample
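- All observations in the sample are randomly selected.
- Check by plotting the residuals and taking their mean. If the mean is not around 0, OLS is biased (it systematically over- or under-predicts the dependent variable) and the random sample assumption is violated.
- When the model includes an intercept, the in-sample OLS residuals average to zero by construction, so this check is most informative on data that was not used for fitting.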
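A minimal sketch of the residual-mean check, assuming simulated data and a random train/test split (both are illustrative choices, not part of the notes). It shows the in-sample mean residual, which is zero up to rounding whenever an intercept is fitted, next to the mean residual on held-out observations.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Made-up data that satisfies the model (assumption).
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# Random split: 300 observations for fitting, 100 held out.
idx = rng.permutation(n)
train, test = idx[:300], idx[300:]

slope, intercept = np.polyfit(x[train], y[train], deg=1)

resid_train = y[train] - (intercept + slope * x[train])
resid_test = y[test] - (intercept + slope * x[test])

mean_train = resid_train.mean()  # zero up to floating-point rounding
mean_test = resid_test.mean()    # should be close to 0 for a random sample
print(mean_train, mean_test)
```

A held-out mean far from zero would indicate systematic over- or under-prediction, for example because the held-out observations were not drawn the same way as the fitting sample.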
A3 - Exogeneity
- Each independent variable is uncorrelated with the error terms.
- (aka) independent variables are not affected by the error terms in the model
- (or) independent variables are assumed to be determined independently of the errors in the model.
- It is a key assumption: it allows the estimated coefficients to be interpreted as the true causal effect of the independent variables on the dependent variable.
- Variables that satisfy this assumption are exogenous; otherwise they are endogenous.
- Endogeneity: independent variables correlate with the error terms.
- Causes:
  - Omitted Variable Bias
    - an important determinant of the dependent variable is not included in the model (and it is correlated with an included independent variable).
  - Reverse Causality
    - the dependent variable affects the independent variable.
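Omitted variable bias can be illustrated with a small simulation. In the sketch below, a hypothetical confounder `z` drives both `x` and `y`; all coefficients and the sample size are made-up assumptions. Leaving `z` out pushes its effect into the error term, which is then correlated with `x`.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients with an intercept prepended to X."""
    X_design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

rng = np.random.default_rng(3)
n = 5000

z = rng.normal(size=n)                            # confounder (hypothetical)
x = 0.8 * z + rng.normal(size=n)                  # x is correlated with z
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)  # true effect of x is 2.0

beta_short = ols(x, y)                       # z omitted: x is endogenous
beta_full = ols(np.column_stack([x, z]), y)  # z included

print(beta_short[1])  # noticeably above 2.0
print(beta_full[1])   # close to 2.0
```

Under these assumptions the slope of the short regression converges to 2 + 3 · Cov(x, z) / Var(x) = 2 + 2.4 / 1.64 ≈ 3.46, so it cannot be read as the causal effect of `x`.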
A4 - Homoscedasticity/Homogeneity of Variance
- The variance of all error terms is constant.
- Importance: needed to use standard statistical techniques and make inferences about the parameters of the model. If the errors are not homoscedastic (heteroscedasticity), the results of these techniques are misleading/invalid.
- Heteroscedasticity: the variance of the error terms is NOT constant.
  - Coefficients (might be) accurate.
  - But the corresponding standard errors, Student's t-tests, p-values, and confidence intervals are not accurate.
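The sketch below illustrates this point on simulated data whose error spread grows with `x` (an assumption made for the example). The slope estimate stays close to the true value, while the classical standard error differs from a heteroscedasticity-robust one. The robust formula used here is the basic White (HC0) sandwich estimator, named as one common choice rather than the only remedy.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

x = rng.uniform(1, 10, size=n)
eps = rng.normal(scale=0.05 * x ** 2)  # error standard deviation grows with x
y = 1.0 + 2.0 * x + eps                # true slope is 2.0 (assumption)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

XtX_inv = np.linalg.inv(X.T @ X)

# Classical standard errors: assume one common error variance.
sigma2 = resid @ resid / (n - X.shape[1])
se_classic = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) standard errors: let each observation have its own variance.
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(beta_hat[1])                  # still close to 2.0
print(se_classic[1], se_robust[1])  # classical SE understates the slope's uncertainty here
```

Because t-tests, p-values, and confidence intervals are all built from the standard error, using the classical value in this setting would make the slope look more precisely estimated than it is.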
A5 - No Perfect MultiCollinearity
- There are no exact linear relationships between the independent variables.
- MultiCollinearity occurs due to high correlation between two or more independent variables and leads to highly unstable/unreliable estimates of the model.
- Perfect MultiCollinearity: the independent variables are perfectly correlated with each other, so one variable can be perfectly predicted from the others.
  - The estimated coefficients of the model are infinite/undefined.
Solution
- Remove 1 or more variables to avoid having collinearity in the model.
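One way to detect perfect multicollinearity is to compare the rank of the design matrix with its number of columns. The sketch below builds a hypothetical predictor `x3` as an exact linear combination of `x1` and `x2`, then shows that dropping it restores full column rank. The variable names and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 - x2  # exact linear combination of x1 and x2

X_bad = np.column_stack([np.ones(n), x1, x2, x3])  # 4 columns, one redundant
X_ok = np.column_stack([np.ones(n), x1, x2])       # redundant column removed

rank_bad = np.linalg.matrix_rank(X_bad)
rank_ok = np.linalg.matrix_rank(X_ok)

print(X_bad.shape[1], rank_bad)  # rank is lower than the number of columns
print(X_ok.shape[1], rank_ok)    # full column rank
```

When the rank is lower than the number of columns, the matrix X'X cannot be inverted, so the OLS coefficients are not uniquely defined.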
Pros and Cons of Linear Regression Model
Pros
- Simple Model
- Computationally Efficient
- High interpretability
- Able to handle missing data once incomplete rows are dropped or values are imputed (OLS itself needs complete rows)
Cons
- Overly simplistic
- Many Assumptions
- Assumed Linearity
- Sensitive to outliers