Multiple Linear Regression

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

In Multiple Linear Regression, there is more than one independent variable (IV).

Assumptions of a Linear Regression:

  1. Linearity
  2. Homoscedasticity
  3. Multivariate Normality
  4. Independence of Errors
  5. Lack of multicollinearity

Check that all of these assumptions hold before applying a linear regression model.
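Assumption 5 can be checked numerically with the variance inflation factor (VIF), a standard multicollinearity diagnostic not covered in the notes above. A minimal NumPy sketch on synthetic data, where all variable names and numbers are illustrative:

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2) from
    regressing column j on all the other columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)               # independent of x1
x3 = x1 + 0.01 * rng.normal(size=200)   # nearly a copy of x1 -> collinear
X = np.column_stack([x1, x2, x3])

# A common rule of thumb flags VIF > 10 as problematic multicollinearity
print([round(vif(X, j), 1) for j in range(3)])
```

Here x1 and x3 show very large VIFs while x2 stays near 1, so one of the collinear pair would be dropped before fitting the regression.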

Dummy Variables

There are 2 types of variables:

  1. Categorical (also known as a qualitative variable): in statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property, e.g. the state in which a start-up operates
  2. Numerical (also known as a quantitative variable): takes numeric values

When we need to add categorical variables to our equations in the ML models, we need to create dummy variables. Steps involved:

  1. Go through the column and find the number of different categories you have.
  2. For each category, create a new column.
  3. For each row, put 1 in the column matching its category (e.g. New York) and 0 in the others (e.g. California).
  4. All the dummy variable columns work as switches (1 or 0).
$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 D_1$$

In the ML model, include only one of the two dummy variables; including both leads to the Dummy Variable Trap.

Dummy Variable Trap

We never include both dummy variables, because that would duplicate a variable: $D_2 = 1 - D_1$. The phenomenon where one or several independent variables in a linear regression predict another is called multicollinearity. As a result, the model cannot distinguish the effect of $D_1$ from the effect of $D_2$, so it won't work properly. Mathematically, you cannot have the constant and both dummy variables in your model at the same time.

Always omit one dummy variable.
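In code, the dummy columns and the omission of one of them can be produced in one step; a sketch with pandas `get_dummies` (the `State` values are just the examples from above):

```python
import pandas as pd

states = pd.Series(["New York", "California", "New York", "California"])

# One column per category; drop_first=True omits one dummy (here "California"),
# which avoids the dummy variable trap since D_California = 1 - D_NewYork.
dummies = pd.get_dummies(states, drop_first=True).astype(int)
print(dummies["New York"].tolist())   # [1, 0, 1, 0]
```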

P-value

Example: if you toss a coin many times and it keeps landing heads, at some point you begin to suspect it is not a fair coin.

Hypothesis Testing
$H_0$: This is a fair coin (Null Hypothesis)
$H_1 \mbox{ or } H_a$: This is not a fair coin (Alternative Hypothesis)
The p-value is the probability of observing a result at least as extreme as the one you got, assuming the null hypothesis is true (i.e. in the "null hypothesis universe").
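For the coin example, the p-value can be computed exactly with a two-sided binomial test; a self-contained pure-Python sketch (the counts are illustrative):

```python
from math import comb

def two_sided_p(heads, tosses):
    """Probability, assuming H0 (fair coin), of any outcome at least as
    unlikely as the observed number of heads."""
    pmf = [comb(tosses, k) / 2**tosses for k in range(tosses + 1)]
    return sum(p for p in pmf if p <= pmf[heads] + 1e-12)

# 9 heads in 10 tosses: p = 22/1024 ~ 0.0215 < 0.05, so in the null-hypothesis
# universe this outcome is rare enough that we reject H0 and call the coin unfair.
print(two_sided_p(9, 10))
```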

Building a Model (Step-by-Step)

As the number of columns (factors) grows, we need to decide which columns to keep and which to throw out. We throw out columns because GARBAGE IN = GARBAGE OUT.

5 Methods of Building Models:

  1. All-in: Throw in all your variables. Used when:
    • Prior knowledge: you know that these variables are your true predictors
    • You have to (e.g. an external framework or requirement dictates the variables)
    • You are preparing for Backward Elimination
  2. Backward Elimination:
    • S1: Select a significance level to stay in the model (e.g. $\alpha = 0.05$)
    • S2: Fit the full model with all possible predictors
    • S3: Consider the predictor with the highest p-value. If $p > \alpha$, go to S4, otherwise go to FIN
    • S4: Remove the predictor
    • S5: Fit the model without this variable. (You need to re-build [re-fit] the model; you cannot just drop the variable from the old fit)
    • Go back to S3: repeat until even the variable with the highest p-value is below your $\alpha$
    • FIN: Model is ready!
  3. Forward Selection:
    • S1: Select a significance level to enter the model (e.g. $\alpha = 0.05$)
    • S2: Fit all simple regression models $y \sim x_n$; select the one with the lowest p-value
    • S3: Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have
    • S4: Consider the predictor with the lowest p-value. If $p < \alpha$, go to S3, otherwise go to FIN
    • FIN: Model is ready! (Keep the previous model, not the last one you fitted)
  4. Bidirectional Elimination:
    • S1: Select a significance level to enter and to stay in the model, e.g. $\alpha_{enter} = 0.05, \alpha_{stay} = 0.05$
    • S2: Perform the next step of Forward Selection (new variables must have $p < \alpha_{enter}$ to enter)
    • S3: Perform ALL steps of Backward Elimination (old variables must have $p < \alpha_{stay}$ to stay). Go back to S2
    • S4: Stop when no new variables can enter and no old variables can exit
    • FIN: Model is ready!
  5. Score Comparison (All Possible Models):
    • S1: Select a criterion of goodness of fit (e.g. Akaike Criterion)
    • S2: Construct all possible regression models: $2^N - 1$ total combinations, where $N$ is the number of columns (e.g. 10 columns already give $2^{10} - 1 = 1023$ models)
    • S3: Select the one with the best criterion
    • FIN: Model is Ready!

Note: Stepwise Regression refers to Nos. 2, 3 & 4 or just No. 4

Importing the libraries

Importing the dataset

Encoding categorical data

We do not need feature scaling in linear regression models, because each feature has a coefficient and the coefficients compensate for features with larger values than others.
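One common way to encode a categorical column like `State` is scikit-learn's `ColumnTransformer` with a `OneHotEncoder`; a sketch in which the column layout and all the numbers are hypothetical:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: three numeric features followed by the categorical State
X = np.array([
    [100.0, 20.0, 30.0, "New York"],
    [110.0, 25.0, 35.0, "California"],
    [120.0, 22.0, 33.0, "Florida"],
], dtype=object)

# One-hot encode column 3 (State); pass the numeric columns through unchanged
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])], remainder="passthrough"
)
X = np.array(ct.fit_transform(X))
print(X.shape)   # (3, 6): 3 dummy columns + the 3 original numeric columns
```

The dummy columns come first, in alphabetical category order (California, Florida, New York).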

Splitting the dataset into the Training set and Test set

Training the Multiple Linear Regression model on the Training set

The class that we import for multiple linear regression will automatically avoid the dummy variable trap, so we don't have to do it.

If we wanted to avoid the dummy variable trap manually:

X = X[:, 1:]  # drop the first dummy column to escape the dummy variable trap

We also don't have to search for the best feature subset ourselves using Backward Elimination etc., because the class takes care of it for us.

Predicting the Test set results
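The split / train / predict steps can be put together as follows, assuming scikit-learn and using synthetic data in place of the real dataset (the true coefficients 4, 2 and -3 are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: y = 4 + 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 4 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Training the model: fit() estimates b0 (intercept_) and the b_i (coef_)
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)
print(regressor.intercept_, regressor.coef_)   # close to 4 and [2, -3]
```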

Manual Backward Elimination

$$y = b_0 x_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

where $x_0 = 1$ (a column of ones that stands in for the intercept term).

OLS Not Working!!!