In Multiple Linear Regression, there is more than one independent variable (IV).
Assumptions of a Linear Regression:
1. Linearity
2. Homoscedasticity
3. Multivariate normality
4. Independence of errors
5. Lack of multicollinearity

There are 2 types of variables: numerical (continuous) and categorical.
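One of the assumptions above, lack of multicollinearity, is commonly checked with variance inflation factors (VIF). A minimal sketch using statsmodels on synthetic data (the features and the VIF > 10 rule of thumb are illustrative, not from the dataset used below):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 * 0.95 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)                         # independent feature
X = np.column_stack([np.ones(100), x1, x2, x3])   # include an intercept column

# VIF > 10 is a common rule of thumb for problematic multicollinearity
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)  # x1 and x2 show large VIFs; x3 stays near 1
```

Here the near-duplicate pair x1/x2 gets flagged, while the independent x3 does not.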
When we need to add categorical variables to the equations in our ML models, we create dummy variables. Steps involved:
1. Create one new binary column per category.
2. In each row, put a 1 in the column for that row's category and 0 in the others.
In the ML model, never include all of the dummy variables; doing so leads to the Dummy Variable Trap.
We never include both dummy variables because one duplicates the information in the other: $D_2 = 1 - D_1$. The phenomenon where one or several independent variables in a linear regression predict another is called multicollinearity. Because of it, the model cannot distinguish the effect of $D_1$ from the effect of $D_2$, so it won't work properly. Mathematically, you cannot have the constant and both dummy variables in your model at the same time.
Always omit 1 dummy variable
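The "omit one dummy variable" rule can be applied directly when encoding. A small sketch with pandas `get_dummies` on a made-up `State` column (the values mirror the dataset used later):

```python
import pandas as pd

df = pd.DataFrame({'State': ['New York', 'California', 'Florida', 'New York']})

# drop_first=True keeps n-1 dummies; the dropped level ('California',
# alphabetically first) becomes the baseline category
dummies = pd.get_dummies(df['State'], drop_first=True)
print(dummies.columns.tolist())  # ['Florida', 'New York']
```

A row with 0 in every dummy column is then interpreted as the baseline category.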
Hypothesis Testing

Example: toss a coin and decide whether it is fair.

$H_0$: This is a fair coin (Null Hypothesis)

$H_1 \mbox{ or } H_a$: This is not a fair coin (Alternative Hypothesis)

The p-value is the probability, assuming we live in the null-hypothesis universe, of observing a result at least as extreme as the one we got. A small p-value (commonly below 0.05) leads us to reject $H_0$.
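For the coin example, the p-value can be computed by hand from the binomial distribution. A sketch for one hypothetical outcome, 9 heads in 10 tosses (the numbers are illustrative):

```python
from math import comb

n, k = 10, 9
# Under H0 (fair coin), P(X = i) = C(n, i) / 2^n
p_at_least_9_heads = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
p_value = 2 * p_at_least_9_heads  # two-sided: 9+ heads or 9+ tails
print(p_value)  # 0.021484375, so we would reject H0 at the 5% level
```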
As the number of columns (factors) grows, we need to decide which columns to keep and which to throw out. We throw out columns because GARBAGE IN = GARBAGE OUT.
5 Methods of Building Models:
1. All-in
2. Backward Elimination
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison (all possible models)

Note: Stepwise Regression refers to Nos. 2, 3 & 4, or sometimes just No. 4
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('50_Startups.csv')
dataset.head()
| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.20 | 136897.80 | 471784.10 | New York | 192261.83 |
| 1 | 162597.70 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(X[:5])
[[165349.2 136897.8 471784.1 'New York'] [162597.7 151377.59 443898.53 'California'] [153441.51 101145.55 407934.54 'Florida'] [144372.41 118671.85 383199.62 'New York'] [142107.34 91391.77 366168.42 'Florida']]
We do not need feature scaling in linear regression models because each feature has its own coefficient, and the coefficient compensates for features with larger values than others.
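A quick check of that claim on synthetic data (the features and coefficients here are made up): rescaling a feature rescales its coefficient inversely, and the predictions are unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

raw = LinearRegression().fit(X, y)

X_scaled = X.copy()
X_scaled[:, 0] *= 1000  # blow up the first feature's scale
scaled = LinearRegression().fit(X_scaled, y)

# The coefficient shrinks by the same factor; predictions match exactly
print(raw.coef_[0] / scaled.coef_[0])                         # ~1000
print(np.allclose(raw.predict(X), scaled.predict(X_scaled)))  # True
```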
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X[:5])
[[0.0 0.0 1.0 165349.2 136897.8 471784.1] [1.0 0.0 0.0 162597.7 151377.59 443898.53] [0.0 1.0 0.0 153441.51 101145.55 407934.54] [0.0 0.0 1.0 144372.41 118671.85 383199.62] [0.0 1.0 0.0 142107.34 91391.77 366168.42]]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
The class that we import for multiple linear regression will automatically avoid the dummy variable trap, so we don't have to do it.
If we wanted to avoid the dummy variable trap manually:
X = X[:, 1:]
We also don't have to search for the best subset of features with Backward Elimination, etc., because the class takes care of this for us.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
LinearRegression()
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1))
[[103015.2 103282.38] [132582.28 144259.4 ] [132447.74 146121.95] [ 71976.1 77798.83] [178537.48 191050.39] [116161.24 105008.31] [ 67851.69 81229.06] [ 98791.73 97483.56] [113969.44 110352.25] [167921.07 166187.94]]
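A common way to quantify how close these predicted/actual pairs are is the $R^2$ score. A self-contained sketch using synthetic data in place of the startups dataset (the coefficients and noise level are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)

# R^2 = 1 means perfect predictions; 0 means no better than the mean
r2 = r2_score(y_test, regressor.predict(X_test))
print(round(r2, 3))  # close to 1 for this low-noise synthetic data
```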
In the formulation $y = b_0 x_0 + b_1 x_1 + \dots + b_n x_n$, we set $x_0 = 1$ so that $b_0$ acts as the intercept.
X[:5]
array([[0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
[1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
[0.0, 1.0, 0.0, 153441.51, 101145.55, 407934.54],
[0.0, 0.0, 1.0, 144372.41, 118671.85, 383199.62],
[0.0, 1.0, 0.0, 142107.34, 91391.77, 366168.42]], dtype=object)
import statsmodels.api as sm  # the formula API (statsmodels.formula.api) needs a formula string and a DataFrame
X = X[:, 1:] # Dummy Variable Trap
X[:5]
array([[0.0, 1.0, 165349.2, 136897.8, 471784.1],
[0.0, 0.0, 162597.7, 151377.59, 443898.53],
[1.0, 0.0, 153441.51, 101145.55, 407934.54],
[0.0, 1.0, 144372.41, 118671.85, 383199.62],
[1.0, 0.0, 142107.34, 91391.77, 366168.42]], dtype=object)
# statsmodels' OLS does not add an intercept automatically,
# so we prepend a column of 1s for x0
X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)
X[:5]
array([[1, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
[1, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
[1, 1.0, 0.0, 153441.51, 101145.55, 407934.54],
[1, 0.0, 1.0, 144372.41, 118671.85, 383199.62],
[1, 1.0, 0.0, 142107.34, 91391.77, 366168.42]], dtype=object)
# Matrix which contains only features which are optimal for the model
x_opt = X[:, [0, 1, 2, 3, 4, 5]]
x_opt[:5]
array([[1, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
[1, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
[1, 1.0, 0.0, 153441.51, 101145.55, 407934.54],
[1, 0.0, 1.0, 144372.41, 118671.85, 383199.62],
[1, 1.0, 0.0, 142107.34, 91391.77, 366168.42]], dtype=object)
# OLS - Ordinary Least Squares
# Note: sm.ols from statsmodels.formula.api needs a formula string and data;
# use sm.OLS from statsmodels.api with endog/exog arrays instead.
regressor_ols = sm.OLS(endog=y, exog=x_opt.astype(float)).fit()
print(regressor_ols.summary())