# Simple and Multiple Linear Regression in Python

Linear regression is generally used for predictive analysis. It is a linear approximation of a fundamental relationship between two or more variables.

## Main processes of linear regression

- Get sample data
- Design a model that works best for that sample
- Make predictions for the whole population

## Main uses of regression analysis

- Finding the strength of predictors
- Forecasting an effect
- Trend forecasting

## Some types of linear regression analysis

### Simple Linear Regression

One dependent variable (interval or ratio) and one independent variable (interval, ratio, or dichotomous).

### Multiple Linear Regression

One dependent variable (interval or ratio) and two or more independent variables (interval, ratio, or dichotomous).

### Logistic Regression

One dependent variable (dichotomous) and two or more independent variables (interval, ratio, or dichotomous).

### Ordinal Regression

One dependent variable (ordinal) and one or more independent variables (nominal or dichotomous).

### Multinomial Regression

One dependent variable (nominal) and one or more independent variables (interval, ratio, or dichotomous).

## Types of Variables in Linear Regression

In linear regression, there are two types of variables:

- Dependent Variable
- Independent Variable

Dependent variables are those which we are going to predict while independent variables are predictors.

Let’s briefly explain them with the help of an example.

$$y = F(x_1, x_2, x_3, \ldots, x_k)$$

In the above equation, $y$ is the dependent variable, which is a function of the independent variables $x_1$ to $x_k$.

The population formula of the simple linear regression model is given below:

$$y = \beta_0 + \beta_1 x_1 + \varepsilon$$

In the above equation, $y$ is the dependent variable, $\beta_0$ is the regression constant, $\beta_1$ is the coefficient that quantifies the effect of the independent variable on the dependent variable, $x_1$ is the sample data for the independent variable, and $\varepsilon$ is the error of estimation.

Let us take an example to understand this equation. Suppose income is the dependent variable $y$ and education is the independent variable $x_1$. We then say that income depends on education: more education is expected to lead to a higher income.

The error of estimation is therefore the difference between the observed income and the income that the regression predicted. On average, however, the error of estimation is zero.
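As a minimal sketch of the error of estimation, the snippet below fits a line by ordinary least squares to made-up numbers (the values are illustrative assumptions, not data from this article) and checks that the residuals of a fit with an intercept average to zero. The closed-form formulas for a single predictor are used so that the example needs no extra libraries.

```python
# Made-up illustrative data: education code (x) and income (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [12000.0, 17500.0, 25000.0, 29000.0, 37500.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form OLS estimates for a single predictor.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_mean - b1 * x_mean

# Error of estimation (residual) = observed value - predicted value.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / n

print(residuals)
print(mean_residual)  # zero up to floating-point rounding
```

Individual residuals can be positive or negative, but with an intercept in the model they cancel out on average.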

The simple linear regression equation estimated from the sample is given below:

$$\hat{y} = b_0 + b_1 x_1$$

## Difference between Regression and Correlation

| Regression | Correlation |
| --- | --- |
| Measures how one variable affects another variable | Measures the relationship between two variables |
| Fits a best line and estimates one variable on the basis of another | Shows the connection between two variables |
| The two variables play different roles (dependent and independent) | There is no distinction between dependent and independent variables |
| One way: regressing y on x differs from regressing x on y | Symmetric: ρ(x, y) = ρ(y, x) |
| Represented by a line | Represented by a single point |
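The one-way versus symmetric distinction can be checked numerically. The sketch below uses made-up numbers and hypothetical helper names (`covariance`, `correlation`, `slope`) to show that the correlation is the same in both directions, while the regression slope changes when the roles of the two variables are swapped.

```python
import math

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.5, 5.0, 8.5, 9.0]

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    # Sample covariance of two equally long sequences.
    a_mean, b_mean = mean(a), mean(b)
    return sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b)) / (len(a) - 1)

def correlation(a, b):
    # Pearson correlation coefficient.
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))

def slope(predictor, response):
    # OLS slope when regressing `response` on `predictor`.
    return covariance(predictor, response) / covariance(predictor, predictor)

r_xy = correlation(x, y)
r_yx = correlation(y, x)
slope_y_on_x = slope(x, y)
slope_x_on_y = slope(y, x)

print(r_xy, r_yx)                  # identical: correlation is symmetric
print(slope_y_on_x, slope_x_on_y)  # different: regression is one way
```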

## Python Packages Installation

Several Python libraries are used in our practical example of linear regression.

To see the libraries installed with Anaconda, run the following command in Anaconda Prompt:

    C:\Users\Iliya>conda list

We can also install additional libraries in Anaconda with a command such as:

    C:\Users\Iliya>conda install numpy

Before we start the practical example of linear regression in Python, we briefly discuss the important libraries.

### NumPy

It is a Python library that lets us work with multidimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays.

### Pandas

It is a Python library for manipulating and analyzing data in tabular form.

### Matplotlib

It is a 2D plotting library for Python that is especially suited to visualizing NumPy computations.

### SciPy

It is an open-source Python library used for scientific and technical computing. It contains modules for optimization, linear algebra, integration, interpolation, signal and image processing, and statistics.

### Seaborn

It is a Python data visualization library based on Matplotlib. Seaborn offers a high-level interface for drawing attractive and informative graphics.

### Statsmodels

It is a Python package that lets users explore data, estimate statistical models, and perform statistical tests.

### Scikit-learn

It is a free software machine learning library for Python.
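To check from within Python which of these libraries are available, a small sketch like the following can be used. It relies only on the standard library module `importlib.metadata`. The helper name `installed_versions` is a hypothetical choice, and the package names are assumed to match the names under which the libraries are distributed.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names of the libraries discussed above.
packages = ["numpy", "pandas", "matplotlib", "scipy",
            "seaborn", "statsmodels", "scikit-learn"]

def installed_versions(names):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(packages)
for name, ver in versions.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```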

## Practical example of Simple Linear Regression

**Import the relevant libraries**

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

**Load the data**

Now we load the data, stored in .csv format in the same folder as the regression_example.ipynb file, and check what is inside the file, as shown in the figure.
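The name of the data file is not given here, so the following is only a sketch of the loading step. It reads a few made-up rows from an in-memory string in place of the real CSV file. The column names `code` and `salary` are the ones used later in this example, while the values are placeholders.

```python
import io

import pandas as pd

# Stand-in for the CSV file: made-up rows with the two columns used in this example.
csv_text = """code,salary
1,12000
2,17500
3,25000
4,29000
5,37500
"""

# With a real file, pass its path to pd.read_csv instead of the StringIO object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())  # first rows of the table
print(data.shape)   # (number of rows, number of columns)
```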

To display descriptive statistics of the data, we use the describe() method, as shown in the figure.

    data.describe()

Now we define the dependent and independent variables. In our example, **code** (allotted to each education) is the independent variable, whereas *salary* is the dependent variable.

    y = data['salary']   # dependent variable
    x1 = data['code']    # independent variable

To explore the data as a scatter plot, we first define the horizontal axis and then the vertical axis, as shown in the figure.

Now we add a constant, which means adding a new column that consists only of 1s.

    x = sm.add_constant(x1)
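To make the effect of adding a constant concrete, the sketch below builds the same kind of matrix by hand with NumPy: a column of 1s followed by the predictor. The predictor values are made up for illustration.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])  # made-up predictor values

# Design matrix: a column of 1s (for the intercept b0) followed by the predictor.
x = np.column_stack([np.ones(len(x1)), x1])
print(x)
```

The column of 1s is what allows the model to estimate an intercept: each prediction is b0 times 1 plus b1 times the predictor value.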

Fit the model with the Ordinary Least Squares (OLS) method, using the dependent variable ‘y’ and the independent variable ‘x’.

    results = sm.OLS(y, x).fit()
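As a rough illustration of what an OLS fit computes, the sketch below solves the same least-squares problem directly with NumPy. The data are made up so that they lie exactly on the line y = 2 + 3x, which makes the recovered coefficients easy to verify.

```python
import numpy as np

# Made-up data that lies exactly on the line y = 2 + 3*x.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x1

# Design matrix with a constant column of 1s.
X = np.column_stack([np.ones(len(x1)), x1])

# Least-squares solution: the coefficients that minimise the sum of squared residuals.
coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = coef
print(b0, b1)  # approximately 2.0 and 3.0
```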

Finally, we print a summary of the regression.

results.summary()

Now we create a scatter plot:

    plt.scatter(x1, y)

Then we define the regression equation:

    yhat = 5914.2857*x1 + 6466.6667

and plot the regression line against the independent variable, i.e. code (used for education):

    fig = plt.plot(x1, yhat, lw=4, c='orange', label='regression line')

Finally, we label the x-axis and y-axis and show the plot:

    plt.xlabel('Education', fontsize=20)
    plt.ylabel('Salary', fontsize=20)  # the label must be a quoted string
    plt.show()

Now look at the output in the figure below. This is the complete code:

    plt.scatter(x1, y)
    yhat = 5914.2857*x1 + 6466.6667
    fig = plt.plot(x1, yhat, lw=4, c='orange', label='regression line')
    plt.xlabel('Education', fontsize=20)
    plt.ylabel('Salary', fontsize=20)  # the label must be a quoted string
    plt.show()

**Interpret the Regression Results**

Now run the following lines of code to produce the regression results.

    x = sm.add_constant(x1)
    results = sm.OLS(y, x).fit()
    results.summary()

**Salary** is the dependent variable.

**R-squared** shows the fit of the model. Its values range from 0 to 1, and a higher value indicates a better fit. In our example, the R-squared value is 0.911.
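R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this for made-up numbers, which are illustrative assumptions and not the data behind the 0.911 reported here.

```python
# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [12000.0, 17500.0, 25000.0, 29000.0, 37500.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form OLS fit for a single predictor.
b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
      / sum((xi - x_mean) ** 2 for xi in x))
b0 = y_mean - b1 * x_mean
y_hat = [b0 + b1 * xi for xi in x]

# R-squared = 1 - (unexplained variation / total variation).
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 3))
```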

Simple linear regression is given by

$$\hat{y} = b_0 + b_1 x_1$$

In our example, **const**, i.e. $b_0$, is 5152.5157, and **code**, i.e. $b_1$, is 6240.5660.

**Std err** shows the level of accuracy of the coefficient: the lower the standard error, the more precise the estimate.

**P > | t |** is the p-value. A p-value below 0.05 is conventionally considered statistically significant.
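To show where the standard error and the t-statistic of the slope come from, here is a minimal sketch using the textbook formulas for simple linear regression and made-up data. The p-value is then read from Student's t distribution with n − 2 degrees of freedom, a step that is left out here because it needs a statistics library.

```python
import math

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [12000.0, 17500.0, 25000.0, 29000.0, 37500.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

s_xx = sum((xi - x_mean) ** 2 for xi in x)
b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
b0 = y_mean - b1 * x_mean

# Residual variance uses n - 2 degrees of freedom (two estimated coefficients).
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
residual_variance = ss_res / (n - 2)

se_b1 = math.sqrt(residual_variance / s_xx)  # standard error of the slope
t_b1 = b1 / se_b1                            # t-statistic for H0: slope = 0

print(se_b1, t_b1)
```

A small standard error relative to the coefficient gives a large t-statistic, which in turn corresponds to a small p-value.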

Therefore,

Salary = 5152.5157 + 6240.5660 × code

If code = 2, then the salary will be

Salary = 5152.5157 + 6240.5660 × 2 = 17633.6477

Hence, according to our model, the expected salary of an employee whose education is FA is 17633.65. This is the predictive power of linear regression.
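The same arithmetic can be wrapped in a small function. This is only a sketch: the function name `predict_salary` is a hypothetical choice, and the coefficients are the ones reported in the regression summary above.

```python
# Coefficients reported in the regression summary.
B0 = 5152.5157  # const
B1 = 6240.5660  # coefficient of code

def predict_salary(code):
    """Predicted salary for a given education code: b0 + b1 * code."""
    return B0 + B1 * code

print(round(predict_salary(2), 4))  # 17633.6477
```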

The null hypothesis of this test is that the coefficient is equal to zero ($H_0: \beta = 0$). For the intercept, a coefficient of zero means that the regression line crosses the y-axis at the origin, as shown in the figure.

    plt.scatter(x1, y)
    yhat = 5914.2857*x1 + 0  # intercept set to zero
    fig = plt.plot(x1, yhat, lw=4, c='red', label='regression line')
    plt.xlabel('Education', fontsize=20)
    plt.ylabel('Salary', fontsize=20)
    plt.xlim(0)  # start both axes at zero so the origin is visible
    plt.ylim(0)
    plt.show()

If $b_1 = 0$, then $\hat{y} = b_0$. Graphically, this means the variable has no effect on the prediction, so it would not be considered for the model.

We therefore conclude that in this case the regression line is horizontal and always passes through the intercept value.

## Practical example of Multiple Linear Regression

**Import the relevant libraries and load the data**

To display descriptive statistics of the data, we use the describe() method, as shown in the figure.

Now we define the dependent and independent variables. In our example, **code** (allotted to each education) and *year* are independent variables, whereas *salary* is the dependent variable.

To explore the data as a scatter plot, we first define the horizontal axis and then the vertical axis, as shown in the figure.
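A multiple regression with two independent variables can be sketched with NumPy as follows. The data are made up and generated exactly from salary = 4000 + 6000 × code + 500 × year, so the recovered coefficients are easy to check. They are not the values from this example's data set. With statsmodels, the analogous steps would be to pass both columns to `sm.add_constant` and then call `sm.OLS(y, x).fit()` as in the simple case.

```python
import numpy as np

# Made-up data generated exactly from salary = 4000 + 6000*code + 500*year.
code = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
year = np.array([1.0, 4.0, 2.0, 6.0, 3.0, 8.0])
salary = 4000.0 + 6000.0 * code + 500.0 * year

# Design matrix: constant column plus one column per independent variable.
X = np.column_stack([np.ones(len(code)), code, year])

# Least-squares estimates of the intercept and the two slopes.
coef, _, _, _ = np.linalg.lstsq(X, salary, rcond=None)
b0, b1, b2 = coef
print(b0, b1, b2)  # approximately 4000, 6000 and 500
```

Each slope is interpreted as the expected change in salary for a one-unit change in that variable while the other variable is held fixed.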

**Interpret the Regression Results**

Now we can easily compare the results of the regression models with one independent variable and with more than one.