In this article, we discuss machine learning with Python using a real-life example. Before turning to the example, let us recap the basic concept of Linear Regression.
Linear Regression is commonly used for predictive analysis. It is a linear approximation of the relationship between one dependent variable and one independent variable, or between one dependent variable and two or more independent variables.
The main steps of linear regression are to collect sample data, fit the model that works best for that sample, and use it to make predictions beyond that sample. Linear Regression is mainly used for trend forecasting, measuring the strength of predictors, and predicting an effect.
Several types of regression analysis are commonly used: Simple Linear Regression (one dependent variable and one independent variable), Multiple Linear Regression (one dependent variable and two or more independent variables), and Logistic Regression (a categorical dependent variable and one or more independent variables; despite its name, it is used for classification rather than for fitting a straight line to a numeric outcome).
Let’s start with Simple Linear Regression with one dependent variable and one independent variable.
Based on the data below, we will build a machine learning model that predicts the price of one kg of mangoes in upcoming years, i.e. 2020 and 2021.
year | mangoes_price (in Rs.) |
2011 | 40 |
2012 | 50 |
2013 | 55 |
2014 | 60 |
2015 | 65 |
2016 | 70 |
2017 | 75 |
2018 | 80 |
2019 | 90 |
We can represent the values in the table above as a scatter plot and then draw the straight line that best fits the points, as shown in the figure.
Many different lines can be drawn through these points, but we select the one for which the total error is lowest.
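The total error can be calculated as the sum of the squared differences between each actual price and the price predicted by the line: total error = Σ (actual_price − predicted_price)².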
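As a rough sketch of how the lowest-error line is chosen, the snippet below compares the sum of squared errors for three candidate lines on the mango data. The candidate slopes and the helper name sum_squared_error are illustrative assumptions, not part of the original notebook; the slope 17/3 (about 5.67) is the one scikit-learn reports later in this article.

```python
# Data from the table above: year and price of 1 kg of mangoes (in Rs.)
years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
prices = [40, 50, 55, 60, 65, 70, 75, 80, 90]


def sum_squared_error(m, b):
    """Sum of squared differences between actual and predicted prices."""
    return sum((price - (m * year + b)) ** 2
               for year, price in zip(years, prices))


# Hypothetical candidate lines as (slope, intercept) pairs.
# Each line passes through the point (2015, 65); only the slope differs.
candidates = {
    "slope 5": (5.0, 65 - 5.0 * 2015),
    "slope 17/3": (17 / 3, 65 - (17 / 3) * 2015),
    "slope 6": (6.0, 65 - 6.0 * 2015),
}

errors = {name: sum_squared_error(m, b) for name, (m, b) in candidates.items()}
for name, err in errors.items():
    print(f"{name}: total squared error = {err:.2f}")

# The best-fit line is the candidate with the smallest total squared error.
best = min(errors, key=errors.get)
print("Lowest total error:", best)
```

Among these three candidates, the line with slope 17/3 has the smallest total squared error, which is why a least-squares fit settles on it.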
Recall the equation of a straight line from high school mathematics, y = mx + b. The price of mangoes can therefore be represented by the following equation.
Mangoes_price = m × year + b
Here, m is the slope (gradient) and b is the intercept.
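For a single independent variable, m and b can also be computed directly with the standard least-squares formulas. The sketch below does this in plain Python on the table data. It is an illustration of the arithmetic, not the code scikit-learn runs internally, and the variable names are assumptions.

```python
# Data from the table above
years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
prices = [40, 50, 55, 60, 65, 70, 75, 80, 90]

n = len(years)
mean_year = sum(years) / n
mean_price = sum(prices) / n

# slope m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
numerator = sum((x - mean_year) * (y - mean_price)
                for x, y in zip(years, prices))
denominator = sum((x - mean_year) ** 2 for x in years)
m = numerator / denominator

# intercept b = mean_y - m * mean_x, so the line passes through the means
b = mean_price - m * mean_year

print("slope m:", m)
print("intercept b:", b)
```

The slope comes out at about 5.67 and the intercept at about -11353.33, so on average the price rises by roughly Rs. 5.67 per year over this period.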
Now let’s start coding in Python. First we import the libraries we need: pandas (for manipulating and analysing tabular data), numpy (for working with multidimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on them), matplotlib (a 2D plotting library for Python that works well with NumPy data) and sklearn (formally scikit-learn, a library for data mining and data analysis), as shown below.
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
Load the dataset:
Now we load the dataset, i.e. mangoes_price.csv, which is already placed in the same folder as the Simple Linear Regression.ipynb file, and display its contents as shown below.
df = pd.read_csv('mangoes_price.csv')
df
We can also represent that data frame as a scatter plot as shown here.
%matplotlib inline
plt.xlabel('year')
plt.ylabel('mangoes_price')
plt.scatter(df.year, df.mangoes_price, color='blue', marker='.', linewidth=5)
The purpose of plotting the data points on a scatter plot is to check for a linear relationship between the variables. If such a relationship exists, we can use the Linear Regression model.
In this scenario, there is a roughly linear relationship between year and mangoes_price because the price of mangoes increased steadily over time. Before creating the linear model, we create a new data frame by dropping the mangoes_price column, because the linear model expects the features as a 2-D array.
new_df = df.drop('mangoes_price',axis='columns')
new_df
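To make the 2-D requirement concrete, the sketch below rebuilds the data frame from the table (instead of reading mangoes_price.csv, so that it runs on its own) and compares the shapes of the feature table and the target column. The names features_2d, features_alt and target_1d are illustrative assumptions.

```python
import pandas as pd

# Same values as the table above, built inline so the example is self-contained
df = pd.DataFrame({
    "year": [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
    "mangoes_price": [40, 50, 55, 60, 65, 70, 75, 80, 90],
})

# Dropping the target column leaves a DataFrame: 9 rows by 1 column (2-D)
features_2d = df.drop("mangoes_price", axis="columns")

# Selecting with a list of column names gives the same 2-D result
features_alt = df[["year"]]

# Selecting a single column by name gives a 1-D Series
target_1d = df["mangoes_price"]

print(features_2d.shape)  # (9, 1)
print(target_1d.shape)    # (9,)
```

The features need the (rows, columns) shape even when there is only one column, while the target can stay one-dimensional.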
We also store the prices of mangoes in a separate variable and check it, like this
mangoes_price = df.mangoes_price
mangoes_price
To train the model, we create an object of the LinearRegression class and call its fit() method like this
reg_model = linear_model.LinearRegression()
reg_model.fit(new_df,df.mangoes_price)
We will predict the price of mangoes for the year 2020; the price for 2021 can be predicted in the same way.
reg_model.predict([[2020]])
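Both years can also be predicted in one call. The following is a minimal sketch that rebuilds the data inline and passes the new years as a data frame with the same column name used for training; in recent scikit-learn versions this should avoid the feature-name warning that a plain nested list can trigger when the model was fitted on a data frame. The name future_years is an assumption.

```python
import pandas as pd
from sklearn import linear_model

# Same values as the table above, built inline so the example is self-contained
df = pd.DataFrame({
    "year": [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
    "mangoes_price": [40, 50, 55, 60, 65, 70, 75, 80, 90],
})

reg_model = linear_model.LinearRegression()
reg_model.fit(df[["year"]], df["mangoes_price"])

# One row per year to predict, using the training column name "year"
future_years = pd.DataFrame({"year": [2020, 2021]})
predictions = reg_model.predict(future_years)

for year, price in zip(future_years["year"], predictions):
    print(year, round(float(price), 2))
```

With this data the model predicts roughly Rs. 93.33 for 2020 and Rs. 99.00 for 2021.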
Now we check by hand how the model arrived at this value. To do so, we read the slope (coefficient) and intercept like this
reg_model.coef_
reg_model.intercept_
As we already know, y = mx + b, where ‘m’ is the slope and ‘b’ is the intercept. Putting the coefficient and intercept into this equation gives the same price for one kg of mangoes in 2020 that our model has already predicted (about Rs. 93.33), as shown below
2020*5.66666667 + (-11353.333333333334)
This confirms that our linear model works as expected. Now we check how well it fits the data,
reg_model.score(new_df,mangoes_price)
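The model fits the data very well: score() returns the coefficient of determination (R²), which is about 0.988 here. This means the line explains roughly 98.8% of the variation in price on the training data; it is a measure of fit rather than classification accuracy.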
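The value returned by score() is the coefficient of determination, R² = 1 − (residual sum of squares / total sum of squares). As a hedged check, the sketch below recomputes it in plain Python from the slope and intercept quoted above; the variable names are assumptions.

```python
# Data from the table above
years = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
prices = [40, 50, 55, 60, 65, 70, 75, 80, 90]

# Slope and intercept as reported by the fitted model earlier in the article
m = 5.66666667
b = -11353.333333333334

predicted = [m * year + b for year in years]
mean_price = sum(prices) / len(prices)

# Residual sum of squares: error left over after fitting the line
ss_res = sum((actual - pred) ** 2 for actual, pred in zip(prices, predicted))

# Total sum of squares: variation of the prices around their mean
ss_tot = sum((actual - mean_price) ** 2 for actual in prices)

r2 = 1 - ss_res / ss_tot
print("R squared:", round(r2, 4))
```

Because this R² is computed on the same data the model was trained on, it says how closely the line follows the known prices, not how reliable the forecasts for future years will be.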
Now we load a csv file that contains only years (and no mangoes prices) and generate a list of price predictions for those years, like this
year_df = pd.read_csv("year.csv")
year_df
price = reg_model.predict(year_df)
price
year_df['mangoes_price']=price
year_df
A comparison of the actual and predicted prices of mangoes during the last five years, i.e. 2015 to 2019, is given below.
S # | Year | Actual Price per kg of mangoes (in Rs.) | Predicted Price per kg of mangoes (in Rs.) |
1 | 2015 | 65 | 65.00 |
2 | 2016 | 70 | 70.67 |
3 | 2017 | 75 | 76.33 |
4 | 2018 | 80 | 82.00 |
5 | 2019 | 90 | 87.67 |
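The predicted column can be reproduced independently of scikit-learn. The sketch below swaps in NumPy's polyfit with degree 1, which also performs a least-squares straight-line fit, and lists actual against predicted prices for 2015 to 2019. Values are rounded to two decimals, so the last digit may differ from a truncated figure. The name comparison is an assumption.

```python
import numpy as np

# Data from the first table in the article
years = np.array([2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
                 dtype=float)
prices = np.array([40, 50, 55, 60, 65, 70, 75, 80, 90], dtype=float)

# A degree-1 polynomial fit returns the coefficients as [slope, intercept]
m, b = np.polyfit(years, prices, 1)

# Collect (year, actual price, predicted price) for the last five years
comparison = []
for year, actual in zip(years, prices):
    if year >= 2015:
        predicted = m * year + b
        comparison.append((int(year), int(actual), round(float(predicted), 2)))

for row in comparison:
    print(row)
```

The gap between actual and predicted price is largest for 2019 (about Rs. 2.33), a reminder that the line only approximates the data.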
Lastly, we save this result in a new csv file named price_prediction.csv.
year_df.to_csv("price_prediction.csv")
As the saying goes, “Practice makes perfect”. Here are two problem statements for you to work through to get a firm grip on this technique.
Problem Statement No.1:
You are required to build a Regression Model and predict the price of Lux Soap in the upcoming year i.e. 2020. Download the file lux_price.csv
Problem Statement No.2:
You are required to build a Regression Model and predict the per capita income of the citizens of a country in the previous years (1990 & 1994). Download the file country_income.csv