Logistic Regression

Despite its name, Logistic Regression is a classification algorithm, not a regression algorithm. It predicts discrete outcomes (such as 0 or 1, True or False, Yes or No) from a given set of independent variables.

Logistic Regression estimates the probability that an observation belongs to a class and turns that probability into a binary prediction. The dependent variable must therefore be discrete/categorical.
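To make the binary outcome concrete, here is a minimal sketch of the idea behind the model: a weighted sum of the inputs is passed through the sigmoid function to get a probability, which is then cut at a threshold. The weights, bias, feature values and the helper name `predict_label` are hypothetical and chosen only for illustration; they are not taken from the model trained later in this post.

```python
import math

def sigmoid(z):
    """Map any real number z to a value in the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one sample.

    The weights and bias are assumed to be given; a real model learns them.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical weights for two standardized features (illustration only)
weights = [2.0, 1.0]
bias = -1.0

prob, label = predict_label([1.5, 0.5], weights, bias)
print("Probability:", round(prob, 3), "Predicted class:", label)
```

A probability at or above the threshold is reported as class 1, anything below it as class 0.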

Dataset

We will use a simple dataset to implement this algorithm. It contains five columns: User ID, Gender, Age, Estimated Salary and Purchased. The Purchased column holds 0 or 1, where 1 denotes that the customer purchased the SUV.

Download the dataset here.

So let’s begin here…

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

Load Data

data = pd.read_csv("suv.csv")
data.head(10)

	User ID	Gender	Age	EstimatedSalary	Purchased
0	15624510	Male	19	19000	0
1	15810944	Male	35	20000	0
2	15668575	Female	26	43000	0
3	15603246	Female	27	57000	0
4	15804002	Male	19	76000	0
5	15728773	Male	27	58000	0
6	15598044	Female	27	84000	0
7	15694829	Female	32	150000	1
8	15600575	Male	25	33000	0
9	15727311	Female	35	65000	0

print("Number of customers: ", len(data))

Number of customers: 400

Analyzing Data

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
User ID            400 non-null int64
Gender             400 non-null object
Age                400 non-null int64
EstimatedSalary    400 non-null int64
Purchased          400 non-null int64
dtypes: int64(4), object(1)
memory usage: 15.7+ KB

Customers who purchased the SUV

sns.countplot(x='Purchased', data = data)


Customers who purchased the SUV based on Gender

sns.countplot(x='Purchased', hue = 'Gender', data = data)


Graph for age of customers

data['Age'].plot.hist()


Graph for Estimated Salary of Customers

data['EstimatedSalary'].plot.hist()


Customers who purchased the SUV based on Age

plt.figure(figsize = (5,5))
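# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(data[data['Purchased']==1]['Age'], kde=True, stat='density')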


plt.figure(figsize = (20,10))
sns.barplot(x=data['Age'],y=data['Purchased'])


Customers who purchased the SUV based on Estimated Salary

plt.figure(figsize = (5,5))
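# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(data[data['Purchased']==1]['EstimatedSalary'], kde=True, stat='density')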


plt.figure(figsize = (20,7))
sns.lineplot(x=data['EstimatedSalary'],y=data['Purchased'])


Preprocessing

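# Encode Gender as a single 0/1 column (Male); dtype=int keeps 0/1 output on newer pandas, which otherwise returns True/False
Gender = pd.get_dummies(data['Gender'], drop_first = True, dtype = int)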
Gender.head(5)

	Male
0	1
1	1
2	0
3	0
4	1

data = pd.concat([data, Gender], axis = 1)

Dropping the User ID and Gender columns

data.drop(['User ID', 'Gender'], axis = 1, inplace = True)
data.head()

	Age	EstimatedSalary	Purchased	Male
0	19	19000	0	1
1	35	20000	0	1
2	26	43000	0	0
3	27	57000	0	0
4	19	76000	0	1

Dependent and Independent variables

X = data.drop('Purchased', axis = 1)
y = data['Purchased']

Train and Test data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
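As a rough illustration of what the scaling step does, the sketch below standardizes one feature by hand: the mean and standard deviation are computed from the training values only, and the same two numbers are then reused for the test values. The helper names `fit_scaler` and `transform` and the tiny value lists are hypothetical; the ages are simply borrowed from the first rows of the table above. It assumes the population standard deviation, which is what StandardScaler uses.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, std

def transform(values, mean, std):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

# Toy split for illustration only (not the real 80/20 split)
train_ages = [19, 35, 26, 27]
test_ages = [32]

mean, std = fit_scaler(train_ages)                 # statistics come from the training set only
scaled_train = transform(train_ages, mean, std)
scaled_test = transform(test_ages, mean, std)      # test set reuses the training statistics

print("Scaled train:", [round(v, 2) for v in scaled_train])
print("Scaled test:", [round(v, 2) for v in scaled_test])
```

Reusing the training statistics on the test set keeps information about the test data out of the preprocessing step, which is why the code above calls fit_transform on X_train but only transform on X_test.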

Define Model

model = LogisticRegression(solver = 'liblinear')

Fit Model

model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

Predictions

predictions = model.predict(X_test)

Classification Report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.87      0.83      0.85        48
           1       0.76      0.81      0.79        32

   micro avg       0.82      0.82      0.82        80
   macro avg       0.82      0.82      0.82        80
weighted avg       0.83      0.82      0.83        80

Confusion Matrix

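The confusion matrix is a 2x2 table of the four possible outcomes of a binary prediction. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so the first row holds the true negatives and false positives and the second row holds the false negatives and true positives. The diagonal counts the correct predictions.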

print("Confusion Matrix: \n",confusion_matrix(y_test, predictions))

Confusion Matrix:
 [[40  8]
 [ 6 26]]
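The numbers in the classification report can be recovered from this matrix. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and uses the four counts printed above; the variable names are only illustrative.

```python
# Counts copied from the confusion matrix printed above
cm = [[40, 8],
      [6, 26]]

tn, fp = cm[0]   # actual class 0: true negatives, false positives
fn, tp = cm[1]   # actual class 1: false negatives, true positives

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the predicted buyers, how many actually bought
recall = tp / (tp + fn)      # of the actual buyers, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy: ", round(accuracy, 3))
print("Precision (class 1): ", round(precision, 2))
print("Recall (class 1): ", round(recall, 2))
print("F1 (class 1): ", round(f1, 2))
```

The precision, recall and F1 values here correspond to the row for class 1 in the classification report.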

Accuracy

print("Accuracy: ",accuracy_score(y_test, predictions))

Accuracy: 0.825