Logistic Regression is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (like 0 or 1, True or False, Yes or No) from a given set of independent variables.

Rather than predicting a continuous quantity, Logistic Regression models the probability that an observation belongs to a class and thresholds that probability to give a binary result, so the dependent variable it predicts should be discrete/categorical.
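At the core of the model is the sigmoid (logistic) function: a linear combination of the inputs is squashed into a probability between 0 and 1, and the class is chosen by thresholding that probability (typically at 0.5). A minimal sketch of the idea, using made-up weights purely for illustration:

import numpy as np

def sigmoid(z):
    # Map any real value to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

# Hypothetical weights, intercept and a single two-feature sample (not from the dataset below)
w, b = np.array([0.8, -0.3]), 0.1
x = np.array([1.5, 2.0])

p = sigmoid(np.dot(w, x) + b)   # predicted probability of class 1
label = int(p >= 0.5)           # threshold at 0.5 to get 0 or 1
print(p, label)                 # ~0.67, 1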

Dataset

We will be using a simple dataset to implement this algorithm. The dataset contains User ID, Gender, Age, EstimatedSalary and Purchased columns. The Purchased column holds 0 or 1, where 1 denotes that the customer purchased the SUV.

Download the dataset here.

So let’s begin here…

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

Load Data

data = pd.read_csv("suv.csv")
data.head(10)
    User ID  Gender  Age  EstimatedSalary  Purchased
0  15624510    Male   19            19000          0
1  15810944    Male   35            20000          0
2  15668575  Female   26            43000          0
3  15603246  Female   27            57000          0
4  15804002    Male   19            76000          0
5  15728773    Male   27            58000          0
6  15598044  Female   27            84000          0
7  15694829  Female   32           150000          1
8  15600575    Male   25            33000          0
9  15727311  Female   35            65000          0
print("Number of customers: ", len(data))

Number of customers: 400

Analyzing Data

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
User ID            400 non-null int64
Gender             400 non-null object
Age                400 non-null int64
EstimatedSalary    400 non-null int64
Purchased          400 non-null int64
dtypes: int64(4), object(1)
memory usage: 15.7+ KB

Customers who purchased the SUV

sns.countplot(x='Purchased', data = data)

[Plot: count of customers by Purchased (0 vs 1)]

Customers who purchased the SUV based on Gender

sns.countplot(x='Purchased', hue = 'Gender', data = data)

[Plot: count of customers by Purchased, split by Gender]

Graph for age of customers

data['Age'].plot.hist()

[Plot: histogram of customer Age]

Graph for Estimated Salary of Customers

data['EstimatedSalary'].plot.hist()

[Plot: histogram of EstimatedSalary]

Customers who purchased the SUV based on Age

plt.figure(figsize = (5,5))
# distplot is deprecated in recent seaborn releases; sns.histplot(..., kde=True) is the modern equivalent
sns.distplot(data[data['Purchased']==1]['Age'])

[Plot: Age distribution of customers who purchased]

plt.figure(figsize = (20,10))
sns.barplot(x=data['Age'],y=data['Purchased'])

[Plot: purchase rate by Age]

Customers who purchased the SUV based on Estimated Salary

plt.figure(figsize = (5,5))
# distplot is deprecated in recent seaborn releases; sns.histplot(..., kde=True) is the modern equivalent
sns.distplot(data[data['Purchased']==1]['EstimatedSalary'])

[Plot: EstimatedSalary distribution of customers who purchased]

plt.figure(figsize = (20,7))
sns.lineplot(x=data['EstimatedSalary'],y=data['Purchased'])

[Plot: purchase rate by EstimatedSalary]

Preprocessing

The Gender column is text, so it is converted into a numeric dummy variable. With drop_first = True only a single Male column is kept (1 for male, 0 for female), because a second dummy column would be redundant.

Gender = pd.get_dummies(data['Gender'], drop_first = True)
Gender.head(5)
   Male
0     1
1     1
2     0
3     0
4     1
data = pd.concat([data, Gender], axis = 1)

Dropping the User ID and Gender columns

data.drop(['User ID', 'Gender'], axis = 1, inplace = True)
data.head()
   Age  EstimatedSalary  Purchased  Male
0   19            19000          0     1
1   35            20000          0     1
2   26            43000          0     0
3   27            57000          0     0
4   19            76000          0     1

Dependent and Independent variables

X = data.drop('Purchased', axis = 1)
y = data['Purchased']

Train and Test data

The data is split 80/20 into training and test sets. The StandardScaler is fit on the training set only and then applied to the test set, so no information from the test data leaks into training.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Define Model

model = LogisticRegression(solver = 'liblinear')

Fit Model

model.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)
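
Once the model is fitted, its learned parameters can be inspected. The coefficients follow the column order of X (Age, EstimatedSalary, Male), so their signs show how each scaled feature pushes the prediction towards a purchase; the exact values depend on the fitted model:

print("Coefficients:", model.coef_)   # one weight per feature: Age, EstimatedSalary, Male
print("Intercept:", model.intercept_)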

Predictions

predictions = model.predict(X_test)
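
The same model can also score a single new customer. The sketch below uses hypothetical values (a 30-year-old male earning 87,000); the input must be scaled with the same StandardScaler before predicting:

new_customer = pd.DataFrame([[30, 87000, 1]],
                            columns=['Age', 'EstimatedSalary', 'Male'])  # hypothetical customer
new_scaled = sc.transform(new_customer)
print(model.predict(new_scaled))        # predicted class: 0 (won't purchase) or 1 (will purchase)
print(model.predict_proba(new_scaled))  # probabilities for class 0 and class 1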

Classification Report

print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       0.87      0.83      0.85        48
           1       0.76      0.81      0.79        32

   micro avg       0.82      0.82      0.82        80
   macro avg       0.82      0.82      0.82        80
weighted avg       0.83      0.82      0.83        80

Confusion Matrix

For binary classification the confusion matrix is a 2x2 table with four outcomes: true negatives, false positives, false negatives, and true positives. It shows not only how many predictions were correct but also where the model goes wrong.

print("Confusion Matrix: \n",confusion_matrix(y_test, predictions))
Confusion Matrix:
 [[40  8]
 [ 6 26]]
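
To make the four outcomes explicit, the matrix can be unpacked; ravel() flattens the 2x2 array row by row into true negatives, false positives, false negatives, and true positives:

tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print("True Negatives: ", tn)   # 40 non-buyers correctly predicted
print("False Positives:", fp)   # 8 non-buyers predicted as buyers
print("False Negatives:", fn)   # 6 buyers predicted as non-buyers
print("True Positives: ", tp)   # 26 buyers correctly predicted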

Accuracy

print("Accuracy: ",accuracy_score(y_test, predictions))

Accuracy: 0.825
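
As a sanity check, the same figure follows directly from the confusion matrix: accuracy is the share of correct predictions, (40 + 26) / 80 = 0.825.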