A Decision Tree is a supervised learning algorithm used mostly for classification problems. It works for both categorical and continuous dependent variables.
A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions.
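Intuitively, each internal node of the tree tests a condition and each leaf holds a final decision. A tiny hand-written sketch in Python (the rules and thresholds below are invented for illustration, not learned from data):

def recommend_drug(age, na_to_k):
    # Illustrative only: a decision tree is a cascade of condition checks.
    # These thresholds are made up for the example, not learned from data.
    if na_to_k > 15:        # test at the root node
        return 'drugY'
    elif age > 50:          # test at an internal node
        return 'drugB'
    else:
        return 'drugA'      # decision at a leaf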
What is Classification?
Classification is the process of dividing a dataset into different categories or groups by assigning a label to each record.
Common classification algorithms (see the sketch after this list) -
- Decision Tree
- Random Forest
- Naive Bayes
- KNN
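All four share the same scikit-learn interface, so they are easy to swap and compare. A minimal sketch (assuming scikit-learn is installed; none of these objects are used later in this post):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Each classifier exposes the same fit/predict/score API.
classifiers = {
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Naive Bayes': GaussianNB(),
    'KNN': KNeighborsClassifier(),
}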
Dataset
We will be using a simple dataset to implement this algorithm. It contains patient details in the columns Age, Sex, BP, Cholesterol, and Na_to_K, plus a Drug column taking the values drugA, drugB, drugC, drugX, and drugY. Using a Decision Tree, we will predict which drug should be given to a patient.
Download the dataset here.
So let’s begin here…
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import metrics
from io import StringIO  # sklearn.externals.six was removed in newer scikit-learn releases
import pydotplus
import matplotlib.image as mpimg
%matplotlib inline
Load Data
data = pd.read_csv('drug200.csv')
data.head()
| | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | drugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | drugY |
Dependent and Independent variables
X = data[['Age','Sex','BP','Cholesterol','Na_to_K']].values  # Independent variables
y = data['Drug'].values  # Dependent variable (kept 1-D, as sklearn expects)
# Encode the categorical features (Sex, BP, Cholesterol) as integers
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
X[:,1] = le_sex.transform(X[:,1])
le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW','NORMAL','HIGH'])
X[:,2] = le_BP.transform(X[:,2])
le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL','HIGH'])
X[:,3] = le_Chol.transform(X[:,3])
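One subtlety worth knowing: LabelEncoder assigns integer codes in alphabetical order of the classes, not in the order passed to fit(). You can verify the mapping through the classes_ attribute:

print(le_BP.classes_)    # ['HIGH' 'LOW' 'NORMAL'] -> HIGH=0, LOW=1, NORMAL=2
print(le_Chol.classes_)  # ['HIGH' 'NORMAL'] -> HIGH=0, NORMAL=1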
Train and Test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
Define Model
model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
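The "entropy" criterion measures how mixed the class labels are at a node; at each step the tree picks the split with the largest drop in entropy (the information gain), and max_depth=4 caps how deep this can go to limit overfitting. A minimal sketch of the entropy formula itself:

def entropy(labels):
    # Shannon entropy: -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

entropy(['drugY', 'drugY', 'drugY', 'drugX'])  # ~0.811 bits (mixed node)
entropy(['drugY', 'drugY', 'drugY', 'drugY'])  # 0.0 bits (pure node)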
Fit Model
model.fit(X_train,y_train)
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
Prediction
predTree = model.predict(X_test)
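As a quick sanity check before scoring, you can eyeball the first few predictions against the actual labels (the exact rows you see depend on the split):

print(predTree[0:5])
print(y_test[0:5])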
Accuracy
print("Decision Trees's Accuracy: ", metrics.accuracy_score(y_test, predTree))
Decision Trees’s Accuracy: 0.9666666666666667
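Accuracy alone can hide weak performance on individual classes. scikit-learn's classification_report adds per-drug precision, recall, and F1:

from sklearn.metrics import classification_report
print(classification_report(y_test, predTree))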
Visualize the Decision Tree
dot_data = StringIO()
filename = "drugtree.png"
featureNames = data.columns[0:5]
# class_names must be in sorted order to match model.classes_
tree.export_graphviz(model, feature_names=featureNames, out_file=dot_data,
                     class_names=np.unique(y_train), filled=True,
                     special_characters=True, rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img, interpolation='nearest')
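If you are on scikit-learn 0.21 or newer, the Graphviz/pydotplus route above is optional: tree.plot_tree draws the same tree with matplotlib alone. A sketch under that version assumption:

plt.figure(figsize=(20, 10))
tree.plot_tree(model, feature_names=list(featureNames),
               class_names=np.unique(y_train).tolist(), filled=True)
plt.show()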
If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will outperform a tree-based model.
If the relationship is highly non-linear and complex, a tree-based model will outperform a classical regression model.
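A small synthetic demonstration of that point (the sine-wave data below is invented purely for illustration):

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_demo = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, 200)

# On this non-linear pattern the straight line underfits; the tree does not.
print(LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo))                  # low R^2
print(DecisionTreeRegressor(max_depth=5).fit(X_demo, y_demo).score(X_demo, y_demo))  # close to 1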