In this article, we are going to see various numbers which we get as output after we implement a decision tree.
Before you can start reading this blog we expect you to have these prior understanding about how decision tree works, idea about categorical and numerical features, must know what is classification and what is regression in machine learning, and most importantly you need to know about information gain (entropy) and concepts about gini index/ impurity.
This article is not to teach you the entire concept about the decision tree, but it is to analyze the outputs after the decision tree and try to understand or replicate those numbers in the output.
Here the features are quite self-explanatory, in the original data source the name of the column term_deposit is y, but for our clear understanding, I have given the name of that feature as term_
deposit. By looking at this data we need to figure out based upon the features age, job, marital status whether a customer is going to open a term_deposit account with the bank or not.
By using all the data (only 10 records but it will be good for our explanation purpose), we have built a decision tree classifier and we have plotted the tree using an open-source graph/visualization tool called graphviz. We have used decision tree classifier from sklearn.tree.
We will use the gini impurity as the splitting criteria for our decision tree classification, and the formula for that is given below.
[1 – Sum(pi^2)], where i = 1 to n
Now the question in hand is how it decides how to split on which feature, and what is the gini score mentioned in the nodes. We will try to analyze each and every point in this output in detail.
In the beginning, it measures the overall impurity of the data and then try to decide which feature is going to divide the data into homogenous groups, so after the division, there will be less ambiguity for us to decide whether it will be class 0 or 1 (here we have considered if term_deposit is no then class 0 else class 1).
Initially, we have 10 records and out of that 5 records are with class 0 and 5 records are with class 1.
Now if we consider just to look at the data and try to guess if a new customer walks in and we need to guess whether she/he is going to open a term deposit account or not, it will be kind of a coin toss experiment, with equal probability for both head and tail (considering coin is not biased). So, the gini score of .5
If we want to calculate the impurity using gini values we can do it as given below.
1 – ((5/10)^2+(5/10)^2) = .5
5/10 = Because there are a total of 10 records and out of that 5 records have classes 0 and 5 have class 1.
Because of that on the root node, we have a gini score of .5
The next big question is how it will decide which of the features it will use for the first split. The next section is going to describe that. The Decision tree handles numerical and categorical features in different ways. Let us try to see for the numerical feature first (in our case Age is the numerical feature).
First, it sorts the numerical values in ascending order and then takes the average of consecutive values.
Please find the description below, we have sorted our numerical features in ascending order.
33, 33, 41, 42, 44, 47, 54, 55, 56, 58
After that, it calculates the gini score for all those average values.
I will show how it calculates for two values 33 and with the same process, you can calculate for the rest of the average values.
Out of the 10 records, there are two records for which Age is less than or equal to 33. For both these records in our data set class is 0. Now if we need to calculate gini values for them it will be like given below.
(2/2)^2 = 1
Now there are 8 records for which Age is greater than 33 and out of that 3 records have class 0 and 5 records have class 1.
(3/8)^2 = 0.140625
(5/8)^2 = 0.390625
Now if we will add these values ((3/8)^2+(5/8)^2) = 0.53125
And gini value for the records for which Age is less than or equal to 33 is 1. Now we are going to multiply the values with the number of occurrences of each class out of the total number of records. ((2/10)*1+(8/10)*.53125) = 0.625
In the same way, we can calculate for all the values for Age, you can see that in the section given below.
Age – 33
Age – 37
Age – 41.5
Age – 43
Age – 45.5
Age – 50.5
Age – 54.5
Age – 55.5
Number of records for Age <= 33 and Age <=37 is the same so their gini score also matches.
Now let us try to see how it calculates these scores for categorical features. I will explain for one feature and you can calculate for the rest.
Let us see the categorical feature job which has the following levels/categories.
Admin, blue collar, entrepreneur, management, services, technician and unknown.
So, for each of these categories, it is going to calculate the gini scores. Let us see for the admin category/level.
There is only one record with the admin job category and that belongs to class 1.
Out of the other 9 records, 5 belongs to class 0 and 4 belongs to class 1.
So ((5/9)^2+(4/9)^2) = 0.50617284
Then for the admin category in job 1 record for class 1 so ((1/1)^2) = 1
Now we will multiply individual class occurrences out f total number of records,
((9/10)* 0.50617284 +(1/10)*1) = 0.555555556
Let us see for all the other levels/categories in the job feature.
In the same way for the feature marital
So now we have seen gini score for al the features and from there we see the gini value of Age <= 37 is the highest with .625, so it will have the lowest impurity 0.375 (1 – .625), and it will be the feature-based upon which we will have the first split ( So now we see how we can calculate the impurity with gini scores).
If you refer the figure of the tree, we can see the first split was based upon Age <= 37, and there is the value in the right child which shows the value of the gini=.469
Now let us try to see from where it gets that value. Now for the left child in the above figure, we can see that there are two records for Age <= 37 and both of them belongs to class 0 (So that is a leaf node, a pure or homogeneous node on which no further split can happen), but for the other 8 records, 3 records are with class 0 and 5 records with class 1.
1 – [(3/8)^2+(5/8)^2] = .46875 (round it up to 3 decimal places it is .469)
So good going !!!!! till now we are getting all the values in our decision tree matching.
Let us try to see one more split, and apply our same logic and see if the values are matching or not.
Now we see that the next split was based upon Age <=57.0. If we will filter our data with the first split criteria (Age > 37) then we will have 8 records and that will form our base for the next step (just consider these 8 records as your new training set or data set). We have only calculated for the category Age-57 from the Age feature.
Age – 57
Job – Blue collar
Job – Entrepreneur
Now we can see that for the feature Age <= 57 gini score is the highest (so impurity is lowest) and the next split will happen with Age <= 57. (Which is the case if we see the figure given above).
There is only one record with Age > 57 and its class is 0 which we can see from the right child node.
Now in the left child node, we can see that gini scores with 0.408, and that is because there are 7 records with Age <= 57 and out of which 2 are for class 0 and 5 are for class 1.
So 1 – [(2/7)^2+(5/7)^2] = 0.408163 (If we round it up to 3 decimals we got the number matching with the above figure of .408)
With this, we can see that we can match the calculation which we got as the output from the decision tree classifier. In the same way, you can calculate and get the values for the entire tree (as in the tree figure we have shown at the beginning of our article)
Future Reading: We will try to come up with the same kind of blog for decision tree based upon information gain and also calculations for one of the most important features of decision tree i.e feature importance.