Decision trees are a common data mining technique for predicting a target value from several input features. To predict the output for a sample, we test it against a sequence of rules. We start at the root node of the tree and ask a series of questions about the sample's features. The interior nodes are labeled with questions, and the edges (branches) leaving them are labeled with the possible answers. Following the answers that match the sample's attributes, we eventually end at a particular leaf. Each leaf (terminal node) holds the output assigned to the samples that reach it.
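The traversal described above can be sketched in a few lines of Python. The tree here is hand-built and purely hypothetical (it borrows the feature names outlook, humidity and windy from the example later in this post); the `predict` helper and the dictionary layout are illustrative assumptions, not a standard API.

```python
# Hypothetical hand-built tree (an assumption for illustration):
# interior nodes ask about one feature, branches are keyed by the answer,
# and leaves are plain strings holding the predicted output.
tree = {
    "question": "outlook",
    "branches": {
        "sunny": {
            "question": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": {
            "question": "windy",
            "branches": {True: "no", False: "yes"},
        },
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following the branch that matches each answer."""
    while isinstance(node, dict):          # still at an interior node
        answer = sample[node["question"]]  # ask the node's question
        node = node["branches"][answer]    # follow the matching edge
    return node                            # reached a leaf: this is the output


sample = {"outlook": "sunny", "humidity": "normal", "windy": False}
print(predict(tree, sample))  # prints "yes"
```

Only the features that lie on the path from the root to the leaf are ever inspected, which is why the order of the questions matters.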
That is essentially how prediction works, but the real task is building the tree: how do we create it, and which attributes should we test first?
To answer these questions, we first need a way to tell whether a node in the decision tree is helpful or not. For this we use the concept of entropy.
Consider two coins, each flipped ten times:
Coin 1: HTHTHHHTTT
Coin 2: THTHTTHHHH
We want to measure how much information a single coin flip gives us. We look at two flips together because the measure should behave sensibly when events are combined: the information of the joint outcome of two independent events must equal the sum of the information of the individual events. For two fair coin flips, a particular joint outcome has probability 0.25, so the information of 0.25 must equal the information of 0.5 plus the information of 0.5. The logarithm has exactly this property, so the information of an event with probability p is measured as log2(1/p): log2(1/0.25) = 2 = 1 + 1 = log2(1/0.5) + log2(1/0.5).
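As a quick check of this additivity requirement, here is a minimal sketch that assumes information is measured as log2(1/p), the measure consistent with the entropy calculation in this post. The function name `information` is illustrative.

```python
import math


def information(p):
    """Information, in bits, of an event that occurs with probability p."""
    return math.log2(1 / p)


p_single = 0.5                 # one fair flip landing on a given side
p_joint = p_single * p_single  # two independent fair flips: 0.25

print(information(p_single))   # 1.0
print(information(p_joint))    # 2.0
# Additivity: the joint event carries the sum of the individual informations.
print(math.isclose(information(p_joint), information(p_single) + information(p_single)))  # True
```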
Entropy is the unpredictability in a set of events and is defined as the expected value of information. In other words, it is the average information: multiply the information of each event by the probability of that event, and sum over all possible events.
Hence the entropy of a fair coin flip is 0.5 log2(2) + 0.5 log2(2) = 1.
So each flip of a fair coin gives us one bit of information.
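The same calculation can be written as a short Python sketch. The `entropy` function follows the definition above. The `empirical_entropy` helper is an added assumption: it estimates the probabilities from the observed frequencies in a sequence, which is not something the text itself does with the two coin sequences.

```python
import math
from collections import Counter


def entropy(probabilities):
    """Expected information in bits: sum of p * log2(1/p) over all events."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


def empirical_entropy(outcomes):
    """Entropy with probabilities estimated from observed frequencies (an assumption)."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return entropy(count / total for count in counts.values())


print(entropy([0.5, 0.5]))              # fair coin: 1.0 bit
print(empirical_entropy("HTHTHHHTTT"))  # Coin 1: 5 heads, 5 tails
print(empirical_entropy("THTHTTHHHH"))  # Coin 2: 6 heads, 4 tails
```

With equal counts of heads and tails, the first sequence gives exactly one bit. The second sequence is slightly lopsided, so its estimated entropy comes out a little below one bit (about 0.971).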
So in a decision tree, we put at the top the feature that reduces the entropy the most. Let's understand this with an example: we need to predict whether a person will go out to play or not, based on the following data.
Let us see how much information outlook provides.
Let us see what role temperature plays in determining the output.
Similarly, entropy can be calculated for humidity and windy. Now let's compare the results:
We can see that outlook reduces the entropy the most, so we choose it as the first feature to split on. The order of the remaining features is determined in the same fashion. These are the fundamentals of how a decision tree is built in machine learning.
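The feature comparison can be sketched as follows. The data table from the example is not reproduced in this text, so the rows below are a stand-in: the widely used 14-row "weather" toy dataset, which has the same four features. Treat the numbers as illustrative of the method rather than as the exact figures from the example. The names `information_gain`, `ROWS` and `COLUMNS` are likewise assumptions.

```python
import math
from collections import Counter, defaultdict

# Stand-in data (assumption): the common 14-row "weather" toy dataset,
# not necessarily identical to the table used in the example above.
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def entropy(labels):
    """Entropy in bits of a list of class labels, using observed frequencies."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())


def information_gain(rows, feature_index, label_index=-1):
    """Entropy of the labels minus the weighted entropy after splitting on one feature."""
    labels = [row[label_index] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_index]].append(row[label_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} {gain:.3f}")

best = max(gains, key=gains.get)
print("split first on:", best)
```

On this stand-in data, outlook has the largest gain (about 0.247 bits), followed by humidity, windy and temperature. To order the remaining features, the same calculation would be repeated on the subset of rows that falls under each branch of the chosen feature.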
Reference: Introduction to Data Science, Coursera