The decision tree is a machine learning algorithm that asks a series of questions in order to narrow in on a prediction for a given data point. The easiest way to understand the decision tree is to look at an example. So let's say we wanted to create a classification model to classify four types of animals: dogs, lizards, birds, and moose. And we want to do this by asking a series of questions about each of these animals to determine which animal it is. We might start by asking, does the animal have horns? Of the four possible animal classes that we have, we know that there's only one class that has horns, which is the moose. So if the answer to our question is yes, we can predict that the animal is a moose. However, if the answer to our question is no, we're not yet sure what animal it is. It could be a dog, a lizard, or a bird. So we then ask a second question: how many legs does the animal have? Does it have two legs or four legs? If the animal has two legs, we can predict it's a bird. If the animal has four legs, we're still not sure. It could be a dog, but it could also be a lizard. So let's ask one more question: what color is the animal? If it's green, we can predict that the animal is a lizard, and if it's brown, we can guess that the animal is a dog. By asking this series of questions, we've now formed a decision tree. And so if we were to take a new animal, we could map it through the tree and compute a predicted class as the output based on where it falls within the tree. So how do we choose the splits that form a decision tree? Our goal is to build the most efficient tree, or the one that uses the minimum number of splits to effectively separate the data into our target classes. To choose the splits, we define an objective function to help us select which split is the best one. And the objective function that we use is maximizing the information gain at the split. The information gain is equal to the decrease in impurity from splitting our data.
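The animal walkthrough above can be sketched as a few nested if/else checks. This is just a minimal illustration of how the fitted tree answers its questions; the function name and feature arguments are assumptions for the example, not part of any library:

```python
# A sketch of the animal decision tree described above, written as
# nested if/else checks (hypothetical feature names for illustration).
def classify_animal(has_horns: bool, num_legs: int, color: str) -> str:
    if has_horns:
        return "moose"    # only the moose class has horns
    if num_legs == 2:
        return "bird"     # two legs -> bird
    # four legs: distinguish lizard from dog by color
    return "lizard" if color == "green" else "dog"

print(classify_animal(True, 4, "brown"))   # moose
print(classify_animal(False, 2, "brown"))  # bird
print(classify_animal(False, 4, "green"))  # lizard
```

Mapping a new animal through the tree is just following these branches from the root down to a leaf.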
So impurity means how well mixed our data is at any point within our tree. If at a certain node in our tree our data is highly mixed between two classes, let's say class A and class B, our data has a high degree of impurity. If we create a split that effectively separates our data between A and B, such that at one leaf we have labels that are entirely class A, and at the other leaf we have data that's entirely class B, we've reduced our impurity all the way to zero. So we've created a pretty significant decrease in impurity. Or put another way, we've successfully increased the information gain coming from that split. So the idea of creating a decision tree is to find questions, or splits, that can reduce the mixture of the data, or effectively separate it out into the individual classes. When we're creating a split, we look at every possible way that we could split our data at that point in the tree. So we look at each of the features on which we could split. And for each of those features, we look at the different possible values that we could split on. And we choose the combination of the feature and the value to split on that results in the maximum information gain, or the biggest decrease in impurity from splitting on that feature and value. Once we've created the tree, how do we actually generate predictions out of the tree? The bottom nodes in a tree are called the leaves of the tree. To calculate the actual prediction or value for all the points which arrive at each leaf of the tree, we generally take an average if we're working with a regression model, or a majority vote if we're working with a classification model. So let's say we have some data that's mixed at a certain node between class A and class B. We make a split, and we split that down into two leaves. One leaf has a majority of class A, the other leaf has a majority of class B. The prediction for each leaf is whichever class has the majority.
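The impurity and information gain ideas above can be made concrete with a short computation. This sketch uses Gini impurity as the impurity measure; that's one common choice (the discussion above doesn't commit to a specific measure, entropy being another option):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def information_gain(parent, left, right):
    """Decrease in (size-weighted) impurity from splitting parent into left/right."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# A perfectly mixed node, split cleanly into two pure leaves:
parent = ["A", "A", "B", "B"]
print(gini(parent))                                      # 0.5 (highly mixed)
print(information_gain(parent, ["A", "A"], ["B", "B"]))  # 0.5 (impurity drops to zero)
```

Finding the best split is then just evaluating `information_gain` for every candidate feature/value combination and keeping the one with the largest value.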
So for leaf 1, the prediction for all the data points at that leaf would be class A. And for leaf 2, because it's majority class B, every point which arrives at that leaf would be predicted to be class B. One of the key things that we need to determine when we're creating a tree is the optimal depth of our tree. And this can really make a huge difference in terms of the predictive ability of your tree. So the depth of a tree is the maximum number of splits that occur within that tree. And it's actually a number that we can choose. We can decide to create a very shallow tree by limiting ourselves to at most one or two splits in our tree before we create the leaves. Or we can allow an unlimited number of splits in our tree, such that every leaf contains only one single point. Very shallow trees with a small number of splits tend to underfit the data. They're just too simple to really capture the patterns within the data and effectively split your data. On the other hand, trees that are very deep tend to overfit the data, because every single example or observation can end up at its own leaf. This might fit your training data set very well, but when you try to use it to generalize on new data, you'll find it's overfitting and it's not performing very well. Let's take an example to illustrate the impact that tree depth has on the complexity of the model and the resulting outputs it's able to generate. On the left side of this slide, we have a set of data organized along two features, x1 shown on the horizontal axis and x2 shown on the vertical axis. Our data is labelled into four classes, which are denoted by the color shading of the data points. Let's now try to fit a simple tree model to classify our data. We'll start by using a tree depth of one, meaning that we only have a single node or split in our model. We can see the result on the right side of the slide.
Our single-node tree model is using a single split on the value of x2 to split the data. Because it's using only a single split, it can split the data into only two classes. In reality, we have four classes in our data set. And so our simple model, using only a single split, is underfitting our data by predicting only two classes relative to the actual four classes that we have in our problem. If we now start to increase the depth of our tree, we can draw more complex decision boundaries, splitting our data along x1 and x2 as denoted by horizontal and vertical lines. And as a result, we can differentiate our data points and split them into more classes. As we increase our depth to two and then three, you can see that we're starting to be able to better capture the variability and the split of the data into each of the four classes. As we continue to increase the complexity of our model and add more and more layers, we can see that we're now slicing and dicing our decision space into many more partitions. While this may improve the accuracy on the training data, what happens when we move to a test data set, or use this more complex model to generate predictions on new data? We've fitted our model so tightly to the training set that we've created partitions based on noise that's found in the training set. The same noise is not always found in our test set or new data. And as a result, our model is fairly inflexible and often does not perform well when predicting on new data. We can also use trees for regression types of problems. In a regression problem, rather than taking a majority vote of the different samples which fall at a leaf, we take the mean of the target values of each of the samples at that leaf. So let's say we have a particular node that results in two leaves. Leaf 1 has four samples that fall at that leaf, with target values of 5, 9, 8, and 6. And leaf 2 has three samples: 4, 2, and 3.
To generate the prediction for the samples that fall at leaf 1, we add up the target values of the four samples and divide by the number of samples, which is four. And we calculate a prediction of seven, which is the predicted value for every sample which falls to this leaf in the tree. Likewise for leaf 2, we can calculate an average value of three. And so our prediction for every sample that falls to this leaf in the tree, based on the splits in the tree, is three. One of the key benefits of decision tree models is that they're highly interpretable. Because of this series of questions or splits, it's very easy to follow the order of the questions and to trace back how we got to a certain prediction given an input value. They also train very quickly, and because they're a nonparametric model, meaning they're not constrained to any specific template function, they can handle non-linear relationships very well. They also don't require scaling of our data or extra work encoding categorical variables before we feed them into our model. One of the challenges of individual decision tree models is that they're highly sensitive to the depth that we choose to grow our tree. If we choose a depth that's too small, we end up with a very simple model that doesn't do a very good job predicting on either our training or our test set data. One of the bigger problems is choosing a depth that's too deep, such that our model performs very well on the data on which it's been trained, but has actually overfit itself to that training data. And so when we try to use it to generalize and create predictions on new data, it really doesn't do a very good job.
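The depth sensitivity described here is easy to see in code. Below is a minimal sketch, assuming scikit-learn is available (the discussion above doesn't name a specific library): four classes laid out one per quadrant of the (x1, x2) plane, fit once with a depth-1 tree and once with no depth limit.

```python
from sklearn.tree import DecisionTreeClassifier

# Four classes, one per quadrant of the (x1, x2) plane.
X = [[-2, -2], [-1, -1], [-2, 2], [-1, 1],
     [ 2, -2], [ 1, -1], [ 2, 2], [ 1, 1]]
y = [0, 0, 1, 1, 2, 2, 3, 3]

# A depth-1 tree makes a single split, so it has only two leaves and
# can therefore predict at most two of the four classes: underfitting.
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(len(set(shallow.predict(X))))  # 2

# With no depth limit, the tree keeps splitting until every leaf is pure,
# so it fits the training data perfectly (and would overfit noisy data).
deep = DecisionTreeClassifier(random_state=0).fit(X, y)
print(deep.score(X, y))  # 1.0
```

In practice, the depth (or a related limit like minimum samples per leaf) is tuned on held-out data, since training accuracy alone will always favor the deepest tree.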