Imagine that you have a dataset with a list of predictors (independent variables) and a target (dependent) variable. Applying a decision tree such as J48 to that dataset then allows you to predict the target variable of a new record. J48 is the WEKA project's implementation of the C4.5 algorithm, Quinlan's successor to ID3 (Iterative Dichotomiser 3), and the RWeka package makes this nice work available in R. Let's use it on the iris dataset: the flower species will be our target variable, and we will predict it from measured features such as sepal and petal length and width.

> install.packages("RWeka")
> install.packages("party")

# Load both packages
> library(RWeka)
> library(party)

# We will use the 'iris' dataset from the 'datasets' package. It consists of
# 50 objects from each of three species of iris flowers (setosa, virginica
# and versicolor). For each object four attributes are measured: length and
# width of sepal and petal.
> str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# Using the J48 decision tree, we want to predict the target attribute
# "Species" from the lengths and widths of sepal and petal.
> m1 <- J48(Species ~ ., data = iris)
> if (require("party", quietly = TRUE)) plot(m1)
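As a quick aside (this sketch is not part of the original walk-through), the fitted model can already be used to do what the opening paragraph promised: predict the species of a new record. The measurements below are invented for illustration.

# A hypothetical new flower; the values are made up for this example.
> new.flower <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.4,
+                          Petal.Length = 1.5, Petal.Width = 0.2)

# Predicted class label, and the class probabilities.
> predict(m1, newdata = new.flower)
> predict(m1, newdata = new.flower, type = "probability")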
How do we read the tree plotted above? The attribute Petal.Width is the one that carries the most information, and for this reason it has been selected as the first split criterion. Reading the plot we can see that if Petal.Width is at most 0.6, then 50 of the 150 objects fall into setosa; otherwise, if Petal.Width is greater than 1.7, then 46 of the 150 objects fall into virginica... and so on.

# To get the confusion matrix of the J48 model, type the following code.
> summary(m1)

=== Summary ===

Correctly Classified Instances         147               98      %
Incorrectly Classified Instances         3                2      %
Kappa statistic                          0.97
Mean absolute error                      0.0233
Root mean squared error                  0.108
Relative absolute error                  5.2482 %
Root relative squared error             22.9089 %
Coverage of cases (0.95 level)          98.6667 %
Mean rel. region size (0.95 level)      34      %
Total Number of Instances              150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = setosa
  0 49  1 |  b = versicolor
  0  2 48 |  c = virginica

The confusion matrix tells us the following: all 50 setosa objects are classified correctly; 49 versicolor objects are classified correctly and 1 is misclassified as virginica; 48 virginica objects are classified correctly and 2 are misclassified as versicolor. That accounts for the 147 correctly classified instances (98 %) reported above.
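The same (training-set) confusion matrix can also be rebuilt in plain R; a minimal sketch, assuming the model m1 fitted above:

# Cross-tabulate the true species against the model's predictions on the
# training data; this should reproduce the matrix reported by summary(m1).
> table(Actual = iris$Species, Predicted = predict(m1, newdata = iris))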
Why has the decision tree taken the value 0.6 for Petal.Width as its split criterion?

How a decision tree works internally

Behind a decision tree we find what is called information gain, a concept that measures how much an attribute reduces the uncertainty (entropy) about the target variable when the data are split on it. It gives an idea of the importance of an attribute in a dataset.

# The information gain calculation will answer the question of why the
# algorithm has decided to start with the attribute Petal.Width.
> library(FSelector)
> information.gain(Species ~ ., data = iris)
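To make the concept concrete, here is a minimal sketch (not from the original text, written as a small script rather than console input) that computes entropy by hand and the information gain of a single binary split. The 0.6 threshold is the one chosen by the tree above; note that information.gain() in FSelector discretises numeric attributes in its own way, so the numbers will not match its output exactly.

# Entropy (in bits) of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                 # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain of a binary split: entropy before the split minus the
# weighted entropy of the two resulting branches.
ig.split <- function(data, condition) {
  left  <- data$Species[condition]
  right <- data$Species[!condition]
  entropy(data$Species) -
    (length(left)  / nrow(data)) * entropy(left) -
    (length(right) / nrow(data)) * entropy(right)
}

# Gain of splitting the full iris data on Petal.Width <= 0.6.
ig.split(iris, iris$Petal.Width <= 0.6)

The split at 0.6 separates all 50 setosa flowers perfectly, so by this calculation its gain is high: about 0.92 bits out of the log2(3) ≈ 1.58 bits of initial uncertainty.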
# Let's go further in this study. We take the subset of iris that contains
# only the objects with Petal.Width > 0.6 and compute the information gain
# within this subset.
> subset1.iris <- subset(iris, Petal.Width > 0.6)
> information.gain(Species ~ ., data = subset1.iris)
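Where does the 1.7 threshold come from? Here is a sketch (it reuses the entropy() and ig.split() helpers defined in the earlier sketch, so run that block first) that scans the candidate Petal.Width cut points inside subset1.iris:

# Evaluate the information gain of every candidate threshold on Petal.Width
# within subset1.iris; dropping the maximum value avoids a degenerate split
# with an empty branch.
thresholds <- head(sort(unique(subset1.iris$Petal.Width)), -1)
gains <- sapply(thresholds,
                function(t) ig.split(subset1.iris, subset1.iris$Petal.Width <= t))
thresholds[which.max(gains)]

This scan should return 1.7, matching the second cut in the tree shown earlier.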
Once again Petal.Width is the attribute that carries the most information, and that is why the second split of the tree is also made on Petal.Width.
# The next step is to calculate the information gain of the subset that
# contains only the objects with Petal.Width <= 1.7 (within the previous subset).
> subset2.iris <- subset(subset1.iris, Petal.Width <= 1.7)
> information.gain(Species ~ ., data = subset2.iris)
             attr_importance
Sepal.Length       0.0000000
Sepal.Width        0.0000000
Petal.Length       0.2131704
Petal.Width        0.0000000

This time Petal.Length is the attribute with the highest information gain. In summary, information gain is the mathematical tool that the J48 algorithm uses to decide, at each tree node, which variable best predicts the target variable.

Pruning the decision tree

We will develop the importance of the error-tolerance parameter and the concept of pruning in decision trees.

(still working!)
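As a preview of that upcoming section, here is a hedged sketch of how pruning is usually controlled in RWeka's J48: the relevant Weka options (all listed by WOW("J48")) include the confidence factor C (lower values prune more aggressively), the minimum number of instances per leaf M, and U for a completely unpruned tree. The specific values below are arbitrary choices for illustration.

# List every command-line option that Weka's J48 accepts.
> WOW("J48")

# An unpruned tree versus a more aggressively pruned one (smaller confidence
# factor C and a larger minimum leaf size M); printing the models shows how
# the tree size changes.
> m.unpruned <- J48(Species ~ ., data = iris, control = Weka_control(U = TRUE))
> m.pruned   <- J48(Species ~ ., data = iris, control = Weka_control(C = 0.10, M = 5))
> m.unpruned
> m.pruned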